
[Feature Request] Identity Column #1959

Closed
1 of 5 tasks
felipepessoto opened this issue Aug 3, 2023 · 14 comments
Comments

@felipepessoto
Contributor

felipepessoto commented Aug 3, 2023

Feature request

Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Overview

Identity Column (writer version 6) as defined by https://github.com/delta-io/delta/blob/master/PROTOCOL.md#identity-columns.

Design doc: https://docs.google.com/document/d/1G8Vj6wOxswMx1JklllLoSn-obEpJ-iE_Lhpbd-RfIr4/edit?usp=sharing

PR:

Motivation

This is probably the biggest missing part in Open Source Spark Delta.

Further details

Willingness to contribute

@c27kwan volunteered to work on this feature and posted a design doc here.

@felipepessoto felipepessoto added the enhancement New feature or request label Aug 3, 2023
@felipepessoto
Contributor Author

@dennyglee, @allisonport-db, do you have any update on this? This is probably the most important missing feature in OSS Delta.

Thanks.

@felipepessoto
Contributor Author

@tdas any chance this can be prioritized for next release?

Thanks.

@keen85

keen85 commented Feb 8, 2024

duplicate of #1072?

@felipepessoto
Contributor Author

I think so. But I would update #1072 to be broader. As that request is written, it sounds like the Identity feature is already done and only the DeltaTableBuilder API is missing.

@c27kwan
Contributor

c27kwan commented Mar 26, 2024

I'm interested in working on this!

@c27kwan
Contributor

c27kwan commented Mar 27, 2024

I can't modify the main comment because I'm not a maintainer. Here's the design doc: https://docs.google.com/document/d/1G8Vj6wOxswMx1JklllLoSn-obEpJ-iE_Lhpbd-RfIr4/edit?usp=sharing

@felipepessoto
Contributor Author

@c27kwan that is great.

Have you discussed your intention to contribute with any of the maintainers? I'm asking because this is a big feature, and I want to make sure they aren't working on it internally and are open to accepting your implementation.

Thanks.

@vkorukanti
Collaborator

Hi @felipepessoto, we don't have anyone else working on this feature. I had an offline chat with @c27kwan before assigning the issue to them. Feel free to look at the design doc and post any questions you have.

vkorukanti pushed a commit that referenced this issue Apr 11, 2024
## Description
This PR is part of #1959

In this PR, we introduce the IdentityColumnsTableFeature as test-only so
that we can start developing against it.

Note: we do not yet add writer version 6 support to
properties.defaults.minWriterVersion, because that would enable the table
feature outside of testing.

## How was this patch tested?
Existing tests pass. 

## Does this PR introduce _any_ user-facing changes?
No, this is a test-only change.
andreaschat-db pushed a commit to andreaschat-db/delta that referenced this issue Apr 16, 2024
tdas pushed a commit that referenced this issue Apr 18, 2024
#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959

In this PR, we introduce IdentityColumn.scala, a common file containing
most of the helpers for Identity Columns that are needed to unblock
future PRs.

## How was this patch tested?
This PR commits dead code. Existing tests pass.

## Does this PR introduce _any_ user-facing changes?
No.
@tdas tdas added this to the 3.3.0 milestone Apr 19, 2024
scottsand-db pushed a commit that referenced this issue Apr 25, 2024
#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR is part of #1959

In this PR, we introduce the `GenerateIdentityValues` UDF used for
populating identity column values. The UDF is not yet used by Delta in
this PR.

`GenerateIdentityValues` is a simple non-deterministic UDF that keeps a
counter seeded with the user-specified `start` and `step`. Each task
counts in increments of `step * numPartitions`, so value generation can
be parallelized across tasks without collisions.
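The counting scheme described above can be sketched as follows (a minimal illustration with hypothetical names, not the actual Delta UDF; it assumes each task knows its partition id and the total number of partitions):

```python
def identity_values(start: int, step: int,
                    partition_id: int, num_partitions: int, n: int) -> list:
    """Generate n identity values for one task/partition.

    Each task starts at a distinct offset (start + partition_id * step)
    and advances by step * num_partitions, so tasks never collide."""
    value = start + partition_id * step
    out = []
    for _ in range(n):
        out.append(value)
        value += step * num_partitions
    return out
```

With `start = 1`, `step = 1`, and two partitions, partition 0 produces 1, 3, 5, … while partition 1 produces 2, 4, 6, …, covering the sequence without duplicates.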

## How was this patch tested?
New test suite and unit tests for the UDF.

## Does this PR introduce _any_ user-facing changes?
No.
allisonport-db pushed a commit that referenced this issue Apr 30, 2024
#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959
* We introduce the `generatedAlwaysAsIdentity` and
`generatedByDefaultAsIdentity` APIs in DeltaColumnBuilder so that users
can create Delta tables with identity columns.
* We guard the creation of identity column tables behind a feature flag
until development is complete.

## How was this patch tested?
New tests. 

## Does this PR introduce _any_ user-facing changes?

Yes, we introduce the `generatedAlwaysAsIdentity` and
`generatedByDefaultAsIdentity` interfaces in DeltaColumnBuilder for
creating identity columns.
**Interfaces**
```
def generatedAlwaysAsIdentity(): DeltaColumnBuilder
def generatedAlwaysAsIdentity(start: Long, step: Long): DeltaColumnBuilder
def generatedByDefaultAsIdentity(): DeltaColumnBuilder
def generatedByDefaultAsIdentity(start: Long, step: Long): DeltaColumnBuilder
```
When the `start` and `step` parameters are not specified, they default
to `1L`. `generatedByDefaultAsIdentity` allows users to insert values
into the column, while a column specified with
`generatedAlwaysAsIdentity` can only ever have system-generated values.

**Example Usage**
```
// Creates a Delta identity column.
io.delta.tables.DeltaTable.columnBuilder(spark, "id")
      .dataType(LongType)
      .generatedAlwaysAsIdentity()
// Which is equivalent to the call
io.delta.tables.DeltaTable.columnBuilder(spark, "id")
      .dataType(LongType)
      .generatedAlwaysAsIdentity(start = 1L, step = 1L)
```
@c27kwan
Contributor

c27kwan commented Jul 12, 2024

Sorry for the lack of updates in the past 2.5 months -- I was on vacation for a month and haven't had the opportunity to return to this. I've been talking to @zhipengmao-db, and he volunteered to pick up the remainder of the implementation so we can make progress again. 🎉

allisonport-db pushed a commit that referenced this issue Jul 19, 2024

#### Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959

In this PR, we enable basic ingestion for Identity Columns.
* We use a custom UDF, `GenerateIdentityValues`, to generate values when
they are not supplied by the user.
* We introduce classes to help update and track the high watermark of
identity columns.
* We also do some cleanup and improve the readability of
`ColumnWithDefaultExprUtils`.
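The watermark-tracking idea can be illustrated with a minimal sketch (hypothetical names; the actual Delta classes are stats trackers wired into the write path):

```python
class HighWaterMarkTracker:
    """Track the largest generated identity value seen during a write
    (for an ascending column), so the table's high watermark can be
    updated at commit time."""

    def __init__(self, current_high_watermark=None):
        # Start from the table's existing high watermark, if any.
        self.high_watermark = current_high_watermark

    def observe(self, value: int) -> None:
        # Only ever move the watermark upward.
        if self.high_watermark is None or value > self.high_watermark:
            self.high_watermark = value
```

At commit time, the tracked maximum would be persisted as the column's new high watermark.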

Note: this does NOT yet enable ingestion with MERGE INTO. That will come
in a follow-up PR, to keep this one easier to review.

## How was this patch tested?
We introduce a new test suite IdentityColumnIngestionSuite.

## Does this PR introduce _any_ user-facing changes?
No.
scottsand-db pushed a commit that referenced this issue Aug 8, 2024

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR is part of #1959

In this PR, we extend the addColumn interface in DeltaTableBuilder to
allow creating identity columns.

Resolves #1072

## How was this patch tested?


New tests.

## Does this PR introduce _any_ user-facing changes?

We update the arguments of the addColumn method:
- Support a new type for the `generatedAlwaysAs` parameter. Users can
pass an `IdentityGenerator` as `generatedAlwaysAs` to add an identity
column that is GENERATED ALWAYS.

- Add a new parameter, `generatedByDefaultAs`. Users can pass an
`IdentityGenerator` as `generatedByDefaultAs` to add an identity column
that is GENERATED BY DEFAULT.

- Users can optionally pass `start` (default = 1) and `step` (default
= 1) values when constructing the `IdentityGenerator` object; these
specify the start and step used to generate identity values.


Interface
```
def addColumn(
    self,
    colName: str,
    dataType: Union[str, DataType],
    nullable: bool = True,
    generatedAlwaysAs: Optional[Union[str, IdentityGenerator]] = None,
    generatedByDefaultAs: Optional[IdentityGenerator] = None,
    comment: Optional[str] = None,
) -> "DeltaTableBuilder"
```
Example Usage

```
DeltaTable.create()
    .tableName("tableName")
    .addColumn("id", dataType=LongType(), generatedAlwaysAs=IdentityGenerator())
    .execute()

DeltaTable.create()
    .tableName("tableName")
    .addColumn("id", dataType=LongType(), generatedAlwaysAs=IdentityGenerator(start=1, step=1))
    .execute()

DeltaTable.create()
    .tableName("tableName")
    .addColumn("id", dataType=LongType(), generatedByDefaultAs=IdentityGenerator())
    .execute()

DeltaTable.create()
    .tableName("tableName")
    .addColumn("id", dataType=LongType(), generatedByDefaultAs=IdentityGenerator(start=1, step=1))
    .execute()
```

---------

Co-authored-by: Carmen Kwan <[email protected]>
scottsand-db pushed a commit that referenced this issue Aug 13, 2024

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR is part of #1959.
It relaxes metadata conflict detection for the identity column SYNC high
water mark operation. When the winning transaction contains an identity
column metadata change and the current transaction contains no metadata
change, we treat the current transaction as having no metadata conflict.
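The relaxed rule can be sketched as the following decision function (illustrative pseudologic with hypothetical names, not the actual Delta conflict checker):

```python
def metadata_conflict(winning_has_metadata_change: bool,
                      winning_is_identity_only_change: bool,
                      current_has_metadata_change: bool) -> bool:
    """Return True if the current transaction must fail with a metadata conflict.

    Previously, any winning metadata change conflicted. Now, a winning
    change that only SYNCs an identity high water mark is tolerated when
    the current transaction itself made no metadata change."""
    if not winning_has_metadata_change:
        return False
    if winning_is_identity_only_change and not current_has_metadata_change:
        return False  # the relaxed case described in this PR
    return True
```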

## How was this patch tested?

A new test suite.
## Does this PR introduce _any_ user-facing changes?

No.
scottsand-db pushed a commit that referenced this issue Aug 14, 2024

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959.
It adds more tests for Identity Columns, covering:
- logging of identity column properties and stats
- that reading a table does not expose identity column properties
- compatibility with tables on older protocols
- identity value generation starting at the range boundaries of the long
data type
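Generation near the boundaries of the long range has to guard against 64-bit overflow; a minimal sketch of such a guard (illustrative only, not the tested implementation) might look like:

```python
# Signed 64-bit long range, as used by Delta identity columns.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def next_identity_value(current: int, step: int):
    """Return the next identity value, or None if it would overflow
    a signed 64-bit long (Python ints don't overflow, so the range
    can be checked directly)."""
    nxt = current + step
    if nxt < LONG_MIN or nxt > LONG_MAX:
        return None
    return nxt
```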


## How was this patch tested?
This is a test-only change.

## Does this PR introduce _any_ user-facing changes?

No.
allisonport-db pushed a commit that referenced this issue Aug 15, 2024

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description


This PR makes `MergeIntoCommandBase` extend
`SupportsNonDeterministicExpression`, a trait in Spark that logical
plans can extend to declare that they allow non-deterministic
expressions and should pass the CheckAnalysis rule.

`MergeIntoCommandBase` extends `SupportsNonDeterministicExpression` to
check that all conditions in the MERGE command are deterministic.

This is harmless and allows more flexible usage of MERGE. For example,
we use a non-deterministic UDF to generate values for identity columns,
so non-deterministic expressions must be allowed in the updated/inserted
column values of MERGE statements in order to support MERGE on target
tables with identity columns. So this PR is part of #1959.



## How was this patch tested?
New test cases.

## Does this PR introduce _any_ user-facing changes?


Yes.
We are changing the behavior to allow non-deterministic expressions in
updated/inserted column values of merge statements. We still don't allow
non-deterministic expressions in conditions of merge statements.

For example, the following MERGE statement, which adds random noise to
the inserted/updated value, was previously not allowed:

```
MERGE INTO target USING source
ON target.key = source.key
WHEN MATCHED THEN UPDATE SET target.value = source.value + rand()
```

Now we allow this, as it can be helpful for data privacy: the actual
data is not disclosed, while aggregate properties (e.g. mean values)
are preserved.
allisonport-db pushed a commit that referenced this issue Aug 19, 2024

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

This PR is part of #1959.
The change refactors `IdentityColumnTestUtils` to reuse
`createTableWithIdColAndIntValueCol` for table creation and to unify the
column names used in identity column tests.
## How was this patch tested?

This is a test-only change.
## Does this PR introduce _any_ user-facing changes?

No.
vkorukanti pushed a commit that referenced this issue Aug 22, 2024
…3594)

## Description
This PR is part of #1959.
`IdentityColumnSuite` is flaky because the same table name,
'identity_test', is reused across tests. This PR generates all table
names in identity-column-related suites using UUIDs to make them unique.

## How was this patch tested?
This is a test-only change.
vkorukanti pushed a commit that referenced this issue Aug 23, 2024
## Description
This PR is part of #1959.

It adds support for the MERGE command to provide system-generated
IDENTITY values in INSERT and UPDATE actions. Unlike INSERT, where the
identity columns that need writing are collected in
`WriteIntoDelta.writeAndReturnCommitData` just before writing in
`TransactionalWrite.writeFiles`, MERGE expressions are resolved earlier.

Specifically, we resolve the table's identity columns to track for high
water marks in `PreprocessTableMerge.apply`. The column set is passed to
`OptimisticTransaction` and written in `TransactionalWrite.writeFiles`.

## How was this patch tested?
New test suite `IdentityColumnDMLScalaSuite`.
vkorukanti pushed a commit that referenced this issue Aug 26, 2024
## Description
This PR is part of #1959.
We have implemented identity column support and all tests pass. We can
now move the identity column feature out of developer mode.

## How was this patch tested?
Existing tests.
longvu-db pushed a commit to longvu-db/delta that referenced this issue Aug 28, 2024
longvu-db pushed a commit to longvu-db/delta that referenced this issue Aug 28, 2024
@tigerhawkvok

Very exciting! Will this make it to the 3.2.1 release?

@zhipengmao-db
Contributor

Very exciting! Will this make it to the 3.2.1 release?

It will be in 3.3 release.

@felipepessoto
Contributor Author

@zhipengmao-db do you know the ETA to release 3.3.0?

If you are not planning additional changes, should we close this as done?

Thanks

@zhipengmao-db
Contributor

@felipepessoto The ETA for 3.3.0 is 11/20. We could close it as done already. Thanks!

@felipepessoto
Contributor Author

Completed #3598

allisonport-db pushed a commit that referenced this issue Dec 9, 2024
#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR is part of #1959

In this PR, we flip the SQLConf that guards the creation of Identity
Column from false to true. Without this, we cannot create identity
columns in Delta Spark!

## How was this patch tested?

Existing tests pass.
## Does this PR introduce _any_ user-facing changes?

Yes, it enables the creation of Identity Columns.
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue Dec 9, 2024
(cherry picked from commit 7224677)
allisonport-db added a commit that referenced this issue Dec 9, 2024
(cherry picked from commit 7224677)

Co-authored-by: Carmen Kwan <[email protected]>
maltevelin added a commit to maltevelin/delta that referenced this issue Dec 28, 2024
… sorted on Z-order value.

Add configuration property to toggle sorting output on Z-order value.

Signed-off-by: Malte Velin <[email protected]>

commit 4dbadbbf8ddd0a12273ac9521d61bc89196dc80d
Author: Carmen Kwan <[email protected]>
Date:   Thu Dec 19 22:39:44 2024 +0100

    [Spark] Make Identity Column High Water Mark updates consistent (#3989)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    Currently:
    - When we do a MERGE, we will always call `setTrackHighWaterMarks` on
    the transaction. This will have an effect if there is an INSERT clause
    in the MERGE.
    - If we `setTrackHighWaterMarks`, we collect the max/min of the column
    using `DeltaIdentityColumnStatsTracker`. This stats tracker is only
    invoked on files that are written/rewritten. These min/max values are
    compared with the existing high watermark. If the high watermark
    doesn't exist, we adopt the largest max (or, for a descending column,
    the lowest min) as the high watermark, without checking it against the
    declared start value of the identity column.
    - If an identity column has not generated a value yet, the high
    watermark is None and isn't stored in the table. This is the case for
    GENERATED ALWAYS AS IDENTITY tables when the table is empty, and for
    GENERATED BY DEFAULT AS IDENTITY tables when the identity column only
    has user-inserted values.
    - If you run a MERGE UPSERT that only ends up updating values in a
    GENERATED BY DEFAULT table that doesn't have a high watermark yet, we
    will write a new high watermark that is the highest value in the
    updated file, which may be lower than the start value specified for
    the identity column.

    Proposal:
    - This PR makes all high water mark updates go through the same
    validation function by default. The high watermark is not updated if
    doing so would violate the start value or the existing high watermark.
    The exception is when the table already has a corrupted high water
    mark.
    - This does NOT prevent the scenario where we automatically set the
    high watermark for a GENERATED BY DEFAULT column based on
    user-inserted values, as long as those values respect the start.
    - Previously, we did not round the high water mark on the
    `updateSchema` path. This seems erroneous, as the min/max values can
    be user-inserted. We fix that in this PR.
    - Previously, on SYNC IDENTITY, we did not validate against the case
    where the computed max is below the existing high water mark. Now we
    check this invariant and block such updates by default; a SQLConf has
    been introduced to allow lowering the high water mark if the user
    wants.
    - We add logging to catch bad high water marks.
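
    The validation described above can be sketched roughly as follows.
    This is a simplified, hypothetical model assuming a positive
    identity-column step; names and signatures do not match the actual
    Delta code.

```python
# Hypothetical sketch of the high water mark validation described above;
# names are illustrative and do not match the Delta implementation.
from typing import Optional

def validated_high_water_mark(
    candidate: int,
    start: int,
    existing: Optional[int],
    allow_lowering: bool = False,  # stand-in for the opt-in SQLConf
) -> Optional[int]:
    """Return the high water mark to store, rejecting invalid updates."""
    existing_is_corrupted = existing is not None and existing < start
    # Reject candidates below the start value, unless the table already
    # has a corrupted high water mark.
    if candidate < start and not existing_is_corrupted:
        return existing
    # Never lower an existing high water mark unless explicitly allowed
    # (e.g. SYNC IDENTITY with the opt-in config).
    if existing is not None and candidate < existing and not allow_lowering:
        return existing
    return candidate
```

    Under this model, a MERGE that only updates rows of a GENERATED BY
    DEFAULT table whose user-inserted values sit below the start no longer
    installs a high watermark below the configured start.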

    ## How was this patch tested?
    New tests that were failing prior to this change.

    ## Does this PR introduce _any_ user-facing changes?
    No

commit ae4982ce267052c526fef638a88ce86f7d85e583
Author: Allison Portis <[email protected]>
Date:   Thu Dec 19 11:42:13 2024 -0800

    [Kernel] Fix flaky test for the Timer class for metrics (#3946)

    <!--
    Thanks for sending a pull request!  Here are some tips for you:
    1. If this is your first time, please read our contributor guidelines:
    https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
    2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
    Your PR title ...'.
      3. Be sure to keep the PR description updated to reflect all changes.
      4. Please write your PR title to summarize what this PR proposes.
    5. If possible, provide a concise example to reproduce the issue for a
    faster review.
    6. If applicable, include the corresponding issue number in the PR title
    and link it in the body.
    -->

    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [X] Kernel
    - [ ] Other (fill in here)

    ## Description

    Fixes a flaky test.

    ## How was this patch tested?

    Unit test fix.

    ## Does this PR introduce _any_ user-facing changes?

    No.

commit da58cad55741313852005cf2d84a7f2e0280bf2b
Author: Allison Portis <[email protected]>
Date:   Wed Dec 18 19:35:07 2024 -0800

    [Kernel] Remove CC code from SnapshotManager (#3986)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    We are re-thinking the design of the Coordinated Commits table feature
    and much of this snapshot code will be refactored. Remove it for now as
    it greatly complicates our snapshot construction, and hopefully we can
    be more intentional in our code design/organization when re-implementing
    it.

    https://github.com/delta-io/delta/commit/fc81d1247d66cc32e454e985f0cfc81447f897b6
    already removed the public interfaces and made it such that
    `SnapshotImpl::getTableCommitCoordinatorClientHandlerOpt` never returned
    a handler.

    ## How was this patch tested?

    Existing tests should suffice.

    ## Does this PR introduce _any_ user-facing changes?

    No.

commit 34f02d8858faf2d74465a40c22edb548e0626c05
Author: Cuong Nguyen <[email protected]>
Date:   Wed Dec 18 14:46:52 2024 -0800

    [Spark] Avoid unnecessarily calling update and some minor clean up in tests (#3965)

commit 1cd6fed7987ad15e7d8b2d593c4579ce865f4cbe
Author: Andreas Chatzistergiou <[email protected]>
Date:   Wed Dec 18 23:01:54 2024 +0100

    [Spark] Drop feature support in DeltaTable Scala/Python APIs (#3952)


    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    <!--
    - Describe what this PR changes.
    - Describe why we need the change.

    If this PR resolves an issue be sure to include "Resolves #XXX" to
    correctly link and close the issue upon merge.
    -->

    This PR adds drop feature support in the DeltaTable API for both scala
    and python APIs.

    ## How was this patch tested?

    <!--
    If tests were added, say they were added here. Please make sure to test
    the changes thoroughly including negative and positive cases if
    possible.
    If the changes were tested in any way other than unit tests, please
    clarify how you tested step by step (ideally copy and paste-able, so
    that other reviewers can test and check, and descendants can verify in
    the future).
    If the changes were not tested, please explain why.
    -->
    Added UTs.

    ## Does this PR introduce _any_ user-facing changes?

    Yes. See description.

commit baa55187fd32bb4b0f97fd1d2305db4e0dd7d44e
Author: Carmen Kwan <[email protected]>
Date:   Wed Dec 18 20:21:45 2024 +0100

    [Spark][TEST-ONLY] More tests updating Identity Column high water mark (#3985)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    Test-only PR. Adds one more test for updating the identity column high
    water mark when it is not already available.

    ## How was this patch tested?

    Test-only PR.

    ## Does this PR introduce _any_ user-facing changes?
    No.

commit f577290c5dec0b76130397cc0a050f9030b12035
Author: Rahul Shivu Mahadev <[email protected]>
Date:   Tue Dec 17 13:55:03 2024 -0800

    [Spark] Fix auto-conflict handling logic in Optimize to handle DVs (#3981)


    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    Bug: There was a long-standing bug where the custom conflict detection
    logic in OPTIMIZE did not catch concurrent transactions that add DVs,
    e.g. AddFile(path='a') -> AddFile(path='a', dv='dv1').

    Fix: Updated the conflict resolution to key on the composite
    (path, dvId) instead of just the path.
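
    A minimal sketch of the difference between the two keying schemes
    (hypothetical data shapes, not the actual OPTIMIZE conflict checker):

```python
# Hypothetical sketch: detecting concurrently (re)written files by key.
# Data shapes are illustrative, not Delta's actual AddFile actions.

def changed_files(before_adds, after_adds, key):
    """Files whose identity under `key` appears only after the concurrent
    commit, i.e. files that were (re)written concurrently."""
    before = {key(f) for f in before_adds}
    return [f for f in after_adds if key(f) not in before]

path_key = lambda f: f["path"]                        # old behavior
composite_key = lambda f: (f["path"], f.get("dvId"))  # fixed behavior

before = [{"path": "a"}]                # file OPTIMIZE read
after = [{"path": "a", "dvId": "dv1"}]  # concurrent DV rewrite

print(changed_files(before, after, path_key))       # [] -- DV change missed
print(changed_files(before, after, composite_key))  # the rewritten file
```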

    ## How was this patch tested?
    - unit tests

    ## Does this PR introduce _any_ user-facing changes?
    no

commit fc81d1247d66cc32e454e985f0cfc81447f897b6
Author: Scott Sandre <[email protected]>
Date:   Fri Dec 13 11:14:09 2024 -0800

    [Kernel] Remove Coordinated Commits from public API (#3938)

    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [X] Kernel
    - [ ] Other (fill in here)

    ## Description

    We are re-thinking the design of the Coordinated Commits table feature
    (currently still in RFC). Thus, we should remove it from the public
    Kernel API for the Delta 3.3 release.

    To summarize the changes of this PR

    - I remove `getCommitCoordinatorClientHandler` from the `Engine`
    interface
    - I move various previously `public` CC interfaces and classes to be
    `internal` now
    - `SnapshotImpl::getTableCommitCoordinatorClientHandlerOpt` is hardcoded
    to return an empty optional
    - Delete failing test suites and inapplicable utils

    ## How was this patch tested?

    Existing CI tests.

    ## Does this PR introduce _any_ user-facing changes?

    We remove coordinated commits from the public kernel API.

commit 2f5673e0432962cb834e103dbc79ce8aea9a4e37
Author: Thang Long Vu <[email protected]>
Date:   Fri Dec 13 01:09:27 2024 +0100

    [Docs] Update documentation for Row Tracking to include Row Tracking Backfill introduced in Delta 3.3 (#3968)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [X] Other (Docs)

    ## Description
    - Update the [Row Tracking
    docs](https://docs.delta.io/latest/delta-row-tracking.html#-limitations).
    Previously, the limitations section stated that Row Tracking cannot be
    enabled on non-empty tables. With the [Row Tracking Backfill
    release](https://github.com/delta-io/delta/releases/) in Delta 3.3,
    Row Tracking can now be enabled on non-empty tables.
    - Explicitly mention that Row Tracking can be enabled on existing
    tables from Delta 3.3 onwards.


    ## How was this patch tested?
    N/A

    ## Does this PR introduce _any_ user-facing changes?
    N/A

commit 259751b51d73831fd6222d98178091b037ef0d7a
Author: Thang Long Vu <[email protected]>
Date:   Fri Dec 13 01:09:17 2024 +0100

    [Docs][3.3] Update documentation for Row Tracking to include Row Tracking Backfill introduced in Delta 3.3 (#3969)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [X] Other (Docs)

    ## Description
    - Cherry-pick https://github.com/delta-io/delta/pull/3968 into Delta
    3.3.

    ## How was this patch tested?
    N/A

    ## Does this PR introduce _any_ user-facing changes?
    N/A

commit d0be1d7b6c376b5d7cf7fba5daf039a2638cd7b9
Author: Zhipeng Mao <[email protected]>
Date:   Thu Dec 12 20:01:50 2024 +0100

    Add identity column doc (#3935)


    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    It adds documentation for identity columns.

    ## How was this patch tested?
    Doc change.

    ## Does this PR introduce _any_ user-facing changes?
    No.

commit fdf887d6104582955ad75d3f7297b36d249d91d1
Author: Zhipeng Mao <[email protected]>
Date:   Thu Dec 12 19:59:20 2024 +0100

    [Spark] Add test for Identity Column merge metadata conflict (#3971)


    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    It adds a test for identity columns verifying that a merge is aborted
    if the high water mark changes after analysis and before execution.

    ## How was this patch tested?
    Test-only.

    ## Does this PR introduce _any_ user-facing changes?
    No.

commit 58f94afafd16a19644fef7130a46cb8a93d18ec8
Author: Dhruv Arya <[email protected]>
Date:   Thu Dec 12 10:58:13 2024 -0800

    [PROTOCOL][Version Checksum] Remove references to Java-specific Int.MaxValue and Long.MaxValue (#3961)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [X] Other (PROTOCOL)

    ## Description

    Fixes Version Checksum spec changes introduced in
    https://github.com/delta-io/delta/pull/3777. The last two bin bounds
    of the Deleted File Count Histogram are currently defined in terms of
    Java's Int.MaxValue and Long.MaxValue. This PR makes the spec
    language-independent by inlining the actual values of these bounds.
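
    For reference, the literal values that replace the Java constants are:

```python
# The two Java constants whose values the spec now inlines as literals.
INT_MAX = 2**31 - 1   # Java Int.MaxValue
LONG_MAX = 2**63 - 1  # Java Long.MaxValue

print(INT_MAX)   # 2147483647
print(LONG_MAX)  # 9223372036854775807
```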

    ## How was this patch tested?

    N/A

    ## Does this PR introduce _any_ user-facing changes?

    No

commit 05cdd3cd4752dbb826f6bcfa4ba1d46ef1b246ee
Author: Anton Erofeev <[email protected]>
Date:   Thu Dec 12 17:20:08 2024 +0300

    [Kernel] Fix incorrect load protocol and metadata time log (#3964)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [X] Kernel
    - [ ] Other (fill in here)

    ## Description
    Resolves #3948
    Fixes an incorrect log message for protocol and metadata load time.

    ## How was this patch tested?
    Unit tests

    ## Does this PR introduce _any_ user-facing changes?
    No

commit 19d89f6ba0803b0f4c1826a521c27ababdd50864
Author: Jiaheng Tang <[email protected]>
Date:   Wed Dec 11 18:28:25 2024 -0800

    Update liquid clustering docs (#3958)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [x] Other (docs)

    ## Description

    Add docs for OPTIMIZE FULL, in-place migration, and create table from
    external location.
    ## How was this patch tested?

    ![127 0 0 1_8000_delta-clustering html
    (6)](https://github.com/user-attachments/assets/4148e5e0-3aad-403a-bb91-641f08a500b7)

    ## Does this PR introduce _any_ user-facing changes?

    No

commit 30d74a6b8d5a305ce4a6ab625f69d0b9b93e6f92
Author: Carmen Kwan <[email protected]>
Date:   Wed Dec 11 21:25:09 2024 +0100

    [Spark][TEST-ONLY] Identity Column replace tests for partitioned tables (#3960)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    Quick follow-up to https://github.com/delta-io/delta/pull/3937:
    expand the tests to cover partitioned tables too.

    ## How was this patch tested?
    Test only change. New tests and existing tests pass.

    ## Does this PR introduce _any_ user-facing changes?
    No.

commit 10972577202783720f5e61925ee7d7c6fc204a78
Author: Fred Storage Liu <[email protected]>
Date:   Wed Dec 11 11:50:01 2024 -0800

    Update Delta uniform documentation to include ALTER enabling (#3927)


    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [x] Other (fill in here)

    ## Description

    Update Delta uniform documentation to include ALTER enabling

    ## How was this patch tested?


    ## Does this PR introduce _any_ user-facing changes?


commit 57d0e3b42f60d133db9c4a81a432804803d9955b
Author: Fred Storage Liu <[email protected]>
Date:   Wed Dec 11 07:37:09 2024 -0800

    Expose Delta Uniform write commit size in logs (#3898)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    Expose Delta Uniform write commit size in logs

commit 407d4c99b437636cde2fcc5c52039bb19510bb64
Author: Kaiqi Jin <[email protected]>
Date:   Wed Dec 11 07:36:30 2024 -0800

    Use default partition value during uniform conversion when partition value is missing (#3924)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    <!--
    - Describe what this PR changes.
    - Describe why we need the change.

    If this PR resolves an issue be sure to include "Resolves #XXX" to
    correctly link and close the issue upon merge.
    -->

    Previously, a missing <key, value> pair in the partitionValues map was
    not handled correctly, resulting in a Delta -> Iceberg conversion
    failure. To fix this, this PR uses the default partition value for
    missing entries in the partitionValues map.

    ## How was this patch tested?

    Existing tests

    ## Does this PR introduce _any_ user-facing changes?

    No

commit e3a613dfa550defb86a05a57d9fef52daa86e8da
Author: Cuong Nguyen <[email protected]>
Date:   Tue Dec 10 15:35:16 2024 -0800

    [Spark] Pass catalog table to DeltaLog API call sites, part 3 (#3949)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    Fix a number of code paths where we want to pass the catalog table to
    the commit coordinator client via the DeltaLog API.

    ## How was this patch tested?
    Unit tests

    ## Does this PR introduce _any_ user-facing changes?
    No.

commit b39d5b328ffa8e1071fc6aab78cfb345c8f2d8f7
Author: Fred Storage Liu <[email protected]>
Date:   Tue Dec 10 09:33:33 2024 -0800

    Add sizeInBytes API for Delta clone (#3942)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    Add sizeInBytes API for Delta clone

    ## How was this patch tested?

    existing UT

    ## Does this PR introduce _any_ user-facing changes?

commit 61ac84d4579fdf99465861991f1a0fb697fa0325
Author: Cuong Nguyen <[email protected]>
Date:   Tue Dec 10 09:07:26 2024 -0800

    [SPARK] Clean up vacuum-related code (#3931)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    This PR cleans up a few things:
    + In the Scala API, use `VacuumTableCommand` instead of calling
    `VacuumCommand.gc` directly.
    + Pass `DeltaTableV2` to `VacuumCommand.gc` instead of `DeltaLog`.
    + Use `DeltaTableV2` in tests instead of `DeltaLog`.

    ## How was this patch tested?
    Unit tests

    ## Does this PR introduce _any_ user-facing changes?

    No

commit 79e518ba81505384695ec4a71ba0013eeb860646
Author: Johan Lasperas <[email protected]>
Date:   Tue Dec 10 17:50:10 2024 +0100

    [Spark] Allow missing fields with implicit casting during streaming write (#3822)

    ## Description
    Follow-up on https://github.com/delta-io/delta/pull/3443 that introduced
    implicit casting during streaming write to delta tables.

    The feature was shipped disabled due to a regression found in testing
    where writing data with missing struct fields started being rejected.
    Streaming writes are one of the few insert paths that allow missing
    struct fields.

    This change allows configuring the casting behavior used in MERGE,
    UPDATE, and streaming writes with respect to missing struct fields.

    ## How was this patch tested?
    Extensive tests were added in
    https://github.com/delta-io/delta/pull/3762 in preparation for this
    change, covering all inserts (SQL, dataframe, append/overwrite, ..):
    - Missing top-level columns and nested struct fields.
    - Extra top-level columns and nested struct fields with schema
    evolution.
    - Position vs. name based resolution for top-level columns and nested
    struct fields.
    In particular, the goal is to ensure that enabling implicit casting in
    streaming writes doesn't cause any other unwanted behavior change.

    ## This PR introduces the following *user-facing* changes
    From the initial PR: https://github.com/delta-io/delta/pull/3443

    Previously, writing to a Delta sink using a type that doesn't match the
    column type in the Delta table failed with
    `DELTA_FAILED_TO_MERGE_FIELDS`:
    ```
    spark.readStream
        .table("delta_source")
        # Column 'a' has type INT in 'delta_sink'.
        .select(col("a").cast("long").alias("a"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", "<location>")
        .toTable("delta_sink")

    DeltaAnalysisException: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'a' and 'a'
    ```
    With this change, writing to the sink now succeeds and data is cast from
    `LONG` to `INT`. If any value overflows, the stream fails with (assuming
    default `storeAssignmentPolicy=ANSI`):
    ```
    SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to assign a value of 'LONG' type to the 'INT' type column or variable 'a' due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead."
    ```

commit 8f344098e0601d04f9bd3fa25306569b3d106e06
Author: jackierwzhang <[email protected]>
Date:   Tue Dec 10 08:49:46 2024 -0800

    Fix schema tracking location check condition against checkpoint location (#3939)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    This PR introduces a more robust way to check whether the schema
    tracking location is under the checkpoint location, one that works with
    arbitrary file systems and paths.

    ## How was this patch tested?
    New UT.

    ## Does this PR introduce _any_ user-facing changes?
    No

commit fdc2c7f7c7367a50de8734cc9b4520cecc5aeadc
Author: Rajesh Parangi <[email protected]>
Date:   Mon Dec 9 17:50:24 2024 -0800

    Add Documentation for Vacuum LITE (#3932)

    #### Which Delta project/connector is this regarding?

    - [X] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    Adds Documentation for Vacuum LITE

    ## How was this patch tested?
    N/A

    ## Does this PR introduce _any_ user-facing changes?

    NO

commit 00fa0ae8a0d2ec9f0e52cbe8ab28274a80e6272b
Author: Carmen Kwan <[email protected]>
Date:   Mon Dec 9 21:02:47 2024 +0100

    [Spark][TEST-ONLY] Identity Column high watermark and replace tests (#3937)

    #### Which Delta project/connector is this regarding?
    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    In this PR, we expand the test coverage for identity columns.
    Specifically, we add more assertions for the high watermarks and cover
    more test scenarios with replacing tables.

    ## How was this patch tested?

    Test-only PR. We expand test coverage.

    ## Does this PR introduce _any_ user-facing changes?
    No.

commit 7224677acda11eb21103112c8b636963874e9071
Author: Carmen Kwan <[email protected]>
Date:   Mon Dec 9 20:32:51 2024 +0100

    [Spark] Enable Identity column SQLConf (#3936)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description
    This PR is part of https://github.com/delta-io/delta/issues/1959

    In this PR, we flip the SQLConf that guards the creation of Identity
    Column from false to true. Without this, we cannot create identity
    columns in Delta Spark!

    ## How was this patch tested?

    Existing tests pass.
    ## Does this PR introduce _any_ user-facing changes?

    Yes, it enables the creation of Identity Columns.
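
    As an illustration of what flipping this conf unlocks, here is a
    minimal sketch of an identity column declaration, following the syntax
    in the Delta protocol and the design doc linked from #1959 (the table
    and column names below are hypothetical, not from this PR):

    ```sql
    -- Delta generates 'id' values automatically on insert.
    CREATE TABLE events (
      id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
      payload STRING
    ) USING delta;

    -- Only 'payload' is supplied; 'id' is filled in by Delta.
    INSERT INTO events (payload) VALUES ('first'), ('second');
    ```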

commit bb3956f0c8e290725d0b6ab02981d2c5ad462c12
Author: Andreas Chatzistergiou <[email protected]>
Date:   Fri Dec 6 14:51:31 2024 +0100

    [Spark] CheckpointProtectionTableFeature base implementation (#3926)

    #### Which Delta project/connector is this regarding?

    - [x] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    Base implementation of `CheckpointProtectionTableFeature`. Writers are
    only allowed to clean up metadata as long as they can truncate history
    up to `requireCheckpointProtectionBeforeVersion` in one go.

    As a second step, the feature can be improved by allowing metadata
    cleanup even when the invariant above does not hold. Metadata cleanup
    could be allowed if the client verifies it supports all writer features
    contained in the history it intends to truncate. This improvement is
    important for providing GDPR compliance.

    ## How was this patch tested?

    Added tests in `DeltaRetentionSuite`.

    ## Does this PR introduce _any_ user-facing changes?

    No.

commit da162a097a25524fc97334f47a180257cb487789
Author: Dhruv Arya <[email protected]>
Date:   Thu Dec 5 17:16:14 2024 -0800

    [Protocol] Add a version checksum to the specification (#3777)

    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [X] Other (PROTOCOL)

    ## Description

    Adds the concept of a Version Checksum to the protocol. This version
    checksum can be emitted on every commit and stores important bits of
    information about the snapshot which can later be used to validate the
    integrity of the delta log.

    ## How was this patch tested?

    N/A

    ## Does this PR introduce _any_ user-facing changes?

    N/A

commit 8fb17a0160a937307d6fb9276a77403aeb7efc63
Author: Dhruv Arya <[email protected]>
Date:   Thu Dec 5 16:58:21 2024 -0800

    [Spark][Version Checksum] Read Protocol, Metadata, and ICT directly from the Checksum during Snapshot construction (#3920)

    #### Which Delta project/connector is this regarding?

    - [X] Spark
    - [ ] Standalone
    - [ ] Flink
    - [ ] Kernel
    - [ ] Other (fill in here)

    ## Description

    Stacked over https://github.com/delta-io/delta/pull/3907.
    This PR makes the Checksum (if available) the source of truth for
    Protocol, Metadata, ICT during snapshot construction. This helps us
    avoid a Spark query and improves performance.

    ## How was this patch tested?

    Added some test cases to existing suites

    ## Does this PR introduce _any_ user-facing changes?

    No

commit 1ee278ae23bc08a25c448524264622ba106686cd
Author: Allison Portis <[email protected]>
Date:   Wed Dec 4 15:40:15 2024 -0800

    [Kernel][Metrics][PR#4] Adds Counter class (#3906)

    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [X] Kernel
    - [ ] Other (fill in here)

    ## Description

    Adds a `Counter` that will be used by following PRs to count metrics.

    ## How was this patch tested?

    Adds a unit test.

    ## Does this PR introduce _any_ user-facing changes?

    No.

commit 8cd614107468389a117362f708a540c0263c01e7
Author: Qiyuan Dong <[email protected]>
Date:   Wed Dec 4 23:21:34 2024 +0100

    [Kernel] Add JsonMetadataDomain and RowTrackingMetadataDomain (#3893)

    #### Which Delta project/connector is this regarding?

    - [ ] Spark
    - [ ] Standalone
    - [ ] Flink
    - [x] Kernel
    - [ ] Other (fill in here)

    ## Description

    This PR adds the following to Delta Kernel Java:

    - `JsonMetadataDomain.java`: Introduces the base abstract class
    `JsonMetadataDomain` for metadata domains that use JSON as their
    configuration string. Concrete implementations, such as
    `RowTrackingMetadataDomain`, should extend this class to define their
    specific metadata domain. This class provides utility functions for
      - serializing to/deserializing from a JSON configuration string
      - creating a `DomainMetadata` action for committing
      - creating a specific metadata domain instance from a `SnapshotImpl`

    - `RowTrackingMetadataDomain.java`: Implements the metadata domain
    `delta.rowT…