[Spark][3.3] Make Identity Column High Water Mark updates consistent #3990

c27kwan · 2024-12-19T15:36:32Z

Description

Currently:

When we do a MERGE, we will always call setTrackHighWaterMarks on the transaction. This will have an effect if there is an INSERT clause in the MERGE.
When setTrackHighWaterMarks is called, we collect the max/min of the column using DeltaIdentityColumnStatsTracker. This stats tracker is only invoked on files that are written/rewritten. These min/max values are compared with the existing high watermark. If the high watermark doesn't exist, we keep the largest of the max values or the lowest of the min values as the high watermark, without checking it against the starting value of the identity column.
If an identity column has not generated a value yet, the high watermark is None and isn't stored in the table. This is true for GENERATED ALWAYS AS IDENTITY tables when they are empty, and for GENERATED BY DEFAULT AS IDENTITY tables when they only contain user inserted values for the identity column.
If you run a MERGE UPSERT that only ends up updating values in a GENERATED BY DEFAULT table that doesn't have a high watermark yet, we will write a new high watermark equal to the highest value in the updated file, which may be lower than the starting value specified for the identity column.
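
The failure mode described above can be modeled in a few lines. This is a minimal Python sketch, not Delta code: the function name `legacy_update_high_water_mark` and its parameters are hypothetical, and it assumes the tracked bound (max or min) is chosen by the sign of the identity step.

```python
from typing import Optional


def legacy_update_high_water_mark(
    existing: Optional[int], file_max: int, file_min: int, step: int
) -> int:
    """Hypothetical model of the pre-change update described above."""
    # Assumption: a positive step tracks the max, a negative step the min.
    candidate = file_max if step > 0 else file_min
    if existing is None:
        # No high watermark yet: the candidate is kept as-is,
        # with no comparison against the column's start value.
        return candidate
    return max(existing, candidate) if step > 0 else min(existing, candidate)


# Scenario: an identity column declared with start 100 and step 1, in a
# GENERATED BY DEFAULT table that so far holds only user inserted ids 1..5.
# A MERGE that only updates rows rewrites a file containing those ids.
start = 100
new_hwm = legacy_update_high_water_mark(None, file_max=5, file_min=1, step=1)
print(new_hwm, new_hwm < start)
```

In this model the stored high watermark ends up at 5 even though the column was declared to start at 100, which is the inconsistency this PR targets.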

Proposal:

This PR makes all high water mark updates go through the same validation function by default. The function will not update the high watermark if the new value violates the start or the existing high watermark. The exception is when the table already has a corrupted high water mark.
This does NOT prevent the scenario where we automatically set the high watermark for a generated by default column based on user inserted values when it does respect the start.
Previously, we did not do high water mark rounding on the updateSchema path. This seems erroneous as the min/max values can be user inserted. We fix that in this PR.
Previously, on SYNC IDENTITY we did not check whether the resulting max falls below the existing high water mark. We now check this invariant and block the update by default. A SQLConf has been introduced to allow reducing the high water mark if the user wants.
We add logging to catch bad high water marks.
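
As an illustration of the proposal, here is a minimal Python sketch of a shared validation step with rounding. It is not the Delta implementation: `round_to_next`, `validated_high_water_mark`, and the `allow_lowering` flag (standing in for the new SQLConf) are hypothetical names, and the corrupted high water mark exception is left out because the description does not spell out its exact behavior.

```python
from typing import Optional


def round_to_next(value: int, start: int, step: int) -> int:
    """Round `value` forward (in the step direction) onto start + k * step."""
    offset = value - start
    if offset % step == 0:
        return value
    # Ceiling division that works for either sign of `step`.
    k = -((-offset) // step)
    return start + k * step


def validated_high_water_mark(
    existing: Optional[int],
    candidate: int,
    start: int,
    step: int,
    allow_lowering: bool = False,  # hypothetical stand-in for the SQLConf
) -> Optional[int]:
    """Return the high water mark to keep after seeing `candidate`."""
    direction = 1 if step > 0 else -1
    rounded = round_to_next(candidate, start, step)
    # Reject candidates that fall before the start value.
    if direction * (rounded - start) < 0:
        return existing
    # Reject candidates behind the existing mark unless lowering is allowed.
    if (
        existing is not None
        and direction * (rounded - existing) < 0
        and not allow_lowering
    ):
        return existing
    return rounded


# The MERGE scenario: no mark yet, candidate 5 is below start 100.
print(validated_high_water_mark(None, 5, start=100, step=1))
# A user inserted value that respects the start is rounded onto the sequence.
print(validated_high_water_mark(None, 8, start=1, step=3))
```

Under these assumptions the first call leaves the high water mark unset, and the second yields 10, the next value of the form 1 + 3k at or after 8.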

How was this patch tested?

New tests that were failing prior to this change.

Does this PR introduce any user-facing changes?

No

larsk-db

LGTM

c27kwan added 2 commits December 19, 2024 16:35

[Spark][3.3] Make Identity Column High Water Mark updates consistent

297f2a7

fix roundToNext

5453963

larsk-db approved these changes Dec 19, 2024

allisonport-db merged commit 9b15a8f into delta-io:branch-3.3 Dec 19, 2024
16 of 19 checks passed