Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark][3.3] Make Identity Column High Water Mark updates consistent #3990

Merged
merged 2 commits into from
Dec 19, 2024

Conversation

c27kwan
Copy link
Contributor

@c27kwan c27kwan commented Dec 19, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Currently:

  • When we do a MERGE, we will always call setTrackHighWaterMarks on the transaction. This will have an effect if there is an INSERT clause in the MERGE.
  • If we setTrackHighWaterMarks, we collect the max/min of the column using DeltaIdentityColumnStatsTracker. This stats tracker is only invoked on files that are written/rewritten. These min/max values are compared with the existing high watermark. If the high watermark doesn't exist, we will keep as high watermark the largest of the max or the lowest of the min without checking against the starting value of the identity column.
  • If an identity column did not generate a value yet, the high watermark is None and isn't stored in the table. This is true for GENERATED ALWAYS AS IDENTITY tables when it is empty and true for GENERATED BY DEFAULT AS IDENTITY tables when it only has user inserted values for the identity column.
  • If you run a MERGE UPSERT that only ends up updating values in a GENERATED BY DEFAULT table that doesn't have a high watermark yet, we will write a new high watermark that is the highest for the updated file, which may be lower than the starting value specified for the identity column.

Proposal:

  • This PR makes all high water mark updates go through the same validation function by default. It will not update the high watermark if it violates the start or the existing high watermark. Exception is if the table already has a corrupted high water mark.
  • This does NOT prevent the scenario where we automatically set the high watermark for a generated by default column based on user inserted values when it does respect the start.
  • Previously, we did not do high water mark rounding on the updateSchema path. This seems erroneous as the min/max values can be user inserted. We fix that in this PR.
  • Previously, we did not validate that on SYNC identity, the result of max can be below the existing high water mark. Now, we also do check this invariant and block it by default. A SQLConf has been introduced to allow reducing the high water mark if the user wants.
  • We add logging to catch bad high water mark.

How was this patch tested?

New tests that were failing prior to this change.

Does this PR introduce any user-facing changes?

No

Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@allisonport-db allisonport-db merged commit 9b15a8f into delta-io:branch-3.3 Dec 19, 2024
16 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants