[Spark][3.3] Make Identity Column High Water Mark updates consistent #3990
Which Delta project/connector is this regarding?

Spark
Description
Currently:

- We call `setTrackHighWaterMarks` on the transaction. This only has an effect if there is an INSERT clause in the MERGE.
- When we call `setTrackHighWaterMarks`, we collect the max/min of the column using `DeltaIdentityColumnStatsTracker`. This stats tracker is only invoked on files that are written or rewritten. These min/max values are compared with the existing high water mark. If the high water mark doesn't exist, we keep as the high water mark the largest of the max (or the lowest of the min) without checking against the starting value of the identity column.

Proposal:
- Check the collected min/max against the identity column's start value before adopting a new high water mark in the `updateSchema` path. Skipping this check seems erroneous, as the min/max values can be user inserted. We fix that in this PR.

How was this patch tested?
New tests that were failing prior to this change.
Does this PR introduce any user-facing changes?
No
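For illustration, here is a minimal sketch of the high-water-mark update rule described above. This is hypothetical code, not the actual Delta implementation: the function name, its signature, and the assumption that the only inputs are the observed min/max, the existing high water mark, and the column's start/step are illustrative.

```python
# Hypothetical sketch of identity-column high water mark selection.
# Not the actual Delta implementation; names and signatures are illustrative.

def update_high_water_mark(current_hwm, observed_min, observed_max, start, step):
    """Pick the next high water mark from min/max stats over written files.

    For a positive step the column grows, so the observed max is the
    candidate; for a negative step, the observed min is.
    """
    candidate = observed_max if step > 0 else observed_min
    if current_hwm is not None:
        # An existing high water mark only ever moves in the step direction.
        return max(current_hwm, candidate) if step > 0 else min(current_hwm, candidate)
    # No high water mark yet. The inconsistency described above is adopting
    # the candidate unconditionally; user-inserted values could then drag the
    # high water mark behind the column's configured start. Clamping against
    # the start value avoids that.
    return max(start, candidate) if step > 0 else min(start, candidate)

# A user-inserted value of 5 must not pull the high water mark below a
# start of 100 (ascending column, no prior high water mark):
print(update_high_water_mark(None, 5, 5, start=100, step=1))      # 100
print(update_high_water_mark(None, 250, 250, start=100, step=1))  # 250
```

The clamp in the no-high-water-mark branch is the behavioral change this PR argues for; the existing-high-water-mark branch already only moves in the step direction.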