Support in-commit timestamps in Change Data Feed #581
Codecov Report
Additional details and impacted files

@@            Coverage Diff            @@
##             main     #581     +/-   ##
==========================================
+ Coverage   82.45%   82.50%   +0.04%
==========================================
  Files          72       72
  Lines       16054    16102     +48
  Branches    16054    16102     +48
==========================================
+ Hits        13238    13285     +47
+ Misses       2173     2172      -1
- Partials      643      645      +2

View full report in Codecov by Sentry.
Approach looks good, but we're using the wrong column name.
} else if let Some(timestamp) = getters[16].get_long(i, "commitInfo.timestamp")? {
    *self.timestamp = timestamp;
According to the Delta spec:
> The `commitInfo` action must be the first action in the commit... [and] must include a field named `inCommitTimestamp`, of type `long`
The current `CommitInfo` in kernel doesn't even define that field; the `timestamp` field is just an example from the spec (that predates ICT) of the sorts of information a commit might include, and I believe it's something delta-spark was already emitting for years before ICT came along. I don't know why the ICT spec couldn't just "take over" that existing `timestamp` field, nor whether the two fields necessarily have the same value.
Also, either now or as a follow-up, we should probably enforce that the commit info does indeed come first? But we'd have to verify that ICT is actually enabled first, since the presence, position, and content of the commit info action is otherwise unconstrained.
Finally, the spec doesn't specifically say what we should do with an ICT that we find in a non-ICT table? Presumably we should ignore it because it's not the timestamp that time travel would use?
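To make the questions above concrete, here is a minimal sketch of the check, assuming we ignore any ICT found in a non-ICT table and reject an ICT-enabled commit whose first action is not a commit info carrying `inCommitTimestamp`. The `Action` and `IctError` types and the `in_commit_timestamp` function are hypothetical stand-ins for illustration, not kernel's actual types or API.

```rust
/// Hypothetical, simplified view of the actions in one commit.
/// These are NOT the kernel's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    /// A commitInfo action, with its optional inCommitTimestamp field.
    CommitInfo { in_commit_timestamp: Option<i64> },
    /// Any other action (add, remove, metadata, ...).
    Other,
}

/// Hypothetical error type for ICT validation failures.
#[derive(Debug, PartialEq)]
pub enum IctError {
    /// ICT is enabled but the first action is not a commitInfo.
    CommitInfoNotFirst,
    /// ICT is enabled but the commitInfo has no inCommitTimestamp.
    MissingInCommitTimestamp,
}

/// Returns `Ok(Some(ts))` for a valid ICT commit and `Ok(None)` when ICT is
/// not enabled, in which case the caller would fall back to some other
/// timestamp source.
pub fn in_commit_timestamp(
    actions: &[Action],
    ict_enabled: bool,
) -> Result<Option<i64>, IctError> {
    if !ict_enabled {
        // Assumption: an ICT found in a non-ICT table is ignored, because it
        // is not the timestamp that time travel would use.
        return Ok(None);
    }
    match actions.first() {
        Some(Action::CommitInfo { in_commit_timestamp: Some(ts) }) => Ok(Some(*ts)),
        Some(Action::CommitInfo { in_commit_timestamp: None }) => {
            Err(IctError::MissingInCommitTimestamp)
        }
        // Covers both an empty commit and a commit that starts with another action.
        _ => Err(IctError::CommitInfoNotFirst),
    }
}
```

Whether these cases should be hard errors or silently fall back is exactly the open question; the sketch just picks the strict option.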
Wow, this is really annoying. The test tables from delta-rs seem to just pick up the timestamp field in commitInfo. Meanwhile, delta-spark uses some external service to get non-ICT timestamps.
The issue is that without a commitInfo-based timestamp, we can't get reliable E2E testing for timestamps. Git doesn't preserve timestamps. Perhaps for tests, I should just project the timestamp column out?
Regarding ICT enablement, this is the behaviour of delta-spark:
// If the commit has an In-Commit Timestamp, we should use that as the commit timestamp.
// Note that it is technically possible for a commit range to begin with ICT commits
// followed by non-ICT commits, and end with ICT commits again. Ideally, for these commits
// we should use the file modification time for the first two ranges. However, this
// scenario is an edge case not worth optimizing for.
val ts = commitInfo
.flatMap(_.inCommitTimestamp)
.map(ict => new Timestamp(ict))
.getOrElse(nonICTTimestampsByVersion.get(v).orNull)
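For comparison, a rough Rust rendering of the same fallback, as a sketch only: prefer the in-commit timestamp when the commit has one, otherwise look up the non-ICT timestamp recorded for that version. The function name and the `HashMap` keyed by version are assumptions made for illustration and do not correspond to anything in kernel.

```rust
use std::collections::HashMap;

/// Picks the timestamp for commit `version`.
///
/// `in_commit_timestamp` is the ICT read from the commit's commitInfo, if any.
/// `non_ict_timestamps_by_version` is a hypothetical map from version to a
/// timestamp obtained some other way (for example, file modification time).
pub fn commit_timestamp(
    in_commit_timestamp: Option<i64>,
    non_ict_timestamps_by_version: &HashMap<u64, i64>,
    version: u64,
) -> Option<i64> {
    // Same precedence as the Scala snippet: ICT first, then the per-version
    // fallback, and `None` (Scala's `orNull`) when neither is available.
    in_commit_timestamp.or_else(|| non_ict_timestamps_by_version.get(&version).copied())
}
```

Like the Scala version, this does not try to handle a range that mixes ICT and non-ICT commits in any special way.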
Okay, so I spoke to @zachschuermann, and the plan is to fix the delta-rs tests by changing their field from "timestamp" to "inCommitTimestamp".
What changes are proposed in this pull request?
This PR adds support for in-commit timestamps in change data feed. We update the `PreparePhaseVisitor` to read `CommitInfo` and extract the timestamp if it is present.
This addresses part of #559
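As a rough illustration of the idea (not the actual `PreparePhaseVisitor` code), the sketch below walks the rows of a commit and overwrites a default timestamp with the first non-null in-commit timestamp it finds. The `GetLong` trait, the `LongColumn` type, and `PrepareVisitorSketch` are hypothetical simplifications, and treating the default as the file modification time is an assumption.

```rust
/// Hypothetical stand-in for a column getter: returns the value at `row`,
/// or `None` when the field is null for that row.
pub trait GetLong {
    fn get_long(&self, row: usize) -> Option<i64>;
}

/// Simple in-memory column used only for this sketch.
pub struct LongColumn(pub Vec<Option<i64>>);

impl GetLong for LongColumn {
    fn get_long(&self, row: usize) -> Option<i64> {
        // Out-of-range rows are treated as null.
        self.0.get(row).copied().flatten()
    }
}

/// Simplified visitor: starts from a default timestamp and replaces it
/// if the commit carries an in-commit timestamp.
pub struct PrepareVisitorSketch {
    pub timestamp: i64,
}

impl PrepareVisitorSketch {
    /// `default_timestamp` is assumed to be the file modification time.
    pub fn new(default_timestamp: i64) -> Self {
        Self { timestamp: default_timestamp }
    }

    /// Visits `row_count` rows, reading the commitInfo.inCommitTimestamp column.
    pub fn visit(&mut self, row_count: usize, in_commit_timestamp: &dyn GetLong) {
        for i in 0..row_count {
            if let Some(ts) = in_commit_timestamp.get_long(i) {
                self.timestamp = ts;
                // A commit has a single commitInfo, so the first hit is enough.
                break;
            }
        }
    }
}
```

If no row has the field set, the default timestamp is left untouched, which matches "extract the timestamp if it is present".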
How was this change tested?
We check that:
CommitInfo
, the timestamp we extract must match the CommitInfo.TableChangesScanData
.