Support in-commit timestamps in Change Data Feed #581

Draft: wants to merge 5 commits into main

OussamaSaoudi-db (Collaborator) commented Dec 9, 2024

What changes are proposed in this pull request?

This PR adds support for in-commit timestamps in change data feed. We update the PreparePhaseVisitor to read CommitInfo and extract the timestamp if it is present.

This addresses part of #559

How was this change tested?

We check that:

  • For a table with CommitInfo, the timestamp we extract must match the CommitInfo.
  • The CommitInfo must be ignored when generating TableChangesScanData.
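
The two checks above can be sketched roughly as follows. This is only an illustration, not the kernel's code: `Action`, `Prepared`, and `prepare_commit` are hypothetical stand-ins, and which `commitInfo` field should supply the timestamp is deliberately left generic here.

```rust
/// Hypothetical, simplified stand-ins for the kernel's action types.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    /// `timestamp` is a placeholder for whichever commitInfo field is used.
    CommitInfo { timestamp: Option<i64> },
    Add { path: String },
    Remove { path: String },
}

/// Hypothetical result of preparing one commit for change data feed.
#[derive(Debug, PartialEq)]
struct Prepared {
    timestamp: i64,
    /// Actions that would go on to produce scan data (never CommitInfo).
    file_actions: Vec<Action>,
}

/// Walk a commit's actions, taking the timestamp from CommitInfo when present
/// and falling back to the commit file's modification time otherwise.
fn prepare_commit(actions: &[Action], file_modification_time: i64) -> Prepared {
    let mut timestamp = file_modification_time;
    let mut file_actions = Vec::new();
    for action in actions {
        match action {
            // A CommitInfo with a timestamp overrides the fallback.
            Action::CommitInfo { timestamp: Some(ts) } => timestamp = *ts,
            // A CommitInfo without one is skipped entirely.
            Action::CommitInfo { timestamp: None } => {}
            // Everything else is kept for scan data generation.
            other => file_actions.push(other.clone()),
        }
    }
    Prepared { timestamp, file_actions }
}
```

A test along these lines would then assert that `prepare_commit(...).timestamp` equals the value stored in the table's CommitInfo, and that `file_actions` contains no `Action::CommitInfo`.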


codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.

Project coverage is 82.50%. Comparing base (ed714c5) to head (f45f8ef).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/table_changes/log_replay.rs | 87.50% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #581      +/-   ##
==========================================
+ Coverage   82.45%   82.50%   +0.04%     
==========================================
  Files          72       72              
  Lines       16054    16102      +48     
  Branches    16054    16102      +48     
==========================================
+ Hits        13238    13285      +47     
+ Misses       2173     2172       -1     
- Partials      643      645       +2     


scovich (Collaborator) left a comment

Approach looks good, but we're using the wrong column name.

Comment on lines 363 to 364
} else if let Some(timestamp) = getters[16].get_long(i, "commitInfo.timestamp")? {
    *self.timestamp = timestamp;

According to the Delta spec:

The commitInfo action must be the first action in the commit... [and] must include a field named inCommitTimestamp, of type long

The current CommitInfo in kernel doesn't even define that field; the timestamp field is just an example from the spec (that predates ICT) of the sorts of information a commit might include, and I believe it's something delta-spark was already emitting for years before ICT came along. I don't know why the ICT spec couldn't just "take over" that existing timestamp field, nor whether the two fields necessarily have the same value.

Also, either now or as a follow-up, we should probably enforce that the commit info does indeed come first? But we'd have to verify that ICT is actually enabled first, since the presence, position, and content of the commit info action is otherwise unconstrained.

Finally, the spec doesn't specifically say what we should do with an ICT that we find in a non-ICT table? Presumably we should ignore it because it's not the timestamp that time travel would use?
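
For illustration, the rules raised in this comment could be sketched as below. This is a hedged sketch, not the kernel's API: `CommitInfo`, `IctError`, and `commit_timestamp` are hypothetical names, and ignoring an ICT found in a non-ICT table is the assumption floated above, not something the spec states.

```rust
/// Hypothetical, minimal stand-in for the kernel's commit info action.
#[derive(Debug, Clone, PartialEq)]
struct CommitInfo {
    in_commit_timestamp: Option<i64>,
}

/// Hypothetical error cases for a table that has ICT enabled.
#[derive(Debug, PartialEq)]
enum IctError {
    /// The first action of the commit was not a commitInfo.
    CommitInfoNotFirst,
    /// The commitInfo had no inCommitTimestamp field.
    MissingInCommitTimestamp,
}

/// Pick the timestamp for one commit.
/// `first_action` is `Some` only if the commit's first action is a commitInfo.
fn commit_timestamp(
    ict_enabled: bool,
    first_action: Option<&CommitInfo>,
    file_modification_time: i64,
) -> Result<i64, IctError> {
    if !ict_enabled {
        // Assumption: without ICT enabled, any inCommitTimestamp is ignored and
        // the presence or position of commitInfo is not checked.
        return Ok(file_modification_time);
    }
    // With ICT enabled, commitInfo must come first and carry the timestamp.
    let commit_info = first_action.ok_or(IctError::CommitInfoNotFirst)?;
    commit_info
        .in_commit_timestamp
        .ok_or(IctError::MissingInCommitTimestamp)
}
```

Returning an error when ICT is enabled but the commit is malformed is one possible design; a more lenient reader could fall back to the file modification time instead.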

OussamaSaoudi-db (Collaborator, Author) commented Dec 10, 2024

Wow, this is really annoying. The test tables from delta-rs seem to just pick up the timestamp field in commitInfo. Meanwhile, delta-spark uses some external service to get non-ICT timestamps.

The issue is that without a commitInfo-based timestamp, we can't get reliable E2E testing for timestamps. Git doesn't preserve timestamps. Perhaps for tests, I should just project the timestamp column out?

Regarding ICT enablement, this is the behaviour of delta-spark:

// If the commit has an In-Commit Timestamp, we should use that as the commit timestamp.
// Note that it is technically possible for a commit range to begin with ICT commits
// followed by non-ICT commits, and end with ICT commits again. Ideally, for these commits
// we should use the file modification time for the first two ranges. However, this
// scenario is an edge case not worth optimizing for.
val ts = commitInfo
  .flatMap(_.inCommitTimestamp)
  .map(ict => new Timestamp(ict))
  .getOrElse(nonICTTimestampsByVersion.get(v).orNull)
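
A rough Rust equivalent of that fallback, for comparison. The names `CommitInfo`, `resolve_timestamp`, and `non_ict_timestamps_by_version` are hypothetical and only mirror the Scala snippet; `None` plays the role of `orNull`.

```rust
use std::collections::HashMap;

/// Hypothetical, minimal stand-in for a commit info action.
struct CommitInfo {
    in_commit_timestamp: Option<i64>,
}

/// Prefer the in-commit timestamp; otherwise fall back to the timestamp
/// recorded for this version by some other means (e.g. file modification time).
fn resolve_timestamp(
    commit_info: Option<&CommitInfo>,
    non_ict_timestamps_by_version: &HashMap<u64, i64>,
    version: u64,
) -> Option<i64> {
    commit_info
        .and_then(|ci| ci.in_commit_timestamp)
        .or_else(|| non_ict_timestamps_by_version.get(&version).copied())
}
```

Like the Scala code, this decides per commit and does not check whether ICT is enabled for the table at that version.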

OussamaSaoudi-db (Collaborator, Author) commented

Okay, so I spoke to @zachschuermann, and the plan is to fix the delta-rs tests by changing their field from "timestamp" to "inCommitTimestamp".

OussamaSaoudi-db marked this pull request as draft December 11, 2024 00:08
github-actions bot added the breaking-change label (Change that will require a version bump) Dec 11, 2024