Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with out of date last checkpoint, and clean listing function #354

Merged
merged 15 commits into from
Oct 2, 2024

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Sep 25, 2024

If we have a _last_checkpoint that is out of date, things can get confused. This code:

  1. Cleans up the listing function a bit
  2. Ensures we end up with the real latest checkpoint
  3. Drops any commit files from the listing that are older than the last checkpoint
  4. warns! if _last_checkpoint is out of date
  5. Adds a test for this case

This code will conflict with #347, so maybe hold of merging until that merges and then I can rebase and clean this up more.

Copy link

codecov bot commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 84.31373% with 16 lines in your changes missing coverage. Please review.

Project coverage is 74.98%. Comparing base (5b98c85) to head (353673c).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/snapshot.rs 84.31% 13 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #354      +/-   ##
==========================================
+ Coverage   74.88%   74.98%   +0.10%     
==========================================
  Files          43       43              
  Lines        8409     8496      +87     
  Branches     8409     8496      +87     
==========================================
+ Hits         6297     6371      +74     
- Misses       1724     1737      +13     
  Partials      388      388              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

.collect::<Result<Vec<_>, Error>>()?
.into_iter()
let mut max_checkpoint_version = checkpoint_metadata.version;
let mut checkpoint_files = Vec::with_capacity(10);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, any reason why you expect the Vec to be at most size 10?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is odd, because most checkpoints will have just one file?
I wonder if the intent was to pre-size the commit_files to 10, since that's the default checkpoint interval for delta-spark?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. I've swapped them so we reserve 10 for commits.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably fine, but will hold off for the rebase because it should clean up the most important logic.

.collect::<Result<Vec<_>, Error>>()?
.into_iter()
let mut max_checkpoint_version = checkpoint_metadata.version;
let mut checkpoint_files = Vec::with_capacity(10);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is odd, because most checkpoints will have just one file?
I wonder if the intent was to pre-size the commit_files to 10, since that's the default checkpoint interval for delta-spark?

Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

);
// we may need to drop some commits that are after the actual last checkpoint
commit_files.retain(|parsed_path| parsed_path.version > max_checkpoint_version);
} else if checkpoint_files.len() != checkpoint_metadata.parts.unwrap_or(1) as usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work?

Suggested change
} else if checkpoint_files.len() != checkpoint_metadata.parts.unwrap_or(1) as usize {
} else if checkpoint_files.len() != checkpoint_metadata.parts.unwrap_or(1usize) {

(replace usize with the suffix for whatever type parts has)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. the issue is that checkpoint_metadata.parts was an i32, so the 1 was implicitly an i32, and trying to make it a usize doesn't work because the cast needs to happen after the unwrap.

Anyway, since the reason we have parts is to compare it against a vector len, I made it a usize in the struct, and we don't have to cast. It means serde will fail if there's a negative number in the json, but that's probably okay since that's a broken _last_checkpoint anyway.

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm couple nits/followups

kernel/src/snapshot.rs Show resolved Hide resolved
kernel/src/snapshot.rs Show resolved Hide resolved
kernel/src/snapshot.rs Outdated Show resolved Hide resolved
kernel/src/snapshot.rs Show resolved Hide resolved
nicklan and others added 11 commits October 2, 2024 14:49
Co-authored-by: Zach Schuermann <[email protected]>
We already have constants like `DataType::LONG` that can avoid
boilerplate like `DataType::Primitive(PrimitiveType::Long)`, and such
constants can be used in match arms. Update the code to use them.
Turns out we don't use lazy static in many places so potentially
removing the dependency is straight forward. Lazy static itself even
gives an example from the standard library.


https://github.com/rust-lang-nursery/lazy-static.rs?tab=readme-ov-file#standard-library

---------

Co-authored-by: Ryan Johnson <[email protected]>
remove sync-engine from default features so there are no features
enabled by default.
@nicklan nicklan merged commit c81da02 into delta-io:main Oct 2, 2024
12 checks passed
OussamaSaoudi-db pushed a commit to OussamaSaoudi-db/delta-kernel-rs that referenced this pull request Oct 7, 2024
…elta-io#354)

If we have a `_last_checkpoint` that is out of date, things can get
confused. This code:

1. Cleans up the listing function a bit
2. Ensures we end up with the real latest checkpoint
3. Drops any commit files from the listing that are older than the last
checkpoint
4. `warns!` if ` _last_checkpoint` is out of date
5. Adds a test for this case

This code will conflict with delta-io#347, so maybe hold of merging until that
merges and then I can rebase and clean this up more.

---------

Co-authored-by: Nick Lanham <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Stephen Carman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants