Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: handle unrecognized file types in Delta log #575

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

cg-cognition
Copy link

What changes are proposed in this pull request?

This PR fixes #496 .

Specifically, we made the following changes:

Modified list_log_files to properly handle unrecognized file types in the Delta log by filtering out unknown patterns while preserving valid commit and checkpoint files. This aligns with the Delta specification requirement to ignore unknown file types.

Changes:

Added explicit filtering for commit and checkpoint files
Improved error handling to silently skip unrecognized patterns
Added comprehensive tests for unusual file patterns including:
Hex-based commit files
Bogus checkpoint numbering
V2 checkpoints
Compacted log files
CRC files

How was this change tested?

Ran cargo test --all-features --all-targets -- --skip read_table_version_hdfs.

I had some issues getting read_table_version_hdfs to work due to a Java dependency and noticed there are a few places where we skipped it. It also seemed unrelated to this change.

Additional Context

This PR was entirely written by Devin with a little bit of review from me. Happy to address any feedback and get this over the finish line.

Original Run: https://preview.devin.ai/sessions/032b39d6100a4277a7441f0a1eed085c

- Modified list_log_files to properly handle unrecognized file types
- Added test coverage for unusual file patterns:
  - Hex decimal commit files
  - Bogus part numbering in checkpoints
  - V2 checkpoint files with UUIDs
  - Compacted log files
  - CRC files

The changes ensure that unrecognized files are silently ignored
while maintaining proper processing of valid log files.
- Changed assertions to verify that invalid patterns are filtered out
- Verify exactly one valid commit file remains
- Remove incorrect unknown file type check
- Add filter to only keep commits and checkpoints
- Remove outdated TODO comment about .crc files
- Ensures unrecognized file patterns are properly filtered out
- Add explicit filter to only keep commits and checkpoints
- Ensures unrecognized file patterns are properly filtered out
- Maintains existing error handling for other types of errors
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

.filter(|path_res| {
path_res
.as_ref()
.map_or(true, |path| path.is_commit() || path.is_checkpoint())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This keeps commit/checkpoint (map) and errors (or), right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's right

assert!(result.is_ok(), "list_log_files should not fail");

// Collect the results and verify
let paths: Vec<_> = result.unwrap().collect::<Result<Vec<_>, _>>().unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could also use itertools::Itertools and then

    let paths: Vec<_> = result.unwrap().try_collect().unwrap();

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch! updated

Copy link

codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 91.89189% with 3 lines in your changes missing coverage. Please review.

Project coverage is 82.86%. Comparing base (5816620) to head (2066988).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/log_segment/tests.rs 90.32% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #575      +/-   ##
==========================================
+ Coverage   82.84%   82.86%   +0.02%     
==========================================
  Files          73       73              
  Lines       16426    16461      +35     
  Branches    16426    16461      +35     
==========================================
+ Hits        13608    13641      +33     
  Misses       2174     2174              
- Partials      644      646       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -289,9 +289,13 @@ fn list_log_files(

Ok(fs_client
.list_from(&start_from)?
.map(|meta| ParsedLogPath::try_from(meta?))
// TODO this filters out .crc files etc which start with "." - how do we want to use these kind of files?
.map(|meta| meta.map(|m| ParsedLogPath::try_from(m).ok().and_then(identity)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: .and_then(identity) is flattening a nested option right? Could use .flatten() in that case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, i actually didn't know about this -- will update

@cg-cognition
Copy link
Author

@OussamaSaoudi-db can you review this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed parsing of log files should be ignored
3 participants