-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: handle unrecognized file types in Delta log #575
base: main
Are you sure you want to change the base?
fix: handle unrecognized file types in Delta log #575
Conversation
- Modified list_log_files to properly handle unrecognized file types - Added test coverage for unusual file patterns: - Hex decimal commit files - Bogus part numbering in checkpoints - V2 checkpoint files with UUIDs - Compacted log files - CRC files The changes ensure that unrecognized files are silently ignored while maintaining proper processing of valid log files.
- Changed assertions to verify that invalid patterns are filtered out - Verify exactly one valid commit file remains - Remove incorrect unknown file type check
- Add filter to only keep commits and checkpoints - Remove outdated TODO comment about .crc files - Ensures unrecognized file patterns are properly filtered out
- Add explicit filter to only keep commits and checkpoints - Ensures unrecognized file patterns are properly filtered out - Maintains existing error handling for other types of errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
.filter(|path_res| { | ||
path_res | ||
.as_ref() | ||
.map_or(true, |path| path.is_commit() || path.is_checkpoint()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This keeps commit/checkpoint (map
) and errors (or
), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's right
kernel/src/log_segment/tests.rs
Outdated
assert!(result.is_ok(), "list_log_files should not fail"); | ||
|
||
// Collect the results and verify | ||
let paths: Vec<_> = result.unwrap().collect::<Result<Vec<_>, _>>().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could also use itertools::Itertools
and then
let paths: Vec<_> = result.unwrap().try_collect().unwrap();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch! updated
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #575 +/- ##
==========================================
+ Coverage 82.84% 82.86% +0.02%
==========================================
Files 73 73
Lines 16426 16461 +35
Branches 16426 16461 +35
==========================================
+ Hits 13608 13641 +33
Misses 2174 2174
- Partials 644 646 +2 ☔ View full report in Codecov by Sentry. |
kernel/src/log_segment.rs
Outdated
@@ -289,9 +289,13 @@ fn list_log_files( | |||
|
|||
Ok(fs_client | |||
.list_from(&start_from)? | |||
.map(|meta| ParsedLogPath::try_from(meta?)) | |||
// TODO this filters out .crc files etc which start with "." - how do we want to use these kind of files? | |||
.map(|meta| meta.map(|m| ParsedLogPath::try_from(m).ok().and_then(identity))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: .and_then(identity)
is flattening a nested option right? Could use .flatten()
in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, i actually didn't know about this -- will update
@OussamaSaoudi-db can you review this? |
What changes are proposed in this pull request?
This PR fixes #496 .
Specifically, we made the following changes:
Modified list_log_files to properly handle unrecognized file types in the Delta log by filtering out unknown patterns while preserving valid commit and checkpoint files. This aligns with the Delta specification requirement to ignore unknown file types.
Changes:
Added explicit filtering for commit and checkpoint files
Improved error handling to silently skip unrecognized patterns
Added comprehensive tests for unusual file patterns including:
Hex-based commit files
Bogus checkpoint numbering
V2 checkpoints
Compacted log files
CRC files
How was this change tested?
Ran
cargo test --all-features --all-targets -- --skip read_table_version_hdfs
.I had some issues getting
read_table_version_hdfs
to work due to a Java dependency and noticed there are a few places where we skipped it. It also seemed unrelated to this change.Additional Context
This PR was entirely written by Devin with a little bit of review from me. Happy to address any feedback and get this over the finish line.
Original Run: https://preview.devin.ai/sessions/032b39d6100a4277a7441f0a1eed085c