-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LogSegmentBuilder
to construct LogSegment
s
#457
Add LogSegmentBuilder
to construct LogSegment
s
#457
Conversation
Log segment is going to be used by both snapshot and table changes. It makes sense to separate it into its own module
LogSegmentBuilder
to construct LogSegment
s
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #457 +/- ##
==========================================
+ Coverage 79.60% 79.97% +0.37%
==========================================
Files 57 57
Lines 12330 12585 +255
Branches 12330 12585 +255
==========================================
+ Hits 9815 10065 +250
- Misses 1987 1993 +6
+ Partials 528 527 -1 ☔ View full report in Codecov by Sentry. |
…nto snapshot_cleanup
Co-authored-by: Ryan Johnson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with one change you should make
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few final comments - can we also add a test with a start_checkpoint
and an end_version
Oh and note that |
Co-authored-by: Zach Schuermann <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for taking some time for this! did we add test from above for a start_checkpoint
and an end_version
check?
kernel/src/log_segment.rs
Outdated
/// 2. Commit file versions will be greater than or equal to `start_version` if specified. | ||
/// 3. If checkpoint(s) is/are present in the range, only commits with versions greater than the most | ||
/// recent checkpoint version are retained. Checkpoints can be omitted (and this rule skipped) | ||
/// whenever the `LogSegment` is created. See `LogSegmentBuilder::with_omit_checkpoint_files`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't exactly roll off the tongue... should be call it without_checkpoint_files
?
kernel/src/log_segment.rs
Outdated
/// - For a Snapshot at version `n`: Its LogSegment is made up of zero or one checkpoint, and all | ||
/// commits between the checkpoint and the end version `n`. | ||
/// - For a TableChanges between versions `a` and `b`: Its LogSegment is made up of zero | ||
/// checkpoints and all commits between versions `a` and `b` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// checkpoints and all commits between versions `a` and `b` | |
/// checkpoints and all commits between versions `a` and `b` (inclusive) |
/// checkpoint files in the log segment. | ||
pub checkpoint_files: Vec<ParsedLogPath>, | ||
/// Checkpoint files in the log segment. | ||
pub checkpoint_parts: Vec<ParsedLogPath>, | ||
} | ||
|
||
impl LogSegment { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: can we rename LogSegment::replay
to something else? I don't think the name was ever super accurate ("replay" jargon in Delta normally implies dedup logic that is missing here). Really this code is just producing a chunked iterator of log actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good - let me take this in my follow up :)
commit_read_schema, | ||
meta_predicate.clone(), | ||
)? | ||
.read_json_files(&commit_files, commit_read_schema, meta_predicate.clone())? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, I'm surprised these read_xxx_files
methods didn't already take FileMeta
. Do we have a tracking item to consider changing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these take a slice of FileMeta
(location is a FileMeta
- perhaps we need a rename there?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhhh! I worried for a moment that we were passing just file names. You're right tho, location
is a FileMeta
now in this PR.
kernel/src/log_segment.rs
Outdated
pub(crate) fn new() -> Self { | ||
LogSegmentBuilder { | ||
start_checkpoint: None, | ||
start_version: None, | ||
end_version: None, | ||
sort_commit_files_ascending: false, | ||
omit_checkpoint_files: false, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just #[derive(Default)]
?
/// Note: Either `start_version` or `start_checkpoint` may be specified. Attempting to build a [`LogSegment`] | ||
/// with both will result in an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the builder model, is it considered better to catch errors when injected to the builder? Or to wait until the end and check everything at once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, @zachschuermann had a good point IMO:
I'd like to reconsider having some sort of bifurcation for a snapshot segment vs. a cdf segment.
Have we considered whether it might be simpler to have two builders that both produce LogSegment
but which take different sets of options? So much of the logic in the build
method is just trying to wrangle the various combinations of options, which is super error prone because it's combinatorial.
Split logic analysis
For a snapshot log segment, we have the following:
- input parameters, both optional:
- checkpoint hint (None means list from zero)
- end version (Some means time travel)
- Note: Some checkpoint hint above Some end version is very possible and should cause it to be ignored
- listing behavior
- prefer to use the checkpoint hint as a listing start (list from 0 otherwise)
- terminate the listing if a file above end version is found
- filter out incomplete checkpoints (multi-part checkpoints may be missing parts)
- Only keep the latest complete checkpoint
- Only keep log files with higher versions than the latest complete checkpoint (if any)
- output
- 0+ checkpoint part files, all the same version
- 0+ log files in contiguous descending version order
- If a checkpoint is present, it is the next version below the lowest log file version
For a CDF log segment we have the following:
- input parameters
- start version (required)
- end version (optional)
- Note: Some end version below the start version is an error and should fail the
build()
- listing behavior
- use the start version as listing start
- terminate the listing if a file above end version is found
- output
- end-start+1 log files in contiguous ascending version order
A nice feature of the split is that the number of possible scenarios, and possible error cases, is vastly reduced. Fewer possible wrong combos means fewer checks to implement means fewer checks that could be implemented incorrectly.
I even start to wonder if we need builders at all, vs. two log segment "constructor" methods?
impl LogSegment {
pub(crate) fn for_snapshot(
fs_client: &dyn FileSystemClient,
table_root: &Url,
checkpoint_hint: Option<_>,
time_travel_version: Option<Version>,
) -> DeltaResult<Self>;
pub(crate) fn for_cdf(
fs_client: &dyn FileSystemClient,
table_root: &Url,
start: Version,
end: Option<Version>,
) -> DeltaResult<Self>;
// private internal constructor that validates the files it is given
fn try_new(checkopint_files: Vec<_>, commit_files: Vec<_>) -> DeltaResult<Self>;
This all suggests that we DO want two independent builders, backed by some common helper methods:
- lower-bounded listing: (lower bound, Option) -> Iterator
- log segment constructor that verifies its files are gap-free
- Consumers of a log segment who need reverse-ordered files (vs. whatever we decide is the "natural" order) should probably just use Iterator::rev?
The main redundancy at that point is that both builders accept an optional upper bound version... and even they are slightly different concepts (tho the effect on listing behavior implementation is the same in practice).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @scovich this is great thanks for the thorough ideas here - do you think I can take this as a follow-up after this gets merged just to unblock @OussamaSaoudi-db's CDF work? I was thinking of something of this sort but ultimately didn't want to derail a ton an spend more time on this. Hopefully these changes will just require only minor modifications to how LogSegment
s are created so seems like I could work on this as @OussamaSaoudi-db moves on.
Regarding the actual implementation for this: I think I totally agree on everything here - but additionally, I'm wondering if we want two different kinds of LogSegment
s. We know that a LogSegment
is some continuous chunk of commits/checkpoint but we have different invariants based on the kind right? CDF LogSegment
s only ever have commits, whereas snapshot LogSegment
s have either commits/checkpoint or only commits. and ordering could/should be different between them?
We could still have a unified LogSegment
with sorted_commit_files
, checkpoint_parts
, and commit_file_sort_order
which I think would capture both, but want to bring up the possibility of an enum LogSegment
which has LogSegment::CDF
and LogSegment::Snapshot
? but maybe that's getting overly specific/specialized and we want to keep it more general? curious to hear your thoughts :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue here: #476
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and yea I agree reversing the vec on consumer side is probably better - just read that part
kernel/src/log_segment.rs
Outdated
pub(crate) fn with_start_checkpoint(mut self, start_checkpoint: CheckpointMetadata) -> Self { | ||
self.start_checkpoint = Some(start_checkpoint); | ||
self | ||
} | ||
/// Optionally provide a checkpoint metadata to start the log segment. See [`LogSegmentBuilder::with_start_checkpoint`] | ||
/// for details. If `start_checkpoint` is `None`, this is a no-op. | ||
/// | ||
/// Note: Either `start_version` or `start_checkpoint` may be specified. Attempting to build a [`LogSegment`] | ||
/// with both will result in an error. | ||
pub(crate) fn with_start_checkpoint_opt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should have one "right way" of specifying options and stick to it, rather than providing multiple versions of the same method. Two ideas gleaned from other similar situations:
- One method that takes
Option<Foo>
-- it's easier for unconditional callers to wrap the arg inSome
than it is for conditional callers to break the builder chain withif let Some(...)
control flow. - One method that takes
impl Into<Option<Foo>>
and callers can choose what to pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the impl Into<Option<Foo>>
approach!
/// The commit files are guaranteed to be sorted in ascending order by version. The elements of | ||
/// `checkpoint_parts` are all the parts of the same checkpoint. Checkpoint parts share the same | ||
/// version. | ||
fn list_log_files_with_version( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was just moved (unchanged) from snapshot.rs?
Same for list_log_files_with_checkpoint
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list_log_files_with_checkpoint
and list_log_files_with_version
actually had much of the same code twice. I've merged the two so that you list and process files in one place.
Co-authored-by: Zach Schuermann <[email protected]>
…kernel-rs into list_log_cleanup
pub(crate) fn new() -> Self { | ||
LogSegmentBuilder { | ||
start_checkpoint: None, | ||
start_version: None, | ||
end_version: None, | ||
sort_commit_files_ascending: false, | ||
omit_checkpoint_files: false, | ||
} | ||
Self::default() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need a new
method at all? I get mixed messages from other rust code, e.g. Vec::new
is the same as Vec::default
(same for HashMap
).
/// Note: Either `start_version` or `start_checkpoint` may be specified. Attempting to build a [`LogSegment`] | ||
/// with both will result in an error. | ||
pub(crate) fn with_start_checkpoint_opt( | ||
pub(crate) fn with_start_checkpoint( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit?
pub(crate) fn with_start_checkpoint( | |
pub(crate) fn with_checkpoint_hint( |
…hanges (#495) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-incubator/delta-kernel-rs/blob/main/CONTRIBUTING.md 2. Run `cargo t --all-features --all-targets` to get started testing, and run `cargo fmt`. 3. Ensure you have added or run the appropriate tests for your PR. 4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 5. Be sure to keep the PR description updated to reflect all changes. --> ## What changes are proposed in this pull request? <!-- Please clarify what changes you are proposing and why the changes are needed. The purpose of this section is to outline the changes, why they are needed, and how this PR fixes the issue. If the reason for the change is already explained clearly in an issue, then it does not need to be restated here. 1. If you propose a new API or feature, clarify the use case for a new API or feature. 2. If you fix a bug, you can clarify why it is a bug. --> This introduces two methods to construct `LogSegment`. The first is constructing LogSegment for Snapshots using `LogSegment::for_snapshot`. The second constructs LogSegment for the upcoming `TableChanges` type. This PR also refactors log listing functions to reduce duplication in the code. We do so by creating a function `get_parsed_log_files_iter` to list, filter, and parse log files. This adds a test function to `test-utils` called `delta_path_for_multipart_checkpoint`. This function can be used to create a multipart checkpoint path. This replaces the changes proposed in #457 <!-- Uncomment this section if there are any changes affecting public APIs: ### This PR affects the following public APIs If there are breaking changes, please ensure the `breaking-changes` label gets added by CI, and describe why the changes are needed. Note that _new_ public APIs are not considered breaking. --> ## How was this change tested? <!-- Please make sure to add test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested, ideally via a reproducible test documented in the PR description. --> This change introduces tests for the following: - reading log with out of date checkpoint hint - reading log with up to date checkpoint hint - creating snapshot log segment without a checkpoint hint - Creating snapshot with a multi-part checkpoint - Multipart checkpoint with incorrect number of parts fails. - creating snapshot with a start checkpoint and an end time travel version - Creating a snapshot with a checkpoint hint higher than the time travel version - Creating log segments for table changes - Checking contiguity of the log is always preserved. - Checking that `for_table_changes` fails when the start version > end_version This PR also adds an ignored test that checks for desired behaviour. The test `build_snapshot_with_missing_checkpoint_part_no_hint` checks that an incomplete checkpoint is not used in a LogSegment. A checkpoint is incomplete if it does not have all the parts specified in `LogPathFileType::MultiPartCheckpoint.num_parts`. --------- Co-authored-by: Ryan Johnson <[email protected]> Co-authored-by: Zach Schuermann <[email protected]>
done in #495 |
What changes are proposed in this pull request?
This pull request builds on #438 by creating a
LogSegmentBuilder
. This moves all logic for building aLogSegment
out ofSnapshot
. This change is made in anticipation ofTableChanges
, which will represent CDFs and must construct its ownLogSegment
.The builder allows you to specify the following:
LogSegment
.How was this change tested?
New tests are added to check the following:
with_omit_checkpoint_files
is specified.