Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LogSegmentBuilder to construct LogSegments #457

Closed

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Nov 6, 2024

What changes are proposed in this pull request?

This pull request builds on #438 by creating a LogSegmentBuilder. This moves all logic for building a LogSegment out of Snapshot. This change is made in anticipation of TableChanges, which will represent CDFs and must construct its own LogSegment.

The builder allows you to specify the following:

  • A starting checkpoint
  • The order of the commit files
  • The start version
  • The end version
  • Whether checkpoint files are omitted from the LogSegment.

How was this change tested?

New tests are added to check the following:

  1. Speciying start_version and start_checkpoint results in an error
  2. Non contiguous commit files should result in an error
  3. commit file ordering is correct in ascending and descending cases
  4. Checkpoint files are correctly omitted if with_omit_checkpoint_files is specified.
  5. start and end version work.

@OussamaSaoudi-db OussamaSaoudi-db changed the title List log cleanup [WIP] Add LogSegmentBuilder to construct LogSegments Nov 6, 2024
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Nov 6, 2024
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 92.15686% with 36 lines in your changes missing coverage. Please review.

Project coverage is 79.97%. Comparing base (bb1be88) to head (f8457d9).

Files with missing lines Patch % Lines
kernel/src/log_segment.rs 92.47% 24 Missing and 10 partials ⚠️
kernel/src/snapshot.rs 71.42% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #457      +/-   ##
==========================================
+ Coverage   79.60%   79.97%   +0.37%     
==========================================
  Files          57       57              
  Lines       12330    12585     +255     
  Branches    12330    12585     +255     
==========================================
+ Hits         9815    10065     +250     
- Misses       1987     1993       +6     
+ Partials      528      527       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with one change you should make

kernel/src/log_segment.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few final comments - can we also add a test with a start_checkpoint and an end_version

kernel/src/log_segment.rs Show resolved Hide resolved
kernel/src/log_segment.rs Show resolved Hide resolved
kernel/src/log_segment.rs Outdated Show resolved Hide resolved
kernel/src/log_segment.rs Outdated Show resolved Hide resolved
@zachschuermann
Copy link
Collaborator

Oh and note that LogSegmentBuilder is pub(crate) with developer-visibility pub but everything in it is just pub(crate) maybe we just make it all only pub(crate) for now?

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for taking some time for this! did we add test from above for a start_checkpoint and an end_version check?

kernel/src/log_segment.rs Outdated Show resolved Hide resolved
/// 2. Commit file versions will be greater than or equal to `start_version` if specified.
/// 3. If checkpoint(s) is/are present in the range, only commits with versions greater than the most
/// recent checkpoint version are retained. Checkpoints can be omitted (and this rule skipped)
/// whenever the `LogSegment` is created. See `LogSegmentBuilder::with_omit_checkpoint_files`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't exactly roll off the tongue... should be call it without_checkpoint_files?

/// - For a Snapshot at version `n`: Its LogSegment is made up of zero or one checkpoint, and all
/// commits between the checkpoint and the end version `n`.
/// - For a TableChanges between versions `a` and `b`: Its LogSegment is made up of zero
/// checkpoints and all commits between versions `a` and `b`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// checkpoints and all commits between versions `a` and `b`
/// checkpoints and all commits between versions `a` and `b` (inclusive)

/// checkpoint files in the log segment.
pub checkpoint_files: Vec<ParsedLogPath>,
/// Checkpoint files in the log segment.
pub checkpoint_parts: Vec<ParsedLogPath>,
}

impl LogSegment {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: can we rename LogSegment::replay to something else? I don't think the name was ever super accurate ("replay" jargon in Delta normally implies dedup logic that is missing here). Really this code is just producing a chunked iterator of log actions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good - let me take this in my follow up :)

commit_read_schema,
meta_predicate.clone(),
)?
.read_json_files(&commit_files, commit_read_schema, meta_predicate.clone())?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, I'm surprised these read_xxx_files methods didn't already take FileMeta. Do we have a tracking item to consider changing that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these take a slice of FileMeta (location is a FileMeta - perhaps we need a rename there?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhhh! I worried for a moment that we were passing just file names. You're right tho, location is a FileMeta now in this PR.

Comment on lines 144 to 152
pub(crate) fn new() -> Self {
LogSegmentBuilder {
start_checkpoint: None,
start_version: None,
end_version: None,
sort_commit_files_ascending: false,
omit_checkpoint_files: false,
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just #[derive(Default)]?

Comment on lines +155 to +156
/// Note: Either `start_version` or `start_checkpoint` may be specified. Attempting to build a [`LogSegment`]
/// with both will result in an error.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the builder model, is it considered better to catch errors when injected to the builder? Or to wait until the end and check everything at once?

Copy link
Collaborator

@scovich scovich Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, @zachschuermann had a good point IMO:

I'd like to reconsider having some sort of bifurcation for a snapshot segment vs. a cdf segment.

Have we considered whether it might be simpler to have two builders that both produce LogSegment but which take different sets of options? So much of the logic in the build method is just trying to wrangle the various combinations of options, which is super error prone because it's combinatorial.

Split logic analysis

For a snapshot log segment, we have the following:

  • input parameters, both optional:
    • checkpoint hint (None means list from zero)
    • end version (Some means time travel)
    • Note: Some checkpoint hint above Some end version is very possible and should cause it to be ignored
  • listing behavior
    • prefer to use the checkpoint hint as a listing start (list from 0 otherwise)
    • terminate the listing if a file above end version is found
    • filter out incomplete checkpoints (multi-part checkpoints may be missing parts)
    • Only keep the latest complete checkpoint
    • Only keep log files with higher versions than the latest complete checkpoint (if any)
  • output
    • 0+ checkpoint part files, all the same version
    • 0+ log files in contiguous descending version order
    • If a checkpoint is present, it is the next version below the lowest log file version

For a CDF log segment we have the following:

  • input parameters
    • start version (required)
    • end version (optional)
    • Note: Some end version below the start version is an error and should fail the build()
  • listing behavior
    • use the start version as listing start
    • terminate the listing if a file above end version is found
  • output
    • end-start+1 log files in contiguous ascending version order

A nice feature of the split is that the number of possible scenarios, and possible error cases, is vastly reduced. Fewer possible wrong combos means fewer checks to implement means fewer checks that could be implemented incorrectly.

I even start to wonder if we need builders at all, vs. two log segment "constructor" methods?

impl LogSegment {
   pub(crate) fn for_snapshot(
        fs_client: &dyn FileSystemClient,
        table_root: &Url,
        checkpoint_hint: Option<_>,
        time_travel_version: Option<Version>,
    ) -> DeltaResult<Self>;

    pub(crate) fn for_cdf(
        fs_client: &dyn FileSystemClient,
        table_root: &Url,
        start: Version,
        end: Option<Version>,
    ) -> DeltaResult<Self>;

    // private internal constructor that validates the files it is given
    fn try_new(checkopint_files: Vec<_>, commit_files: Vec<_>) -> DeltaResult<Self>;

This all suggests that we DO want two independent builders, backed by some common helper methods:

  • lower-bounded listing: (lower bound, Option) -> Iterator
  • log segment constructor that verifies its files are gap-free
  • Consumers of a log segment who need reverse-ordered files (vs. whatever we decide is the "natural" order) should probably just use Iterator::rev?

The main redundancy at that point is that both builders accept an optional upper bound version... and even they are slightly different concepts (tho the effect on listing behavior implementation is the same in practice).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @scovich this is great thanks for the thorough ideas here - do you think I can take this as a follow-up after this gets merged just to unblock @OussamaSaoudi-db's CDF work? I was thinking of something of this sort but ultimately didn't want to derail a ton an spend more time on this. Hopefully these changes will just require only minor modifications to how LogSegments are created so seems like I could work on this as @OussamaSaoudi-db moves on.

Regarding the actual implementation for this: I think I totally agree on everything here - but additionally, I'm wondering if we want two different kinds of LogSegments. We know that a LogSegment is some continuous chunk of commits/checkpoint but we have different invariants based on the kind right? CDF LogSegments only ever have commits, whereas snapshot LogSegments have either commits/checkpoint or only commits. and ordering could/should be different between them?

We could still have a unified LogSegment with sorted_commit_files, checkpoint_parts, and commit_file_sort_order which I think would capture both, but want to bring up the possibility of an enum LogSegment which has LogSegment::CDF and LogSegment::Snapshot? but maybe that's getting overly specific/specialized and we want to keep it more general? curious to hear your thoughts :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue here: #476

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and yea I agree reversing the vec on consumer side is probably better - just read that part

Comment on lines 158 to 167
pub(crate) fn with_start_checkpoint(mut self, start_checkpoint: CheckpointMetadata) -> Self {
self.start_checkpoint = Some(start_checkpoint);
self
}
/// Optionally provide a checkpoint metadata to start the log segment. See [`LogSegmentBuilder::with_start_checkpoint`]
/// for details. If `start_checkpoint` is `None`, this is a no-op.
///
/// Note: Either `start_version` or `start_checkpoint` may be specified. Attempting to build a [`LogSegment`]
/// with both will result in an error.
pub(crate) fn with_start_checkpoint_opt(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should have one "right way" of specifying options and stick to it, rather than providing multiple versions of the same method. Two ideas gleaned from other similar situations:

  • One method that takes Option<Foo> -- it's easier for unconditional callers to wrap the arg in Some than it is for conditional callers to break the builder chain with if let Some(...) control flow.
  • One method that takes impl Into<Option<Foo>> and callers can choose what to pass

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the impl Into<Option<Foo>> approach!

/// The commit files are guaranteed to be sorted in ascending order by version. The elements of
/// `checkpoint_parts` are all the parts of the same checkpoint. Checkpoint parts share the same
/// version.
fn list_log_files_with_version(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was just moved (unchanged) from snapshot.rs?
Same for list_log_files_with_checkpoint below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_log_files_with_checkpoint and list_log_files_with_version actually had much of the same code twice. I've merged the two so that you list and process files in one place.

Comment on lines 144 to 145
pub(crate) fn new() -> Self {
LogSegmentBuilder {
start_checkpoint: None,
start_version: None,
end_version: None,
sort_commit_files_ascending: false,
omit_checkpoint_files: false,
}
Self::default()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need a new method at all? I get mixed messages from other rust code, e.g. Vec::new is the same as Vec::default (same for HashMap).

/// Note: Either `start_version` or `start_checkpoint` may be specified. Attempting to build a [`LogSegment`]
/// with both will result in an error.
pub(crate) fn with_start_checkpoint_opt(
pub(crate) fn with_start_checkpoint(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit?

Suggested change
pub(crate) fn with_start_checkpoint(
pub(crate) fn with_checkpoint_hint(

OussamaSaoudi-db added a commit that referenced this pull request Nov 15, 2024
…hanges (#495)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-incubator/delta-kernel-rs/blob/main/CONTRIBUTING.md
2. Run `cargo t --all-features --all-targets` to get started testing,
and run `cargo fmt`.
  3. Ensure you have added or run the appropriate tests for your PR.
4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  5. Be sure to keep the PR description updated to reflect all changes.
-->

## What changes are proposed in this pull request?
<!--
Please clarify what changes you are proposing and why the changes are
needed.
The purpose of this section is to outline the changes, why they are
needed, and how this PR fixes the issue.
If the reason for the change is already explained clearly in an issue,
then it does not need to be restated here.
1. If you propose a new API or feature, clarify the use case for a new
API or feature.
  2. If you fix a bug, you can clarify why it is a bug.
-->
This introduces two methods to construct `LogSegment`. The first is
constructing LogSegment for Snapshots using `LogSegment::for_snapshot`.
The second constructs LogSegment for the upcoming `TableChanges` type.

This PR also refactors log listing functions to reduce duplication in
the code. We do so by creating a function `get_parsed_log_files_iter` to
list, filter, and parse log files.

This adds a test function to `test-utils` called
`delta_path_for_multipart_checkpoint`. This function can be used to
create a multipart checkpoint path.

This replaces the changes proposed in #457
<!--
Uncomment this section if there are any changes affecting public APIs:
### This PR affects the following public APIs

If there are breaking changes, please ensure the `breaking-changes`
label gets added by CI, and describe why the changes are needed.

Note that _new_ public APIs are not considered breaking.
-->


## How was this change tested?
<!--
Please make sure to add test cases that check the changes thoroughly
including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please
clarify how you tested, ideally via a reproducible test documented in
the PR description.
-->
This change introduces tests for the following:
- reading log with out of date checkpoint hint
- reading log with up to date checkpoint hint
- creating snapshot log segment without a checkpoint hint
- Creating snapshot with a multi-part checkpoint
- Multipart checkpoint with incorrect number of parts fails.
- creating snapshot with a start checkpoint and an end time travel
version
- Creating a snapshot with a checkpoint hint higher than the time travel
version
- Creating log segments for table changes
- Checking contiguity of the log is always preserved.
- Checking that `for_table_changes` fails when the start version >
end_version

This PR also adds an ignored test that checks for desired behaviour. The
test `build_snapshot_with_missing_checkpoint_part_no_hint` checks that
an incomplete checkpoint is not used in a LogSegment. A checkpoint is
incomplete if it does not have all the parts specified in
`LogPathFileType::MultiPartCheckpoint.num_parts`.

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
@zachschuermann
Copy link
Collaborator

done in #495

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants