Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve append #46

Merged
merged 33 commits into from
Oct 31, 2024
Merged

Improve append #46

merged 33 commits into from
Oct 31, 2024

Conversation

JanKaul
Copy link
Owner

@JanKaul JanKaul commented Oct 24, 2024

This PR improves the Append Operation in that it selects only a single manifest file to append the new datafiles to. In case the number of datafiles exceeds a certain threshold, the manifest file is split into smaller manifest files.

This should improve insert performance, as well as the overall structure of the manifest_list/manifest tree.

iceberg-rust/src/table/transaction/append.rs Outdated Show resolved Hide resolved
iceberg-rust/Cargo.toml Show resolved Hide resolved
iceberg-rust/src/util/mod.rs Outdated Show resolved Hide resolved
iceberg-rust/src/util/mod.rs Outdated Show resolved Hide resolved
iceberg-rust/src/table/transaction/append.rs Outdated Show resolved Hide resolved
iceberg-rust/src/table/transaction/append.rs Outdated Show resolved Hide resolved
iceberg-rust/Cargo.toml Show resolved Hide resolved
iceberg-rust/src/table/transaction/append.rs Outdated Show resolved Hide resolved
iceberg-rust/src/table/transaction/append.rs Outdated Show resolved Hide resolved
iceberg-rust/src/table/transaction/append.rs Outdated Show resolved Hide resolved
}

/// Splits the datafiles *n_split* times to decrease the number of datafiles per maniefst. 1 split returns 2 outputs vectors, 2 splits return 4, 3 splits return 8 and so on.
pub(crate) fn split_datafiles(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this should be unit tested as well.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating mocked data for a unit test is quite difficult here. I expanded a datafusion test to cover the manifest splitting case.

iceberg-rust-spec/src/spec/values.rs Outdated Show resolved Hide resolved
iceberg-rust/src/util/mod.rs Outdated Show resolved Hide resolved
iceberg-rust-spec/src/spec/values.rs Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operation.execute() is too big, which makes it hard to read the change. Could you refactor it?

datafusion_iceberg/src/table.rs Outdated Show resolved Hide resolved
@JanKaul JanKaul merged commit f2fff11 into main Oct 31, 2024
1 check passed
@JanKaul JanKaul deleted the improve-append branch October 31, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants