Improve append #46
}

/// Splits the datafiles *n_split* times to decrease the number of datafiles per manifest. 1 split returns 2 output vectors, 2 splits return 4, 3 splits return 8, and so on.
pub(crate) fn split_datafiles(
Ideally, this should be unit tested as well.
Creating mocked data for a unit test is quite difficult here. I expanded a datafusion test to cover the manifest splitting case.
`Operation.execute()` is too big, which makes it hard to read the change. Could you refactor it?
This PR improves the Append operation so that it selects only a single manifest file to append the new datafiles to. If the number of datafiles exceeds a certain threshold, the manifest file is split into smaller manifest files.
This should improve insert performance, as well as the overall structure of the manifest_list/manifest tree.
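
To illustrate the splitting behavior described in the doc comment (each split doubles the number of output vectors), here is a minimal sketch. It is not the PR's implementation. `split_n_times` and `splits_needed` are hypothetical helpers that work on a generic slice and halve it by position. The `limit` parameter stands in for the threshold mentioned above, whose actual value and partitioning rule are not shown here.

```rust
/// Hypothetical helper: halves every chunk `n_split` times,
/// producing 2^n_split output vectors (some may be empty for tiny inputs).
fn split_n_times<T: Clone>(items: &[T], n_split: u32) -> Vec<Vec<T>> {
    let mut chunks: Vec<Vec<T>> = vec![items.to_vec()];
    for _ in 0..n_split {
        chunks = chunks
            .into_iter()
            .flat_map(|chunk| {
                // Split each chunk at its midpoint into a left and right half.
                let mid = chunk.len() / 2;
                let (left, right) = chunk.split_at(mid);
                vec![left.to_vec(), right.to_vec()]
            })
            .collect();
    }
    chunks
}

/// Hypothetical helper: smallest number of splits such that every
/// chunk ends up with at most `limit` items (`limit` is an assumed threshold).
fn splits_needed(len: usize, limit: usize) -> u32 {
    assert!(limit > 0, "limit must be positive");
    let mut n = 0;
    let mut largest = len;
    while largest > limit {
        // After a midpoint split, the larger half holds ceil(largest / 2) items.
        largest = (largest + 1) / 2;
        n += 1;
    }
    n
}

fn main() {
    // Stand-in for a list of datafiles.
    let files: Vec<u32> = (0..10).collect();
    let n = splits_needed(files.len(), 3);
    let chunks = split_n_times(&files, n);
    println!("{} splits -> {} chunks", n, chunks.len());
}
```

A positional split like this is also easy to unit test with plain integers, which sidesteps the need for mocked datafiles. Whether that carries over to the real function depends on how `split_datafiles` chooses its split points.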