Support (order by / sort) for DataFrameWriteOptions #13874
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -74,8 +74,16 @@ impl DataFrame { | |
|
||
let file_type = format_as_file_type(format); | ||
|
||
+ let plan = if options.sort_by.is_empty() { | ||
+     self.plan | ||
+ } else { | ||
+     LogicalPlanBuilder::from(self.plan) | ||
+         .sort(options.sort_by)? | ||
+         .build()? | ||
+ }; | ||
|
||
Review comments on this change:

Shouldn't

I think copy_to inserts the data from whatever the input plan is (if there is a redundant sort, the optimizer will remove it). Is that what you are asking? How to communicate "this table is sorted" information for newly written files is an interesting question. Is this what you are talking about?

Yes, I mean that after this PR it seems we don't yet communicate "we are sorted", which I think would support use cases such as the one in the issue from @TheBuilderJR (write data sorted on a column, making a following read query with

Thank you @alamb @Dandandan for the review. I am also interested in how we can communicate "this table is sorted" information for newly written files. I will also investigate this.

I think the ordering information is normally handled by a higher-level "catalog" rather than the Parquet format itself.
This is the only thing I know of in Parquet, but I don't think it can describe the ordering. Maybe this is something that https://iceberg.apache.org/ could represent 🤔 It is also conceivable that DataFusion itself could write custom metadata with the ordering in Parquet and other formats that support custom metadata, but that seems like we would just be reinventing Iceberg and similar table formats.
||
let plan = LogicalPlanBuilder::copy_to( | ||
- self.plan, | |
+ plan, | |
path.into(), | ||
file_type, | ||
Default::default(), | ||
|
this is a good test 👍
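
For readers skimming the diff, the shape of the change can be sketched without DataFusion at all. The block below is a toy model, not the PR's code: `Plan`, `SortKey`, `WriteOptions`, and `plan_write` are hypothetical stand-ins for the logical plan, the sort expressions, `DataFrameWriteOptions`, and the write method. It only illustrates the pattern in the diff: wrap the input in a sort node when `sort_by` is non-empty, then hand the result to the copy-to node.

```rust
// Toy stand-ins for DataFusion's types. Every name here is hypothetical
// and exists only to illustrate the control flow shown in the diff.

/// One sort key: a column name and a direction.
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: String,
    pub ascending: bool,
}

/// A tiny logical plan with just the nodes this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String },
    Sort { keys: Vec<SortKey>, input: Box<Plan> },
    CopyTo { path: String, input: Box<Plan> },
}

/// Stand-in for the write options; an empty `sort_by` means "do not sort".
#[derive(Debug, Default)]
pub struct WriteOptions {
    pub sort_by: Vec<SortKey>,
}

/// Build the write plan: sort the input only when sort keys were given,
/// then wrap whatever plan results in a `CopyTo` node.
pub fn plan_write(input: Plan, options: WriteOptions, path: &str) -> Plan {
    let input = if options.sort_by.is_empty() {
        // No ordering requested: pass the input plan through unchanged.
        input
    } else {
        // Ordering requested: insert a sort node above the input plan.
        Plan::Sort {
            keys: options.sort_by,
            input: Box::new(input),
        }
    };
    Plan::CopyTo {
        path: path.to_string(),
        input: Box::new(input),
    }
}
```

In this model, calling `plan_write` with default options yields a `CopyTo` directly over the scan, while passing one or more sort keys yields a `CopyTo` over a `Sort` over the scan. Nothing in the resulting plan records that the written file is sorted, which is the gap the review thread discusses.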