Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add
ParquetAccessPlan
, unify RowGroup selection and PagePruning selection #10738Add
ParquetAccessPlan
, unify RowGroup selection and PagePruning selection #10738Changes from 2 commits
3bd9b04
aed543e
6019e07
52f4d39
8d44ed2
a76f95a
8fd9983
e2dec61
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm. I don't think this is correct. Unless all the RowGroupAccess is
Skip
we can simply return none here. Otherwise, we should still build RowSelection for theScan
Access.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question.
I think the reasoning here is that actually correct because any RowGroupAccess that is
Skip
is filtered by skipping the row groups itself (and thus there are no rows to select).I will improve the documentation and add some tests that show how this works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about all the RowGroupAccess is Scan then(which means all the row group should be selected)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same reasoning applies there (that the entire rowgroup is scanned)
The intuition is that entire row groups are filtered out using
Skip
andScan
-- an overall RowSelection is only useful if there is any parts within a row group which can be filtered out.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clarify with comments and some test updates in 8d44ed2 -- let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I think the example and the following code in L214 - L218 is quite misleading.
BTW, I did follow the code in /path/to/arrow-rs/src/index.crates.io-6f17d22bba15001f/parquet-51.0.0/src/arrow/async_reader/mod.rs:601's poll_next method, namely the following code:
I'm not sure I understand the code correctly, but it looks to me that the
row_selection
only consider one row group per parquet file? Otherwise, I believe the row_count(or the split_off point) should be the row count has been accumulated with all the previous row groups?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat confused too -- I will look into this more carefully
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is correct
I have updated the example in a76f95a to more accurately reflect what is going on, which I think makes the behavior clearer
I will also make a PR to the arrow repo trying to clarify with an example as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apache/arrow-rs#5850 to further improve the parquet crate docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW this so subtle it turns out my unit tests don't get the edge cases correct. I added error checking and more tests as part of #10813
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these tests because I found the existing tests don't cover the case where there is a mix of scan some entire row groups and only scan the rows of another (happens when the statistics can't be extracted or there is some error evaluating the page pruning predicate only on some row groups)