Add `ParquetAccessPlan`, unify RowGroup selection and PagePruning selection #10738

alamb · 2024-05-31T12:03:00Z

Which issue does this PR close?

This is part of #9929, a way to provide row selections from an outside index to the parquet reader

Rationale for this change

My high-level plan, which you can see in action in #10701, is that the ParquetAccessPlan is the structure that users pass to the parquet reader to select row groups and pages.

The idea is that a user can provide a starting ParquetAccessPlan, which can then be further refined by the parquet reader by reading file metadata, if needed.
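To make the idea concrete, here is a minimal, std-only sketch of such a plan. It is a simplified stand-in, not the DataFusion type: `AccessPlanSketch`, `new_all`, `skip`, and `select` are hypothetical names, and a row selection is modeled as plain `(row_count, selected)` pairs. Only the `Skip` / `Scan` / `Selection` variants and `row_group_indexes` mirror names that appear in this PR.

```rust
/// How a single row group should be read (simplified stand-in).
#[derive(Debug, Clone, PartialEq)]
pub enum RowGroupAccess {
    /// Do not read the row group at all.
    Skip,
    /// Read every row in the row group.
    Scan,
    /// Read only some rows; each pair is (row_count, selected).
    Selection(Vec<(usize, bool)>),
}

/// Hypothetical plan: one `RowGroupAccess` per row group in the file.
#[derive(Debug, Clone, PartialEq)]
pub struct AccessPlanSketch {
    row_groups: Vec<RowGroupAccess>,
}

impl AccessPlanSketch {
    /// Start by scanning all `n` row groups.
    pub fn new_all(n: usize) -> Self {
        Self { row_groups: vec![RowGroupAccess::Scan; n] }
    }

    /// Mark a row group as skipped (by the caller or by the reader).
    pub fn skip(&mut self, idx: usize) {
        self.row_groups[idx] = RowGroupAccess::Skip;
    }

    /// Narrow a row group to a row selection; a skipped group stays skipped.
    pub fn select(&mut self, idx: usize, runs: Vec<(usize, bool)>) {
        if self.row_groups[idx] != RowGroupAccess::Skip {
            self.row_groups[idx] = RowGroupAccess::Selection(runs);
        }
    }

    /// Indexes of the row groups that are read at all (Scan or Selection).
    pub fn row_group_indexes(&self) -> Vec<usize> {
        self.row_groups
            .iter()
            .enumerate()
            .filter(|(_, access)| !matches!(access, RowGroupAccess::Skip))
            .map(|(idx, _)| idx)
            .collect()
    }
}
```

In this sketch a caller with an external index would build the starting plan (for example, skipping row group 1), and the reader would then call `skip` or `select` again as it learns more from the file metadata.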

What changes are included in this PR?

1. Split out the representation of which RowGroups to scan from the code to prune them (RowGroupBuilder)
2. Update the page pruning to update ParquetAccessPlan
3. Copious documentation

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

Not yet -- this is all internal. I will make a follow-on PR with a proposed API for passing this structure in.


alamb · 2024-06-01T10:59:54Z

datafusion/core/src/datasource/physical_plan/parquet/access_plan.rs

+    use std::sync::{Arc, OnceLock};
+
+    #[test]
+    fn test_overall_row_selection_only_scans() {


I added these tests because I found that the existing tests don't cover the case where some row groups are scanned entirely and only some rows of others are scanned (this happens when the statistics can't be extracted, or there is an error evaluating the page pruning predicate, for only some row groups).

alamb · 2024-06-01T11:02:28Z

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

@@ -152,8 +154,9 @@ pub use writer::plan_to_parquet;
 /// the file.
 ///
 /// * Step 3: The `ParquetOpener` gets the [`ParquetMetaData`] (file metadata)
-/// via [`ParquetFileReaderFactory`] and applies any predicates and projections
-/// to determine what pages must be read.
+/// via [`ParquetFileReaderFactory`], creating a [`ParquetAccessPlan`] by


The main change / reason for this PR is to encapsulate the decision of what row groups / what row selection to scan into a single struct (in this case ParquetAccessPlan) so that the information can be passed in from the outside as well.

In other words, ParquetAccessPlan will be the struct passed in the API described in #9929.

alamb · 2024-06-01T11:03:18Z

datafusion/core/src/datasource/physical_plan/parquet/opener.rs

                }
            }

+            let row_group_indexes = access_plan.row_group_indexes();


While the rationale is to pass in scan restrictions from outside, I actually think encapsulating the scan / skip logic in ParquetAccessPlan has made the existing logic easier to follow as well.

alamb · 2024-06-01T11:04:48Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

        }

        let page_index_predicates = &self.predicates;
        let groups = file_metadata.row_groups();

        if groups.is_empty() {
-            return Ok(None);
+            return access_plan;


I reworked this logic in two ways:

1. The loop is now for each row group (rather than for each predicate)

2. The row selection is incrementally built up for each row group rather than being created all at the end.

I think this logic is clearer and also is no less performant (the same work is still done)
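A rough sketch of the reworked loop shape, using only std types. Everything here is an assumption made for illustration: `PagePredicate`, `build_selections`, and the two demo predicates are hypothetical, and a selection is modeled as a per-row `bool` mask rather than the parquet crate's `RowSelection`.

```rust
/// Hypothetical page-pruning predicate: given (row_group_index, num_rows),
/// returns a per-row mask (true = row may match), or `None` when nothing
/// can be decided for that row group (e.g. statistics are unavailable).
pub type PagePredicate = fn(usize, usize) -> Option<Vec<bool>>;

/// Outer loop over row groups, inner loop over predicates. The selection for
/// each row group is narrowed as each predicate is applied, instead of
/// collecting everything and combining at the end.
pub fn build_selections(
    row_counts: &[usize],
    predicates: &[PagePredicate],
) -> Vec<Option<Vec<bool>>> {
    let mut per_row_group = Vec::with_capacity(row_counts.len());
    for (row_group, &num_rows) in row_counts.iter().enumerate() {
        // `None` means "no restriction yet": scan the whole row group.
        let mut selection: Option<Vec<bool>> = None;
        for predicate in predicates {
            let Some(mask) = predicate(row_group, num_rows) else {
                continue; // this predicate cannot prune this row group
            };
            selection = Some(match selection {
                None => mask,
                // Intersect with what earlier predicates already selected.
                Some(current) => current
                    .iter()
                    .zip(&mask)
                    .map(|(a, b)| *a && *b)
                    .collect(),
            });
        }
        per_row_group.push(selection);
    }
    per_row_group
}

/// Demo predicate: keeps the first half of every row group.
pub fn first_half(_row_group: usize, num_rows: usize) -> Option<Vec<bool>> {
    Some((0..num_rows).map(|i| i < num_rows / 2).collect())
}

/// Demo predicate: keeps even rows, but can only evaluate row group 0.
pub fn even_rows_in_group_zero(row_group: usize, num_rows: usize) -> Option<Vec<bool>> {
    if row_group == 0 {
        Some((0..num_rows).map(|i| i % 2 == 0).collect())
    } else {
        None
    }
}
```

The same intersections happen as in a predicate-major loop, only in a different order, which is consistent with the claim that the same work is still done.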

alamb · 2024-06-01T11:05:30Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

-///
-/// The final selection is the intersection of these  `RowSelector`s:
-/// * `final_selection:[ Skip(0~199), Read(200~249), Skip(250~299)]`
-fn combine_multi_col_selection(row_selections: Vec<Vec<RowSelector>>) -> RowSelection {


This intersection is now done in ParquetAccessPlan::scan_selection.

alamb · 2024-06-01T11:06:12Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

-pub struct RowGroupSet {
-    /// `row_groups[i]` is true if the i-th row group should be scanned
-    row_groups: Vec<bool>,
+#[derive(Debug, Clone, PartialEq)]


This extracts the plan of what to scan from the logic that prunes it based on row group statistics metadata.


alamb · 2024-06-03T21:08:17Z

@Ted-Jiang I wonder if you have time to review this PR

cc @hengfeiyang, @thinkharderdev @crepererum @waynexia and @xinlifoobar for any comments you may have

hengfeiyang

LGTM, the ParquetAccessPlan API is very clear.

alamb · 2024-06-05T10:05:46Z

Also FYI @advancedxy

crepererum

Seems sensible to have a well-designed IR (intermediate representation) for that stuff so that both the internal code and the future API users benefit from that. 👍

advancedxy

Thanks for pinging me. I left some comments in the code. The code looks much clearer overall.

advancedxy · 2024-06-05T12:54:13Z

datafusion/core/src/datasource/physical_plan/parquet/access_plan.rs

+        if !self
+            .row_groups
+            .iter()
+            .any(|rg| matches!(rg, RowGroupAccess::Selection(_)))


Hmm, I don't think this is correct. We can simply return None here only if every RowGroupAccess is Skip. Otherwise, we should still build a RowSelection for the Scan accesses.

This is a good question.

I think the code here is actually correct, because any RowGroupAccess that is Skip is handled by skipping the row group itself (and thus there are no rows to select).

I will improve the documentation and add some tests that show how this works

What about the case where every RowGroupAccess is Scan (which means all rows in every row group should be selected)?

The same reasoning applies there (the entire row group is scanned).

The intuition is that entire row groups are filtered out using Skip and Scan -- an overall RowSelection is only useful if there are parts within a row group that can be filtered out.

I tried to clarify with comments and some test updates in 8d44ed2 -- let me know what you think
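The reasoning in this thread can be sketched with simplified std-only types. This is an illustration of the rule as described here, not the code in `access_plan.rs`: `Run`, `Access`, and `overall_selection` are hypothetical names, and the example assumes 100 rows per row group.

```rust
/// One run of consecutive rows: read them or skip them.
#[derive(Debug, Clone, PartialEq)]
pub enum Run {
    Select(usize),
    Skip(usize),
}

/// Per-row-group access, mirroring Skip / Scan / Selection from the thread.
#[derive(Debug, Clone, PartialEq)]
pub enum Access {
    Skip,
    Scan,
    Selection(Vec<Run>),
}

/// Build the overall selection across the row groups that will be read.
/// `row_counts[i]` is the number of rows in row group `i`.
pub fn overall_selection(plan: &[Access], row_counts: &[usize]) -> Option<Vec<Run>> {
    assert_eq!(plan.len(), row_counts.len());
    // With only Skip and Scan, the list of row group indexes already says
    // everything, so no row-level selection is needed.
    if !plan.iter().any(|access| matches!(access, Access::Selection(_))) {
        return None;
    }
    let mut out = Vec::new();
    for (access, &rows) in plan.iter().zip(row_counts) {
        match access {
            // A skipped row group is never read, so it contributes no rows.
            Access::Skip => {}
            // In a mixed plan, a fully scanned row group must still appear,
            // as "select all of its rows".
            Access::Scan => out.push(Run::Select(rows)),
            Access::Selection(runs) => out.extend(runs.iter().cloned()),
        }
    }
    Some(out)
}
```

For a plan of `[Scan, Selection(skip 50, select 50), Skip, Selection(select 25, skip 75)]` this yields select 100, skip 50, select 50, select 25, skip 75, with no entry for the skipped row group.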

I see. I think the example and the following code in L214 - L218 are quite misleading.

BTW, I followed the code in /path/to/arrow-rs/src/index.crates.io-6f17d22bba15001f/parquet-51.0.0/src/arrow/async_reader/mod.rs:601's poll_next method, namely the following code:

    let row_count = self.metadata.row_group(row_group_idx).num_rows() as usize;
    let selection = self.selection.as_mut().map(|s| s.split_off(row_count));
    let fut = reader
        .read_row_group(
            row_group_idx,
            selection,
            self.projection.clone(),
            self.batch_size,
        )
        .boxed();

I'm not sure I understand the code correctly, but it looks to me like the row_selection only considers one row group per parquet file? Otherwise, I believe the row_count (the split_off point) should be the row count accumulated over all the previous row groups.

I am somewhat confused too -- I will look into this more carefully

> I'm not sure I understand the code correctly, but it looks to me like the row_selection only considers one row group per parquet file? Otherwise, I believe the row_count (the split_off point) should be the row count accumulated over all the previous row groups.

Yes, I think that is correct

I have updated the example in a76f95a to more accurately reflect what is going on, which I think makes the behavior clearer

I will also make a PR to the arrow repo trying to clarify with an example as well

apache/arrow-rs#5850 to further improve the parquet crate docs
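One way to see why a per-row-group `row_count` can work as the split point: as far as I understand `split_off`, it removes and returns the first `row_count` rows of the selection, so whatever remains always starts at the next row group to be read. The std-only sketch below models that behavior; `Run`, this `split_off`, and `per_row_group` are simplified stand-ins written for illustration, not the parquet crate's implementation.

```rust
/// A run of consecutive rows that is either read or skipped.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Run {
    pub rows: usize,
    pub skip: bool,
}

/// Remove and return the runs covering the first `row_count` rows, leaving
/// the remainder in `selection`.
pub fn split_off(selection: &mut Vec<Run>, row_count: usize) -> Vec<Run> {
    let mut head = Vec::new();
    let mut remaining = row_count;
    let mut whole_runs = 0; // number of runs consumed completely
    while remaining > 0 && whole_runs < selection.len() {
        let run = selection[whole_runs];
        if run.rows <= remaining {
            // The whole run belongs to the current row group.
            head.push(run);
            remaining -= run.rows;
            whole_runs += 1;
        } else {
            // The run straddles the boundary: split it in two.
            head.push(Run { rows: remaining, skip: run.skip });
            selection[whole_runs].rows -= remaining;
            remaining = 0;
        }
    }
    selection.drain(..whole_runs);
    head
}

/// Carve the overall selection into one piece per row group that is read.
/// `read_row_counts` lists the row counts of those row groups, in order.
pub fn per_row_group(mut overall: Vec<Run>, read_row_counts: &[usize]) -> Vec<Vec<Run>> {
    read_row_counts
        .iter()
        .map(|&row_count| split_off(&mut overall, row_count))
        .collect()
}
```

Because each call consumes the front of the selection, no running total of previous row counts is needed. It also shows why the overall selection must contain entries only for row groups that are read: an entry for a skipped row group would shift every later piece.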

BTW this is so subtle that it turns out my unit tests don't get the edge cases correct. I added error checking and more tests as part of #10813


advancedxy

LGTM now, except for one minor comment.

advancedxy · 2024-06-05T16:30:29Z

datafusion/core/src/datasource/physical_plan/parquet/access_plan.rs

+    ///
+    /// If there are no [`RowGroupAccess::Selection`]s, the overall row
+    /// selection is `None` because each row group is either entirely skipped or
+    /// scanned, as specified by [`Self::row_group_indexes`].


Nit: how about adding a comment about the mixed RowGroupAccess case?

    /// If there are no [`RowGroupAccess::Selection`]s, the overall row
    /// selection is `None` because each row group is either entirely skipped or
    /// scanned, as specified by [`Self::row_group_indexes`]. If the RowGroupAccess::Scan
    /// is also included in the list, the entire row group selection should be populated as well.

I think that matches the comment and the following code.

Excellent idea - I added more comments and examples in a76f95a. Thank you

Ted-Jiang · 2024-06-06T02:41:55Z

Thanks for pinging me, I will review this carefully later.

Ted-Jiang

LGTM 👍 This looks really well-designed and tidy. Now the logic seems much clearer ❤️

Ted-Jiang · 2024-06-06T03:35:13Z

datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs

+) -> Option<RowSelection> {
+    match current_selection {
+        None => Some(row_selection),
+        Some(current_selection) => Some(current_selection.intersection(&row_selection)),


@alamb maybe we can check whether, after the intersection, the resulting row_selection selects 0 rows? 🤔 Then we can convert it to a skip. IMO a select-0 selection seems to need more effort in the next round of checks.

Good idea. Thank you @Ted-Jiang, I made this change in e2dec61
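A small std-only sketch of this suggestion: intersect two run-length selections, and if the result selects no rows, treat the row group as skipped. All names (`Run`, `Access`, `intersection`, `update_selection`, `to_access`) are hypothetical stand-ins, and the sketch assumes both selections cover the same number of rows.

```rust
/// A run of consecutive rows that is either read or skipped.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Run {
    pub rows: usize,
    pub skip: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Access {
    Skip,
    Selection(Vec<Run>),
}

/// Rows are kept only where both inputs select them.
pub fn intersection(a: &[Run], b: &[Run]) -> Vec<Run> {
    let mut out: Vec<Run> = Vec::new();
    let (mut i, mut j) = (0, 0);
    // Rows left in the current run of each input.
    let mut left_a = a.first().map_or(0, |r| r.rows);
    let mut left_b = b.first().map_or(0, |r| r.rows);
    while i < a.len() && j < b.len() {
        let n = left_a.min(left_b);
        let skip = a[i].skip || b[j].skip;
        if n > 0 {
            // Merge with the previous output run when the kind matches.
            let merge_with_last = out.last().map_or(false, |last| last.skip == skip);
            if merge_with_last {
                let last = out.len() - 1;
                out[last].rows += n;
            } else {
                out.push(Run { rows: n, skip });
            }
        }
        left_a -= n;
        left_b -= n;
        if left_a == 0 {
            i += 1;
            left_a = a.get(i).map_or(0, |r| r.rows);
        }
        if left_b == 0 {
            j += 1;
            left_b = b.get(j).map_or(0, |r| r.rows);
        }
    }
    out
}

/// Same shape as the helper in the diff above: the first selection is taken
/// as is, later ones are intersected with the current one.
pub fn update_selection(current: Option<Vec<Run>>, new: Vec<Run>) -> Option<Vec<Run>> {
    match current {
        None => Some(new),
        Some(current) => Some(intersection(&current, &new)),
    }
}

/// If nothing is selected, the whole row group can simply be skipped.
pub fn to_access(selection: Vec<Run>) -> Access {
    if selection.iter().all(|run| run.skip || run.rows == 0) {
        Access::Skip
    } else {
        Access::Selection(selection)
    }
}
```

Converting an empty selection to a skip means later steps never have to fetch or decode anything for that row group.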


alamb · 2024-06-06T12:40:22Z

Thank you everyone for the feedback and reviews. I will now merge this PR and will address any additional feedback in follow-on PRs.

adriangb · 2024-06-12T19:43:24Z

@alamb is there any documentation on what it means for DataFusion to "scan" specific rows within a row group? Does it actually read only those rows? I'd imagine that because of some mix of compression and limitations of byte range fetches to contiguous bytes for object stores you end up streaming entire row groups anyway.

Edit: looks like there's info in https://github.com/apache/parquet-format/blob/master/PageIndex.md, and guess who's committed there 😆. I still don't fully understand the implications but this is a good start and reading homework for me.

alamb · 2024-06-12T22:24:28Z

> @alamb is there any documentation on what it means for DataFusion to "scan" specific rows within a row group? Does it actually read only those rows? I'd imagine that because of some mix of compression and limitations of byte range fetches to contiguous bytes for object stores you end up streaming entire row groups anyway.

Specifically, DataFusion uses this API: https://github.com/apache/arrow-rs/blob/0cc14168000e1e41fc5f63929d34d13dda6e5873/parquet/src/arrow/arrow_reader/mod.rs#L137-L194

If you have the PageIndex (which is written by default in the parquet rs writer), the reader may be able to skip certain pages.

thinkharderdev · 2024-06-13T12:56:22Z

> > @alamb is there any documentation on what it means for DataFusion to "scan" specific rows within a row group? Does it actually read only those rows? I'd imagine that because of some mix of compression and limitations of byte range fetches to contiguous bytes for object stores you end up streaming entire row groups anyway.
>
> Specifically, DataFusion uses this API: https://github.com/apache/arrow-rs/blob/0cc14168000e1e41fc5f63929d34d13dda6e5873/parquet/src/arrow/arrow_reader/mod.rs#L137-L194
>
> If you have the PageIndex (which is written by default in the parquet rs writer), the reader may be able to skip certain pages.

Yeah, so conceptually how it works is that once we have a RowSelection we can:

1. If there is a PageIndex, compare the RowSelection to the PageIndex and fetch only the data pages which contain selected rows (and hence prune IO)
2. While decoding the data pages that were fetched, skip decoding of rows that were not selected. Depending on the exact datatype this can be more or less useful. For something that is delta encoded, you can't really skip decoding within mini-blocks so it probably doesn't make a huge difference, but with a fixed-size datatype you can skip over an arbitrary number of rows by just jumping directly to the next selected row and potentially save a bunch of CPU cycles.
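The IO-pruning step described above can be sketched as follows, using only std types. This is a simplified model rather than the parquet crate's code: `Run` and `pages_to_fetch` are hypothetical, and the page boundaries are assumed to be available as the first row index of each page within one column chunk (the kind of information an offset index provides).

```rust
/// A run of consecutive rows that is either read or skipped.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Run {
    pub rows: usize,
    pub skip: bool,
}

/// Return the indexes of the pages that contain at least one selected row.
///
/// `page_first_rows[i]` is the row index at which page `i` starts and
/// `num_rows` is the total number of rows in the row group.
pub fn pages_to_fetch(
    page_first_rows: &[usize],
    num_rows: usize,
    selection: &[Run],
) -> Vec<usize> {
    let mut pages = Vec::new();
    for (page, &start) in page_first_rows.iter().enumerate() {
        // A page ends where the next one starts (or at the end of the group).
        let end = page_first_rows.get(page + 1).copied().unwrap_or(num_rows);
        let mut row = 0; // first row covered by the current run
        let mut needed = false;
        for run in selection {
            let run_end = row + run.rows;
            // A selected, non-empty run overlapping [start, end) needs the page.
            if !run.skip && run.rows > 0 && row < end && run_end > start {
                needed = true;
                break;
            }
            row = run_end;
        }
        if needed {
            pages.push(page);
        }
    }
    pages
}
```

With pages starting at rows 0, 100, and 200 of a 300-row row group, a selection of skip 200, select 50, skip 50 needs only the last page. The quadratic scan keeps the sketch short; a real reader would walk pages and runs together in a single pass.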

…ection (apache#10738)
* Add `ParquetAccessPlan` that describes which part of the parquet files to read
* Rename to RowGroupAccessPlanFilter
* Clarify when overall selection is needed
* Update documentation to exlain the relationship between scan/skip/selection
* Break early of the row selection is empty

alamb marked this pull request as draft May 31, 2024 12:03

github-actions bot added the core Core DataFusion crate label May 31, 2024

alamb force-pushed the alamb/unified_rg_and_page branch from f301a73 to 301f27d Compare May 31, 2024 13:19

alamb changed the title ~~Add ParquetAccessPlan to describe which row groups and which rows should be read~~ Add ParquetAccessPlan, unify RowGroup selection and PagePruning selection May 31, 2024

alamb force-pushed the alamb/unified_rg_and_page branch 2 times, most recently from 47ed64a to d638b8f Compare May 31, 2024 17:06

alamb mentioned this pull request May 31, 2024

Minor: Improve ArrowReaderBuilder::with_row_selection docs apache/arrow-rs#5824

Merged

Add ParquetAccessPlan that describes which part of the parquet file…

3bd9b04

…s to read

alamb force-pushed the alamb/unified_rg_and_page branch from 5b1b5eb to 3bd9b04 Compare June 1, 2024 10:58


alamb marked this pull request as ready for review June 1, 2024 19:37

alamb mentioned this pull request Jun 3, 2024

API in ParquetExec to pass in RowSelections to ParquetExec (enable custom indexes, finer grained pushdown) #9929

Closed

Merge remote-tracking branch 'apache/main' into alamb/unified_rg_and_…

aed543e

…page

alamb mentioned this pull request Jun 3, 2024

Add advanced_parquet_index.rs example of index in into parquet files #10701

Merged

hengfeiyang approved these changes Jun 4, 2024

thinkharderdev approved these changes Jun 5, 2024

crepererum approved these changes Jun 5, 2024

advancedxy reviewed Jun 5, 2024

alamb added 3 commits June 5, 2024 11:49

Merge remote-tracking branch 'apache/main' into alamb/unified_rg_and_…

6019e07

…page

Rename to RowGroupAccessPlanFilter

52f4d39

Clarify when overall selection is needed

8d44ed2

advancedxy approved these changes Jun 5, 2024

Ted-Jiang approved these changes Jun 6, 2024

Ted-Jiang reviewed Jun 6, 2024

xinlifoobar approved these changes Jun 6, 2024

alamb added 2 commits June 6, 2024 07:14

Update documentation to exlain the relationship between scan/skip/sel…

a76f95a

…ection

Merge remote-tracking branch 'apache/main' into alamb/unified_rg_and_…

8fd9983

…page

alamb mentioned this pull request Jun 6, 2024

Minor: refine row selection example more apache/arrow-rs#5850

Merged

Break early of the row selection is empty

e2dec61

alamb merged commit 1a26eca into apache:main Jun 6, 2024
23 checks passed

This was referenced Jun 6, 2024

Support user defined ParquetAccessPlan in ParquetExec, validation to ParquetAccessPlan::select #10813

Merged

Make RowSelection's from_consecutive_ranges public apache/arrow-rs#5848

Merged

alamb deleted the alamb/unified_rg_and_page branch June 12, 2024 22:24
