create PageIndexPolicy to allow optional indexes #8071

Open: wants to merge 5 commits into base: main

@kczimm commented Aug 6, 2025

Which issue does this PR close?

Rationale for this change

This change introduces a more flexible way to handle page indexes (column and offset indexes) in Parquet files. Previously, reading these indexes was controlled by boolean flags that could express only two states: read the index and fail if it is missing, or do not read it. The new PageIndexPolicy enum (Off, Optional, Required) provides finer control, allowing users to specify whether an index is not read, read if present (without error if missing), or strictly required (error if missing).

What changes are included in this PR?

  • Introduced a new PageIndexPolicy enum with Off, Optional, and Required variants.
  • Replaced the boolean column_index and offset_index fields in ParquetMetaDataReader with the new PageIndexPolicy enum.
  • Updated the ParquetMetaDataReader::new() function to initialize page index policies to Off, preserving previous defaults.
  • Modified existing with_page_indexes, with_column_indexes, and with_offset_indexes methods to utilize the new PageIndexPolicy, defaulting to Required when enabling indexes.
  • Added new methods: with_page_index_policy, with_column_index_policy, and with_offset_index_policy to allow direct setting of the page index policy.
  • Adjusted the internal logic for parsing column and offset indexes to respect the specified PageIndexPolicy, including returning an error if a Required index is not found.
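
The API shape described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the parquet crate's source: `SketchMetaDataReader` and its `policies` accessor are hypothetical stand-ins, and only the enum variants and the `with_*` method names are taken from the description.

```rust
// Illustrative stand-in only; not the parquet crate's actual code.

/// How a reader should treat one kind of page index.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum PageIndexPolicy {
    /// Do not read the index (the previous default behaviour).
    #[default]
    Off,
    /// Read the index if present; a missing index is not an error.
    Optional,
    /// Read the index; a missing index is an error.
    Required,
}

/// Hypothetical, stripped-down reader that holds only the two policies.
#[derive(Debug, Default)]
pub struct SketchMetaDataReader {
    column_index: PageIndexPolicy,
    offset_index: PageIndexPolicy,
}

impl SketchMetaDataReader {
    /// Both policies start as `Off`.
    pub fn new() -> Self {
        Self::default()
    }

    /// Boolean form kept for compatibility: `true` maps to `Required`.
    pub fn with_page_indexes(self, val: bool) -> Self {
        let policy = if val {
            PageIndexPolicy::Required
        } else {
            PageIndexPolicy::Off
        };
        self.with_page_index_policy(policy)
    }

    /// Set the same policy for both the column index and the offset index.
    pub fn with_page_index_policy(self, policy: PageIndexPolicy) -> Self {
        self.with_column_index_policy(policy)
            .with_offset_index_policy(policy)
    }

    pub fn with_column_index_policy(mut self, policy: PageIndexPolicy) -> Self {
        self.column_index = policy;
        self
    }

    pub fn with_offset_index_policy(mut self, policy: PageIndexPolicy) -> Self {
        self.offset_index = policy;
        self
    }

    /// Hypothetical accessor returning (column policy, offset policy).
    pub fn policies(&self) -> (PageIndexPolicy, PageIndexPolicy) {
        (self.column_index, self.offset_index)
    }
}
```

In this sketch, `SketchMetaDataReader::new().with_page_index_policy(PageIndexPolicy::Optional)` would request both indexes without failing on files that lack them, while the boolean methods keep their strict meaning.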

Are these changes tested?

Yes, a new test file parquet/tests/page_index.rs has been added to cover the functionality of the new PageIndexPolicy and its integration with ParquetMetaDataReader.

Are there any user-facing changes?

Yes, there are user-facing changes to the ParquetMetaDataReader API. The with_column_indexes and with_offset_indexes methods now implicitly use PageIndexPolicy::Required when enabling page indexes. New methods with_page_index_policy, with_column_index_policy, and with_offset_index_policy have been added.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 6, 2025

    if self.offset_index == PageIndexPolicy::Required {
        return Err(general_err!("missing offset index"));
    }
    Ok(OffsetIndexMetaData {

A question from naivety: what are the implications of page_locations being empty? That is, what behaviour is assumed if it is empty? Possibly:

  1. There are no associated pages.
  2. There is no precomputed index of which pages are associated, so we must calculate it ourselves.

I've started looking at this, but it is convoluted.

I feel the most correct approach would be to change the ParquetOffsetIndex to have Options, i.e.

- pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;
+ pub type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>; 

This is a bit more involved, but semantically more correct (again from my understanding).
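
To make the distinction concrete, here is a small sketch of the Option-based shape. The types are simplified stand-ins (the names `SketchOffsetIndex`, `SketchParquetOffsetIndex`, and `count_missing` are hypothetical, and page locations are reduced to plain integers); the point is only that `None` can mean "no index was stored" while `Some` with an empty `page_locations` can mean "an index was stored and lists no pages".

```rust
/// Stand-in for an offset index entry; only the field under discussion.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchOffsetIndex {
    /// Page locations, simplified here to byte offsets.
    pub page_locations: Vec<i64>,
}

/// Proposed shape: outer Vec = row groups, inner Vec = column chunks,
/// `None` = no offset index was available for that column chunk.
pub type SketchParquetOffsetIndex = Vec<Vec<Option<SketchOffsetIndex>>>;

/// Count column chunks whose offset index is absent.
pub fn count_missing(index: &SketchParquetOffsetIndex) -> usize {
    index
        .iter()
        .flatten()
        .filter(|entry| entry.is_none())
        .count()
}
```

With this shape a caller can tell an absent index apart from an index that is present but empty, which a bare empty vector cannot express.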

Author

I've changed it based on @etseidl's feedback. Now, when an index is missing, we just set the column and offset indexes to None and return. What do you think of this approach?

kczimm added 2 commits August 7, 2025 15:03
- Rename PageIndexPolicy::Off to PageIndexPolicy::Skip
- impl From<bool> for PageIndexPolicy for DRY
- Expose PageIndexPolicy to Arrow
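
A rough sketch of what the first two commit messages suggest: the enum with `Skip` in place of `Off`, and a `From<bool>` conversion so the boolean setters can delegate to the policy setters. The mapping of `true` to `Required` is an assumption carried over from the PR description, and `policy_for_flag` is a hypothetical helper used only for illustration.

```rust
// Illustrative only; the mapping below is assumed from the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexPolicy {
    /// Do not read the index (renamed from `Off`).
    Skip,
    /// Read the index if present.
    Optional,
    /// Read the index and error if it is missing.
    Required,
}

impl From<bool> for PageIndexPolicy {
    /// `true` keeps the strict behaviour of the old boolean flags.
    fn from(enabled: bool) -> Self {
        if enabled {
            Self::Required
        } else {
            Self::Skip
        }
    }
}

/// Hypothetical helper: what a bool-taking setter could pass along.
pub fn policy_for_flag(flag: bool) -> PageIndexPolicy {
    flag.into()
}
```
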

@alamb (Contributor) commented Aug 7, 2025

FWIW, I think this is a good idea and a nice change. Is this PR ready for review, @kczimm? (It is currently marked as a draft.)

@etseidl (Contributor) left a comment

I can see the desire for this, but I think some discussion is warranted to suss out what the desired behavior is for the Optional case.

Thanks for raising the issue @kczimm.

@@ -593,7 +642,15 @@ impl ParquetMetaDataReader {
          col_idx,
      )
  }
- None => Err(general_err!("missing offset index")),
+ None => {

Contributor

This change here is the heart of the PR. What gives me pause is that simply replacing missing indexes with empty vectors doesn't let the user know that the indexes are in a potentially unusable state. Are the indexes missing for a single column chunk? An entire row group? We can't really tell without doing a validation step after decoding is complete.

If we move forward with this, I'd prefer that, rather than inserting invalid indexes, we invalidate the entire page index (i.e. set the column and offset indexes back to None in the ParquetMetaData).
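
One way to read this suggestion, as a hedged sketch rather than the code in the PR: when any column chunk lacks its offset index under the Optional policy, clear both indexes instead of storing a placeholder. `SketchMetaData` and `apply_offset_indexes` are hypothetical stand-ins, and index entries are reduced to strings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexPolicy {
    Skip,
    Optional,
    Required,
}

/// Stand-in for decoded metadata; each index is all-or-nothing.
#[derive(Debug, Default, PartialEq)]
pub struct SketchMetaData {
    pub column_index: Option<Vec<String>>,
    pub offset_index: Option<Vec<String>>,
}

/// `found[i]` is the decoded offset index for column chunk `i`,
/// or `None` when the file does not contain one.
pub fn apply_offset_indexes(
    meta: &mut SketchMetaData,
    found: Vec<Option<String>>,
    policy: PageIndexPolicy,
) -> Result<(), String> {
    if policy == PageIndexPolicy::Skip {
        return Ok(());
    }
    let mut decoded = Vec::with_capacity(found.len());
    for entry in found {
        match entry {
            Some(index) => decoded.push(index),
            // Required: a missing index is an error.
            None if policy == PageIndexPolicy::Required => {
                return Err("missing offset index".to_string());
            }
            // Optional: invalidate the whole page index instead of
            // storing an empty placeholder for the missing entry.
            None => {
                meta.column_index = None;
                meta.offset_index = None;
                return Ok(());
            }
        }
    }
    meta.offset_index = Some(decoded);
    Ok(())
}
```

The caller then sees either a complete offset index or none at all, so no validation pass over partially filled vectors is needed.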

Author

Thanks for the thoughtful feedback, @etseidl. I see what you mean. I pushed a commit that attempts to implement your suggestion.

Contributor

Thanks @kczimm. I'll try to review some time in the next few days. At first glance it looks good.

@kczimm kczimm marked this pull request as ready for review August 8, 2025 00:59

Successfully merging this pull request may close these issues.

Optionally read parquet page indexes