-
Notifications
You must be signed in to change notification settings - Fork 989
create PageIndexPolicy to allow optional indexes #8071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
parquet/src/file/metadata/reader.rs
Outdated
if self.offset_index == PageIndexPolicy::Required { | ||
return Err(general_err!("missing offset index")); | ||
} | ||
Ok(OffsetIndexMetaData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question from naivetaty. What are the implication of page_locations
being empty? i.e. What behaviour is assumed if this is empty? possibly
- there are no associated pages
- we have no pre-indexed idea about which pages are associated so we must calculate it ourselves.
I've started looking at this, but it is convoluted.
I feel the most correct approach would be to change the ParquetOffsetIndex
to have Options, i.e.
- pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;
+ pub type ParquetOffsetIndex = Vec<Vec<Option<OffsetIndexMetaData>>>;
This is a bit more involved, but semantically more correct (again from my understanding).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it based on @etseidl feedback. Now if we just set the column and offset index to None and return. What do you think about this approach?
- Rename PageIndexPolicy::Off to PageIndexPolicy::Skip - impl From<bool> for PageIndexPolicy for DRY - Expose PageIndexPolicy to Arrow
I think this is a good idea, FWIW and a nice change. Is this PR ready for review @kczimm (it is currently marked as a draft)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the desire for this, but I think some discussion is warranted to suss out what the desired behavior is for the Optional
case.
Thanks for raising the issue @kczimm.
parquet/src/file/metadata/reader.rs
Outdated
@@ -593,7 +642,15 @@ impl ParquetMetaDataReader { | |||
col_idx, | |||
) | |||
} | |||
None => Err(general_err!("missing offset index")), | |||
None => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change here is the heart of the PR. The thing that gives me pause is simply replacing missing indexes with empty vectors doesn't let the user know that the indexes are in a potentially unusable state. Are the indexes missing for a single column chunk? An entire row group? We can't really tell without doing a validation step after decoding is complete.
I think if we move forward with this, I'd prefer rather than inserting invalid indexes, we instead invalidate the entire page index (i.e. set column and offset index back to None
in the ParquetMetaData
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thoughtful feedback, @etseidl. I see what you mean. I pushed a commit that was an attempt to implement your desire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kczimm. I'll try to review some time in the next few days. First glance looks good.
Which issue does this PR close?
Rationale for this change
This change introduces a more flexible way to handle page indexes (column and offset indexes) in Parquet files. Previously, the reading of these indexes was controlled by boolean flags, which indicated read required or do not read. The new
PageIndexPolicy
enum (Off
,Optional
,Required
) provides finer control, allowing users to specify whether an index is not read, read if present (without error if missing), or strictly required (error if missing).What changes are included in this PR?
PageIndexPolicy
enum withOff
,Optional
, andRequired
variants.column_index
andoffset_index
fields inParquetMetaDataReader
with the newPageIndexPolicy
enum.ParquetMetaDataReader::new()
function to initialize page index policies toOff
, preserving previous defaults.with_page_indexes
,with_column_indexes
, andwith_offset_indexes
methods to utilize the newPageIndexPolicy
, defaulting toRequired
when enabling indexes.with_page_index_policy
,with_column_index_policy
, andwith_offset_index_policy
to allow direct setting of the page index policy.PageIndexPolicy
, including returning an error if aRequired
index is not found.Are these changes tested?
Yes, a new test file
parquet/tests/page_index.rs
has been added to cover the functionality of the newPageIndexPolicy
and its integration withParquetMetaDataReader
.Are there any user-facing changes?
Yes, there are user-facing changes to the
ParquetMetaDataReader
API. Thewith_column_indexes
andwith_offset_indexes
methods now implicitly usePageIndexPolicy::Required
when enabling page indexes. New methodswith_page_index_policy
,with_column_index_policy
, andwith_offset_index_policy
have been added.