-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract parquet statistics for StructArray
#11289
Conversation
@alamb I'll try to add support for struct in DataPage in next PR. |
StructArray
// For example a ListArray could correspond to anything from 1 to 3 levels | ||
// in the parquet schema | ||
return None; | ||
match field.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should extract the repeated code into a helper function? Could also reduce nesting with an early return for the case where it's not nested. So something like:
match field.data_type() { | |
pub(crate) fn parquet_column<'a>( | |
parquet_schema: &SchemaDescriptor, | |
arrow_schema: &'a Schema, | |
name: &str, | |
) -> Option<(usize, &'a FieldRef)> { | |
let (root_idx, field) = arrow_schema.fields.find(name)?; | |
if !field.data_type().is_nested() { | |
let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?; | |
return Some((parquet_idx, field)); | |
} | |
// Nested field | |
match field.data_type() { | |
DataType::Struct(_) => { | |
let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?; | |
Some((parquet_idx, field)) | |
} | |
_ => { | |
if field.data_type().is_nested() { | |
// Nested fields are not supported and require non-trivial logic | |
// to correctly walk the parquet schema accounting for the | |
// logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md> | |
// | |
// For example a ListArray could correspond to anything from 1 to 3 levels | |
// in the parquet schema | |
None | |
} else { | |
let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?; | |
Some((parquet_idx, field)) | |
} | |
} | |
} | |
} | |
// Helper function to find parquet index - TBD: this could be more efficient | |
fn find_parquet_idx(parquet_schema: &SchemaDescriptor, root_idx: usize) -> Option<usize> { | |
(0..parquet_schema.columns().len()) | |
.find(|x| parquet_schema.get_column_root_idx(*x) == root_idx) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can do it
Arc::new(BooleanArray::from(vec![Some(true)])) as ArrayRef, | ||
), | ||
]); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered if we needed a struct array with top-level nulls. But for the stats, it shouldn't matter whether the null is coming from the top level or the nested arrays. That would be testing the how the stats are calculated. (This assumes I'm interpreting the way nulls are handled correctly.)
But it might be worth having some null counts to see that they get calculated correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I concern about. Since it is hard and seems meaningless to have a top-level nullcount. The reason I leave it like this is just to follow the test pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This blog post provides good examples on when you'd represent nulls at different levels: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to read the valid mask but it actually contains in the data page, not sure whether we need to read the actual data at this time since it is a Metadata level analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to do anything differently. I was trying to find some tests in the arrow crate for when the statistics are written but I haven't been able to find any for writing nested structs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's some tests in the arrow crate that are writing a 2 level struct: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L1516-L1618 Not sure if that's helpful or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll take a look at it.
c1a3e62
to
15cd7c5
Compare
metadata: &[&RowGroupMetaData], | ||
index: usize, | ||
data_type: &DataType, | ||
) -> Vec<u64> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you preferred, this could also be expressed as a fold:
let num_row_groups = metadata.len();
fields.iter().fold(vec![0; num_row_groups], |mut acc, field| {
let field_null_counts = Self::get_null_counts_recursive(
metadata,
index + 1,
field.data_type(),
);
acc.iter_mut().zip(field_null_counts.iter()).for_each(|(a, b)| *a += b);
acc
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sorry for the late response
Since we have merged apache/arrow-rs#6046 now upstream in arrow-rs, @Lordworms would you be willing to port this PR to the other repo? |
Sure, I'll open a PR later today. |
filed apache/arrow-rs#6090 |
Thank you @Lordworms - sorry for the delay / runaround. I just haven't had a chance to focus on this PR. I was hoping someone with more structured type experience would be able to help review it. 🎣 |
superceded by apache/arrow-rs#6090 (which I know is waiting for a review -- I just need to find time to study the nested types or find someone else who does) |
Which issue does this PR close?
Closes #10609
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?