Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract parquet statistics for StructArray #11289

Closed
wants to merge 2 commits into from

Conversation

Lordworms
Copy link
Contributor

Which issue does this PR close?

Closes #10609

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 5, 2024
@Lordworms
Copy link
Contributor Author

@alamb I'll try to add support for struct in DataPage in next PR.

@alamb alamb changed the title extract statistics read for struct array in parquet Extract parquet statistics for StructArray Jul 7, 2024
// For example a ListArray could correspond to anything from 1 to 3 levels
// in the parquet schema
return None;
match field.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should extract the repeated code into a helper function? Could also reduce nesting with an early return for the case where it's not nested. So something like:

Suggested change
match field.data_type() {
pub(crate) fn parquet_column<'a>(
parquet_schema: &SchemaDescriptor,
arrow_schema: &'a Schema,
name: &str,
) -> Option<(usize, &'a FieldRef)> {
let (root_idx, field) = arrow_schema.fields.find(name)?;
if !field.data_type().is_nested() {
let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?;
return Some((parquet_idx, field));
}
// Nested field
match field.data_type() {
DataType::Struct(_) => {
let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?;
Some((parquet_idx, field))
}
_ => {
if field.data_type().is_nested() {
// Nested fields are not supported and require non-trivial logic
// to correctly walk the parquet schema accounting for the
// logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
//
// For example a ListArray could correspond to anything from 1 to 3 levels
// in the parquet schema
None
} else {
let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?;
Some((parquet_idx, field))
}
}
}
}
// Helper function to find parquet index - TBD: this could be more efficient
fn find_parquet_idx(parquet_schema: &SchemaDescriptor, root_idx: usize) -> Option<usize> {
(0..parquet_schema.columns().len())
.find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do it

Arc::new(BooleanArray::from(vec![Some(true)])) as ArrayRef,
),
]);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if we needed a struct array with top-level nulls. But for the stats, it shouldn't matter whether the null is coming from the top level or the nested arrays. That would be testing the how the stats are calculated. (This assumes I'm interpreting the way nulls are handled correctly.)

But it might be worth having some null counts to see that they get calculated correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I concern about. Since it is hard and seems meaningless to have a top-level nullcount. The reason I leave it like this is just to follow the test pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This blog post provides good examples on when you'd represent nulls at different levels: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/

Copy link
Contributor Author

@Lordworms Lordworms Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to read the valid mask but it actually contains in the data page, not sure whether we need to read the actual data at this time since it is a Metadata level analysis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to do anything differently. I was trying to find some tests in the arrow crate for when the statistics are written but I haven't been able to find any for writing nested structs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some tests in the arrow crate that are writing a 2 level struct: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L1516-L1618 Not sure if that's helpful or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll take a look at it.

metadata: &[&RowGroupMetaData],
index: usize,
data_type: &DataType,
) -> Vec<u64> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you preferred, this could also be expressed as a fold:

            let num_row_groups = metadata.len();
            fields.iter().fold(vec![0; num_row_groups], |mut acc, field| {
                let field_null_counts = Self::get_null_counts_recursive(
                    metadata,
                    index + 1,
                    field.data_type(),
                );
                acc.iter_mut().zip(field_null_counts.iter()).for_each(|(a, b)| *a += b);
                acc
            })

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sorry for the late response

@alamb
Copy link
Contributor

alamb commented Jul 18, 2024

Since we have merged apache/arrow-rs#6046 now upstream in arrow-rs, @Lordworms would you be willing to port this PR to the other repo?

@Lordworms
Copy link
Contributor Author

Since we have merged apache/arrow-rs#6046 now upstream in arrow-rs, @Lordworms would you be willing to port this PR to the other repo?

Sure, I'll open a PR later today.

@Lordworms
Copy link
Contributor Author

filed apache/arrow-rs#6090

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

Thank you @Lordworms - sorry for the delay / runaround. I just haven't had a chance to focus on this PR. I was hoping someone with more structured type experience would be able to help review it. 🎣

@alamb
Copy link
Contributor

alamb commented Jul 24, 2024

superceded by apache/arrow-rs#6090

(which I know is waiting for a review -- I just need to find time to study the nested types or find someone else who does)

@alamb alamb closed this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect statistics read for struct array in parquet
3 participants