Report uncompressed column size as a statistic #6848

Draft · wants to merge 4 commits into main
18 changes: 18 additions & 0 deletions parquet/src/arrow/arrow_reader/statistics.rs
@@ -1432,6 +1432,24 @@ impl<'a> StatisticsConverter<'a> {
        Ok(UInt64Array::from_iter(null_counts))
    }

    /// Extract the uncompressed sizes from row group statistics in [`RowGroupMetaData`]
Contributor:

It might also be worth mentioning here that this is the uncompressed size of the Parquet data page, i.e. what is reported here:

https://github.com/apache/parquet-format/blob/4a17d6bfc0bcf7fe360e75e165c1764b43b51352/src/main/thrift/parquet.thrift#L724-L725

I think as written it might be confused with the uncompressed size after decoding to Arrow, which will likely be quite different (and substantially larger).

Contributor:

Good point. That is why I wanted this added to Parquet: to allow better estimation of decoded sizes.

Contributor:

🤦 I forgot that you added this:

https://github.com/search?q=repo%3Aapache%2Farrow-rs%20unencoded_byte_array_data_bytes&type=code

So @AdamGS what do you think about updating this PR to return the unencoded_byte_array_data_bytes field instead of the decompressed page size?

Contributor Author:

SGTM, I'll probably have it later today or tomorrow, depending on my jet lag.

Contributor:

Bear in mind that unencoded_byte_array_data_bytes is only for byte array data, and does not include any overhead introduced by Arrow (offsets array, etc). For fixed width types it would be sufficient to know the total number of values encoded.

Contributor Author:

My current plan is to report unencoded_byte_array_data_bytes for BYTE_ARRAY columns, and width * num_values for the other types, which in my mind is the amount of "information stored".
Consumers like DataFusion can then add any known overheads (like Arrow offset arrays etc.).
The other option I can think of is reporting the value size and letting callers do any arithmetic they find useful (like multiplying by the number of values); I would love to hear your thoughts.
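
A rough sketch of that estimate (an illustrative helper, not code from this PR), assuming the parquet crate's ColumnChunkMetaData exposes num_values(), column_type(), column_descr() and unencoded_byte_array_data_bytes(), the last of which is only populated when the writer recorded size statistics:

```rust
use parquet::basic::Type as PhysicalType;
use parquet::file::metadata::ColumnChunkMetaData;

/// Rough "information stored" estimate for one column chunk, in bytes.
/// (Illustrative only; the name and placement are not part of this PR.)
fn estimated_value_bytes(col: &ColumnChunkMetaData) -> Option<u64> {
    let num_values = col.num_values() as u64;
    match col.column_type() {
        // Variable-length data: use the writer-reported unencoded byte count,
        // which is None when the file carries no size statistics.
        PhysicalType::BYTE_ARRAY => col.unencoded_byte_array_data_bytes().map(|b| b as u64),
        // Fixed-width data: physical width * number of values.
        PhysicalType::BOOLEAN => Some((num_values + 7) / 8), // bit-packed
        PhysicalType::INT32 | PhysicalType::FLOAT => Some(4 * num_values),
        PhysicalType::INT64 | PhysicalType::DOUBLE => Some(8 * num_values),
        PhysicalType::INT96 => Some(12 * num_values),
        // FIXED_LEN_BYTE_ARRAY width comes from the column descriptor.
        PhysicalType::FIXED_LEN_BYTE_ARRAY => {
            Some(col.column_descr().type_length() as u64 * num_values)
        }
    }
}
```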

Contributor Author:

This comment by @findepi makes me think the latter might be the right way to go here.

Contributor:

In my opinion:

  • The parquet crate's API already exposes the unencoded_byte_array_data_bytes metric, so users can already do whatever arithmetic they want; simply adding unencoded_byte_array_data_bytes to the StatisticsConverter is not very helpful (if anything, returning it as an Arrow array makes the values harder to use).
  • Something I do think could be valuable is a way to calculate the memory required for a given amount of Arrow data (e.g. a 100 row Int64 array), but that is probably worth its own ticket / discussion.

I suggest proceeding with apache/datafusion#7548 by adding code there first, figuring out the real use case, and then upstreaming to arrow-rs any common pattern that emerges.

    pub fn row_group_uncompressed_size<I>(&self, metadatas: I) -> Result<UInt64Array>
    where
        I: IntoIterator<Item = &'a RowGroupMetaData>,
    {
        let Some(parquet_index) = self.parquet_column_index else {
            let num_row_groups = metadatas.into_iter().count();
            return Ok(UInt64Array::from_iter(
                std::iter::repeat(None).take(num_row_groups),
            ));
        };

        let uncompressed_sizes = metadatas
            .into_iter()
            .map(|x| x.column(parquet_index).uncompressed_size() as u64);
        Ok(UInt64Array::from_iter_values(uncompressed_sizes))
    }
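
For reference, a minimal sketch of how a caller might use the new method, assuming StatisticsConverter::try_new and the usual ParquetMetaData accessors; the helper function and its name are illustrative, not part of this PR:

```rust
use arrow::datatypes::Schema;
use parquet::arrow::arrow_reader::statistics::StatisticsConverter;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Illustrative helper: total uncompressed size of one column across all row groups.
fn total_uncompressed_size(
    metadata: &ParquetMetaData,
    arrow_schema: &Schema,
    column: &str,
) -> Result<u64> {
    let converter = StatisticsConverter::try_new(
        column,
        arrow_schema,
        metadata.file_metadata().schema_descr(),
    )?;
    let sizes = converter.row_group_uncompressed_size(metadata.row_groups())?;
    // Entries are null when the column is not present in the Parquet schema; skip them.
    Ok(sizes.iter().flatten().sum())
}
```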

    /// Extract the minimum values from Data Page statistics.
    ///
    /// In Parquet files, in addition to the Column Chunk level statistics