
Report uncompressed column size as a statistic #6848

Draft
wants to merge 4 commits into main
Conversation

@AdamGS (Contributor) commented Dec 6, 2024

Which issue does this PR close?

Closes #6847.

Rationale for this change

Part of apache/datafusion#7548.

What changes are included in this PR?

Are there any user-facing changes?

Adds a new public function.

@github-actions bot added the parquet (Changes to the parquet crate) label on Dec 6, 2024
@etseidl (Contributor) commented Dec 6, 2024

Could you add tests for this to parquet/tests/arrow_reader/statistics.rs? 🙏

Otherwise looks reasonable to me.

@etseidl (Contributor) commented Dec 6, 2024

Looks like Date64 handling has changed since you branched, which is throwing off test_dates_64_diff_rg_sizes. Try merging with main and adjusting the expected values to 164, 108.

But thanks! That was fast turnaround.

@AdamGS (Contributor, Author) commented Dec 6, 2024

Thank you for the fast review!

@alamb (Contributor) left a comment:

Thank you @AdamGS (and @etseidl for the review 🙏 )

I am not sure this metric is the ideal one to use in DataFusion (as it refers to the parquet size, rather than the decoded arrow size 🤔).

@@ -1432,6 +1432,24 @@ impl<'a> StatisticsConverter<'a> {
Ok(UInt64Array::from_iter(null_counts))
}

/// Extract the uncompressed sizes from row group statistics in [`RowGroupMetaData`]
Review comment (Contributor):

It might also be worth mentioning here that this is the uncompressed size of the parquet data page.

I.e., this is what is reported here:

https://github.com/apache/parquet-format/blob/4a17d6bfc0bcf7fe360e75e165c1764b43b51352/src/main/thrift/parquet.thrift#L724-L725

I think, as written, it might be confused with the uncompressed size after decoding to arrow, which will likely be quite different (and substantially larger).
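To make the distinction concrete, here is a minimal sketch (not part of this PR) of where that per-column value already lives in the parquet crate's metadata; "uncompressed" here means the decompressed parquet pages, not the decoded Arrow size, and the file path is a placeholder:

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Print the uncompressed vs. compressed size of every column chunk.
fn print_column_sizes(path: &str) -> Result<()> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    for rg in reader.metadata().row_groups() {
        for col in rg.columns() {
            println!(
                "{}: uncompressed={} compressed={}",
                col.column_path(),
                col.uncompressed_size(),
                col.compressed_size()
            );
        }
    }
    Ok(())
}
```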

Review comment (Contributor):

Good point. That is why I wanted this added to Parquet: to allow better estimation of decoded sizes.

Review comment (Contributor):

🤦 I forgot that you added this:

https://github.com/search?q=repo%3Aapache%2Farrow-rs%20unencoded_byte_array_data_bytes&type=code

So @AdamGS what do you think about updating this PR to return the unencoded_byte_array_data_bytes field instead of the decompressed page size?
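For reference, a hedged sketch of how that field is already reachable through the existing metadata API; it is only populated when the writer recorded size statistics, so the accessor returns an Option:

```rust
use parquet::file::metadata::RowGroupMetaData;

// Returns the unencoded byte-array size for one column chunk, or None if
// the file was written without size statistics.
fn unencoded_bytes(rg: &RowGroupMetaData, col_idx: usize) -> Option<i64> {
    rg.column(col_idx).unencoded_byte_array_data_bytes()
}
```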

Review comment (Contributor, Author):

SGTM, I'll probably have it tomorrow or later today, depending on my jet lag.

Review comment (Contributor):

Bear in mind that unencoded_byte_array_data_bytes is only for byte array data, and does not include any overhead introduced by Arrow (offsets array, etc.). For fixed-width types it would be sufficient to know the total number of values encoded.

Review comment (Contributor, Author):

My current plan is to report unencoded_byte_array_data_bytes for BYTE_ARRAY columns, and width * num_values for the others, which in my mind is the amount of "information stored".
Consumers like DataFusion can then add any known overheads (like Arrow offset arrays, etc.).
The other option I can think of is reporting the value size and letting callers do any arithmetic they find useful (like multiplying by the number of values); I would love to hear your thoughts.
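A rough sketch of the first strategy described above; the function name estimate_unencoded_size is illustrative rather than an API in this PR, and the widths used are the parquet physical-type widths:

```rust
use parquet::basic::Type as PhysicalType;
use parquet::file::metadata::ColumnChunkMetaData;

// Estimate the "information stored" in a column chunk: the unencoded byte
// count for BYTE_ARRAY columns, and width * num_values for the rest.
// Returns None when a BYTE_ARRAY column lacks size statistics.
fn estimate_unencoded_size(col: &ColumnChunkMetaData) -> Option<i64> {
    let num_values = col.num_values();
    match col.column_type() {
        PhysicalType::BYTE_ARRAY => col.unencoded_byte_array_data_bytes(),
        // Count booleans as one byte each for simplicity.
        PhysicalType::BOOLEAN => Some(num_values),
        PhysicalType::INT32 | PhysicalType::FLOAT => Some(4 * num_values),
        PhysicalType::INT64 | PhysicalType::DOUBLE => Some(8 * num_values),
        PhysicalType::INT96 => Some(12 * num_values),
        // Fixed-length byte arrays carry their width in the descriptor.
        PhysicalType::FIXED_LEN_BYTE_ARRAY => {
            Some(col.column_descr().type_length() as i64 * num_values)
        }
    }
}
```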

Review comment (Contributor, Author):

This comment by @findepi makes me think the latter might be the right way to go here.

Review comment (Contributor):

In my opinion:

  • The parquet crate's API already exposes the unencoded_byte_array_data_bytes metric, so users can already do whatever arithmetic they want; simply adding unencoded_byte_array_data_bytes to the StatisticsConverter is not very helpful (if anything, returning it as an arrow array makes the values harder to use).
  • Something I do think could potentially be valuable is some way to calculate the memory required for certain amounts of arrow data (e.g. a 100 row Int64 array), but that is probably worth its own ticket / discussion (see the sketch below).

I suggest proceeding with apache/datafusion#7548 by adding code there first / figuring out the real use case, and then upstreaming to arrow-rs any common pattern that emerges.
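As one illustration of the second bullet: arrow can already report how much memory a decoded array actually occupies, which may be a better starting point for that discussion than any parquet-side statistic. A small sketch:

```rust
use arrow::array::{Array, Int64Array};

fn main() {
    // A 100-row Int64 array, as in the example above.
    let array = Int64Array::from_iter_values(0..100);
    // Bytes referenced by the array's data buffers.
    println!("buffer memory: {}", array.get_buffer_memory_size());
    // Total memory, including the array struct itself.
    println!("array memory: {}", array.get_array_memory_size());
}
```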

@tustvold marked this pull request as draft on December 15, 2024.
Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues: Extract parquet columns uncompressed size as a statistic

3 participants