
Report uncompressed column size as a statistic #6848

Draft
wants to merge 4 commits into main
Conversation

@AdamGS (Contributor) commented Dec 6, 2024

Which issue does this PR close?

Closes #6847.

Rationale for this change

Part of apache/datafusion#7548.

What changes are included in this PR?

Are there any user-facing changes?

Adds a new public function.

@github-actions bot added the parquet (Changes to the parquet crate) label on Dec 6, 2024
@etseidl (Contributor) commented Dec 6, 2024

Could you add tests for this to parquet/tests/arrow_reader/statistics.rs? 🙏

Otherwise looks reasonable to me.

@etseidl (Contributor) commented Dec 6, 2024

Looks like Date64 handling has changed since you branched, which is throwing off test_dates_64_diff_rg_sizes. Try merging with main and adjusting the expected values to 164, 108.

But thanks! That was fast turnaround.

@AdamGS (Contributor, Author) commented Dec 6, 2024

Thank you for the fast review!

@alamb (Contributor) left a comment:

Thank you @AdamGS (and @etseidl for the review 🙏 )

I am not sure this metric is the ideal one to use in DataFusion (as it refers to the parquet size, rather than the decoded arrow size 🤔).

@@ -1432,6 +1432,24 @@ impl<'a> StatisticsConverter<'a> {
Ok(UInt64Array::from_iter(null_counts))
}

/// Extract the uncompressed sizes from row group statistics in [`RowGroupMetaData`]
Review comment (Contributor):

It might also be worth mentioning here that this is the uncompressed size of the parquet data page.

I.e., this is what is reported here:

https://github.com/apache/parquet-format/blob/4a17d6bfc0bcf7fe360e75e165c1764b43b51352/src/main/thrift/parquet.thrift#L724-L725

I think, as written, it might be confused with the uncompressed size after decoding to arrow, which will likely be quite different (and substantially larger).
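To make the distinction concrete, here is a minimal sketch (not part of this PR) of where that per-column value already lives in the parquet crate's metadata; "uncompressed" here means the decompressed parquet pages, not the decoded Arrow size, and the file path is a placeholder:

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Print the uncompressed vs. compressed size of every column chunk.
fn print_column_sizes(path: &str) -> Result<()> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    for rg in reader.metadata().row_groups() {
        for col in rg.columns() {
            println!(
                "{}: uncompressed={} compressed={}",
                col.column_path(),
                col.uncompressed_size(),
                col.compressed_size()
            );
        }
    }
    Ok(())
}
```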

Review comment (Contributor):

Good point. That is why I wanted this added to Parquet: to allow better estimation of decoded sizes.

Review comment (Contributor):

🤦 I forgot that you added this:

https://github.com/search?q=repo%3Aapache%2Farrow-rs%20unencoded_byte_array_data_bytes&type=code

So @AdamGS what do you think about updating this PR to return the unencoded_byte_array_data_bytes field instead of the decompressed page size?
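For reference, a hedged sketch of how that field is already reachable through the existing metadata API; it is only populated when the writer recorded size statistics, so the accessor returns an Option:

```rust
use parquet::file::metadata::RowGroupMetaData;

// Returns the unencoded byte-array size for one column chunk, or None if
// the file was written without size statistics.
fn unencoded_bytes(rg: &RowGroupMetaData, col_idx: usize) -> Option<i64> {
    rg.column(col_idx).unencoded_byte_array_data_bytes()
}
```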

Review comment (Contributor, Author):

SGTM, I'll probably have it tomorrow or later today, depending on my jet lag.

Review comment (Contributor):

Bear in mind that unencoded_byte_array_data_bytes is only for byte array data, and does not include any overhead introduced by Arrow (offsets array, etc.). For fixed-width types it would be sufficient to know the total number of values encoded.

Review comment (Contributor, Author):

My current plan is to report unencoded_byte_array_data_bytes for BYTE_ARRAY columns, and width * num_values for the others, which in my mind is the amount of "information stored".
Consumers like DataFusion can then add any known overheads (like Arrow offset arrays, etc.).
The other option I can think of is reporting the value size and letting callers do any arithmetic they find useful (like multiplying by the number of values); I would love to hear your thoughts.
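A rough sketch of the first strategy described above; the function name estimate_unencoded_size is illustrative rather than an API in this PR, and the widths used are the parquet physical-type widths:

```rust
use parquet::basic::Type as PhysicalType;
use parquet::file::metadata::ColumnChunkMetaData;

// Estimate the "information stored" in a column chunk: the unencoded byte
// count for BYTE_ARRAY columns, and width * num_values for the rest.
// Returns None when a BYTE_ARRAY column lacks size statistics.
fn estimate_unencoded_size(col: &ColumnChunkMetaData) -> Option<i64> {
    let num_values = col.num_values();
    match col.column_type() {
        PhysicalType::BYTE_ARRAY => col.unencoded_byte_array_data_bytes(),
        // Count booleans as one byte each for simplicity.
        PhysicalType::BOOLEAN => Some(num_values),
        PhysicalType::INT32 | PhysicalType::FLOAT => Some(4 * num_values),
        PhysicalType::INT64 | PhysicalType::DOUBLE => Some(8 * num_values),
        PhysicalType::INT96 => Some(12 * num_values),
        // Fixed-length byte arrays carry their width in the descriptor.
        PhysicalType::FIXED_LEN_BYTE_ARRAY => {
            Some(col.column_descr().type_length() as i64 * num_values)
        }
    }
}
```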

Review comment (Contributor, Author):

This comment by @findepi makes me think the latter might be the right way to go here.

Review comment (Contributor):

In my opinion:

  • The parquet crate's API already exposes the unencoded_byte_array_data_bytes metric, so users can already do whatever arithmetic they want; simply adding unencoded_byte_array_data_bytes to the StatisticsConverter is not very helpful (if anything, returning it as an arrow array makes the values harder to use).
  • Something I do think could potentially be valuable is some way to calculate the memory required for certain amounts of arrow data (e.g. a 100 row Int64 array), but that is probably worth its own ticket / discussion (see the sketch below).

I suggest proceeding with apache/datafusion#7548 by adding code there first / figuring out the real use case, and then upstreaming to arrow-rs any common pattern that emerges.
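As one illustration of the second bullet: arrow can already report how much memory a decoded array actually occupies, which may be a better starting point for that discussion than any parquet-side statistic. A small sketch:

```rust
use arrow::array::{Array, Int64Array};

fn main() {
    // A 100-row Int64 array, as in the example above.
    let array = Int64Array::from_iter_values(0..100);
    // Bytes referenced by the array's data buffers.
    println!("buffer memory: {}", array.get_buffer_memory_size());
    // Total memory, including the array struct itself.
    println!("array memory: {}", array.get_array_memory_size());
}
```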

@tustvold marked this pull request as draft on December 15, 2024.
Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues: Extract parquet columns uncompressed size as a statistic

3 participants