Report uncompressed column size as a statistic #6848
Draft: AdamGS wants to merge 4 commits into apache:main from AdamGS:adamg/uncompressed-size-stat
It might also be worth mentioning here that this is the uncompressed size of the parquet data page
I.e., this is what is reported here:
https://github.com/apache/parquet-format/blob/4a17d6bfc0bcf7fe360e75e165c1764b43b51352/src/main/thrift/parquet.thrift#L724-L725
I think as written it might be confused with the uncompressed size after decoding to arrow, which will likely be quite different (and substantially larger)
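To make that distinction concrete, here is a back-of-the-envelope sketch (all numbers hypothetical, no Parquet/Arrow APIs involved) comparing a page's uncompressed (encoded) size with its size after decoding to an Arrow string array, for a dictionary-encoded string column:

```python
# Hypothetical page: 1_000 values drawn from a 10-entry string dictionary.
num_values = 1_000
dict_entries = 10
avg_str_len = 20

# Parquet's "uncompressed_page_size": the encoded bytes after decompression,
# i.e. the dictionary values plus the bit-packed dictionary indices.
dict_page_bytes = dict_entries * (4 + avg_str_len)  # 4-byte length prefix + data
index_page_bytes = num_values * 4 // 8              # 4 bits per index (2**4 >= 10)
uncompressed_page_size = dict_page_bytes + index_page_bytes

# Size after decoding to an Arrow StringArray: every value is materialized,
# plus an i32 offsets buffer with one extra slot.
decoded_arrow_size = num_values * avg_str_len + (num_values + 1) * 4

print(uncompressed_page_size, decoded_arrow_size)  # 740 24004
```

Even with these toy numbers the decoded Arrow size is over 30x the uncompressed page size, which is the confusion the comment is warning about.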
Good point. That's why I wanted this added to Parquet, to allow better estimation of decoded sizes.
🤦 I forgot that you added this:
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20unencoded_byte_array_data_bytes&type=code
So @AdamGS, what do you think about updating this PR to return the unencoded_byte_array_data_bytes field instead of the decompressed page size?
SGTM, I'll probably have it tomorrow/later today depending on my jetlag
Bear in mind that unencoded_byte_array_data_bytes is only for byte array data, and does not include any overhead introduced by Arrow (offsets array, etc.). For fixed-width types it would be sufficient to know the total number of values encoded.
My current plan is to report unencoded_byte_array_data_bytes for BYTE_ARRAY columns, and width * num_values for the others, which in my mind is the amount of "information stored". Consumers like DataFusion can then add any known overheads (like Arrow offset arrays etc.).
The other option I can think of is reporting the value size and letting callers do any arithmetic they find useful (like multiplying by the number of values etc.). I'd love to hear your thoughts.
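The first option described above could be sketched roughly like this (a hypothetical helper, not the PR's actual code; the consumer-side overhead shown is Arrow's i32 offsets buffer for variable-length types):

```python
def estimate_decoded_bytes(physical_type, num_values, fixed_width=None,
                           unencoded_byte_array_data_bytes=None):
    """Hypothetical sketch of the plan discussed above, not arrow-rs code:
    BYTE_ARRAY columns report their raw value bytes, fixed-width columns
    report width * num_values; a consumer like DataFusion then adds any
    known Arrow-side overhead (here, an i32 offsets buffer)."""
    if physical_type == "BYTE_ARRAY":
        offsets_bytes = (num_values + 1) * 4  # Arrow StringArray/BinaryArray offsets
        return unencoded_byte_array_data_bytes + offsets_bytes
    return fixed_width * num_values

# A fixed-width INT64 column: 100 values * 8 bytes.
print(estimate_decoded_bytes("INT64", 100, fixed_width=8))  # 800

# A BYTE_ARRAY column: 30 raw value bytes + (3 + 1) * 4 offset bytes.
print(estimate_decoded_bytes("BYTE_ARRAY", 3,
                             unencoded_byte_array_data_bytes=30))  # 46
```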
This comment by @findepi makes me think the latter might be the right way to go here.
In my opinion, users should be able to do whatever arithmetic they want with the unencoded_byte_array_data_bytes metric, so simply adding unencoded_byte_array_data_bytes to the StatisticsConverter is not very helpful (if anything, returning unencoded_byte_array_data_bytes as an arrow array makes the values harder to use). I suggest proceeding with apache/datafusion#7548 by adding code there first / figuring out the real use case, and then upstreaming (to arrow-rs) any common pattern that emerges.