GH-45045: [C++][Parquet] Add a benchmark for size_statistics_level #45085
The slowdown on lists is a bit surprising. Is it because of the level histograms? Does the non-list case have nulls?
Which slowdown did you mean?
For 1, I think it is due to explosion of element sizes. The data size is large enough that most iteration numbers are 1, which may affect judgement.
Yes, the null probability is hard-coded to 50%.
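To make the level-histogram question concrete, here is a minimal, standard-library-only sketch of what a definition level histogram looks like for a flat optional column with a 50% null probability. The helper names `RandomDefLevels` and `LevelHistogram` are hypothetical and are not part of the Parquet C++ API; the sketch only illustrates the per-value counting that a level histogram implies, not how the writer actually implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: generate definition levels for a flat optional column.
// Level 0 means "null", level 1 means "value present".
std::vector<int16_t> RandomDefLevels(std::size_t num_values, double null_probability,
                                     uint32_t seed) {
  std::mt19937 rng(seed);
  std::bernoulli_distribution is_null(null_probability);
  std::vector<int16_t> levels(num_values);
  for (auto& level : levels) {
    level = is_null(rng) ? 0 : 1;
  }
  return levels;
}

// Hypothetical helper: count how many entries fall on each level in [0, max_level].
// This is one extra increment per level value, which is the kind of per-value
// work a level histogram adds on top of writing the levels themselves.
std::vector<int64_t> LevelHistogram(const std::vector<int16_t>& levels,
                                    int16_t max_level) {
  std::vector<int64_t> histogram(static_cast<std::size_t>(max_level) + 1, 0);
  for (int16_t level : levels) {
    assert(level >= 0 && level <= max_level);
    histogram[static_cast<std::size_t>(level)] += 1;
  }
  return histogram;
}
```

For a list column the maximum definition and repetition levels are higher and there is one level entry per leaf slot, so both histograms have more buckets and more entries to count than in the flat case.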
There must be some unstable test since the None encoding is slower...🤔
Yes, the iteration count is 1 in most cases, which makes the results unstable.
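As an illustration of why a single iteration is noisy, the sketch below times a workload several times and reports the median instead of one sample. It uses only the standard library; `Median` and `MedianSeconds` are hypothetical helpers, not part of Google Benchmark or of this PR, and the repetition count is an assumption to tune.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: median of a non-empty set of samples.
double Median(std::vector<double> samples) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  const std::size_t mid = samples.size() / 2;
  return samples.size() % 2 == 1 ? samples[mid]
                                 : (samples[mid - 1] + samples[mid]) / 2.0;
}

// Hypothetical helper: run `workload` `repetitions` times and return the
// median wall-clock duration in seconds. A single run is exposed to cache
// state, scheduling, and allocator noise; the median damps such outliers.
double MedianSeconds(const std::function<void()>& workload, int repetitions) {
  assert(repetitions > 0);
  std::vector<double> samples;
  samples.reserve(static_cast<std::size_t>(repetitions));
  for (int i = 0; i < repetitions; ++i) {
    const auto start = std::chrono::steady_clock::now();
    workload();
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return Median(std::move(samples));
}
```

A smaller data size per iteration would likely have a similar stabilizing effect, since the benchmark framework could then run more iterations within the same time budget.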
Ok, some observations:
I wanted to observe the page index size overhead relative to the file size. It now seems to be trivial, so compression can be removed.
Number of leaf values: 1024 * 1024
It is counter-intuitive that |
Thanks. By the way, can you disable dictionary encoding? Does it use PLAIN encoding?
After disabling dictionary encoding, the results look more reasonable.
Rationale for this change
Add a benchmark to measure the performance impact of the different size statistics levels.
What changes are included in this PR?
Add a size_stats_benchmark for parquet.
Are these changes tested?
No
Are there any user-facing changes?
No