Write statistics #486
Comments
I assumed the |
Note to self: https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.statistics_enabled Happy to change the default, but I still wonder how to test this. |
Extracting statistics from a file written with
import pyarrow.parquet as pq
pq_file = pq.ParquetFile("tmp.par")
# Get metadata for the i-th RowGroup
rg_meta = pq_file.metadata.row_group(0)
# Get the "max" statistic for the k-th column
max_of_col = rg_meta.column(0).statistics.max
print(max_of_col)
Now, I wonder, what makes you think |
I can reproduce this, but I wonder if this is an issue. It |
Am I also right in assuming it is specifically about the min/max statistic? |
Yes. This is important because it helps query engines skip row groups that are irrelevant. Am I right to assume that this is the implementation used by this tool? I can open a ticket. https://github.com/apache/arrow-rs |
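To illustrate why min/max statistics let an engine skip row groups, here is a minimal, engine-agnostic sketch. `RowGroupStats` and `row_groups_to_read` are hypothetical names made up for this example, and integer values are assumed; no particular query engine is claimed to work exactly this way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's statistics in one row group."""
    min_value: Optional[int]  # None models "no statistics written"
    max_value: Optional[int]


def row_groups_to_read(stats: List[RowGroupStats], lower: int, upper: int) -> List[int]:
    """Indices of row groups that may hold values in the range [lower, upper]."""
    selected = []
    for index, rg in enumerate(stats):
        if rg.min_value is None or rg.max_value is None:
            # Without statistics nothing can be ruled out, so the group must be read.
            selected.append(index)
        elif rg.max_value >= lower and rg.min_value <= upper:
            # The group's value range overlaps the requested range.
            selected.append(index)
    return selected


stats = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(None, None)]
print(row_groups_to_read(stats, 120, 150))  # [1, 2]
```

The first row group is skipped because its maximum lies below the requested range, while the row group without statistics has to be read regardless, which is the cost this issue is about.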
Yes, this is the correct repository to raise an issue. In case it could be of help, here is the point in I rely very much on the defaults, but explicitly enabling statistics did not change the behavior. |
Feel free to link this issue to the upstream ticket. |
I wonder if it's the same as apache/arrow-rs#5162 |
This might very well be the case. I think text is written as "physical" type bytes with logical type UTF-8 |
The ticket was unrelated. My ticket: apache/arrow-rs#5270 |
This behavior is fixed in |
It seems like no Parquet column statistics (like min/max value) are written by this tool.
It would be nice to add an option to write statistics, or even to enable them by default.