Statistics truncate length is not applied in DataFusion compaction #4240

Open
patchwork01 opened this issue Feb 11, 2025 · 0 comments
Labels
bug Something isn't working on-hold

patchwork01 commented Feb 11, 2025

Description

In a DataFusion compaction, we pass the statistics truncate length to DataFusion by setting max_statistics_size in its ParquetOptions. DataFusion does not actually use this property, so the truncate length has no effect on the output file.

Steps to reproduce

  1. Set a statistics truncate length in the table property sleeper.table.parquet.statistics.truncate.length
  2. Ingest data that will result in statistics longer than this truncate length
  3. Run a compaction in DataFusion
  4. See that the statistics in the file output by the compaction are not truncated, although they were truncated in the input file
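
One way to make the check in step 4 concrete is to compare the byte length of each column's min and max statistics against the configured truncate length. The sketch below assumes the statistics have already been read out of the Parquet footer by some other means; `ColumnStats` and `untruncated_columns` are hypothetical names for illustration and are not part of Sleeper, DataFusion, or the Parquet library.

```rust
/// Hypothetical holder for the raw min/max statistics bytes of one column chunk.
struct ColumnStats {
    column: String,
    min: Vec<u8>,
    max: Vec<u8>,
}

/// Returns the names of columns whose min or max statistic is longer than
/// `truncate_length` bytes, i.e. columns where truncation was not applied.
fn untruncated_columns(stats: &[ColumnStats], truncate_length: usize) -> Vec<&str> {
    stats
        .iter()
        .filter(|s| s.min.len() > truncate_length || s.max.len() > truncate_length)
        .map(|s| s.column.as_str())
        .collect()
}
```

Running this over the statistics of the input file and then the compacted file should return an empty list for the former and a non-empty list for the latter if the bug is present.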

Expected behaviour

The statistics truncate length should be applied in DataFusion.
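
For reference, here is a minimal sketch of what truncating binary min/max statistics to a given length usually involves. It is an illustration under assumptions, not the Parquet library's implementation: a minimum can simply be cut to a prefix, while a maximum has to be cut and then incremented so that it remains an upper bound. The function names `truncate_min` and `truncate_max` are made up for this example.

```rust
/// Truncate a lower bound. Any prefix of a byte string compares less than or
/// equal to the full string, so the prefix is still a valid minimum.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Truncate an upper bound. The prefix alone would be too small, so the last
/// byte that can be incremented is bumped by one. Returns None when no shorter
/// upper bound exists (the prefix is empty or consists only of 0xFF bytes).
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough, keep the exact value.
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    // Trailing 0xFF bytes cannot be incremented without overflow, so drop them.
    while prefix.last() == Some(&0xFF) {
        prefix.pop();
    }
    let last = prefix.last_mut()?; // None if nothing is left to increment
    *last += 1;
    Some(prefix)
}
```

With a truncate length of 3, a minimum of `abcdef` would become `abc` and a maximum of `abcdef` would become `abd`. Both are still valid bounds for the original values under unsigned byte-wise comparison.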

Background

DataFusion has deprecated the configuration option datafusion.execution.parquet.max_statistics_size, because it's not used:

It seems to have been replaced in the Parquet library by statistics_truncate_length, added here:

There doesn't seem to be a way to set this in DataFusion. We've raised this issue to add an option to configure it:
