[C++] Add arrow::ArrayStatistics #41909

Closed
kou opened this issue May 31, 2024 · 3 comments

Comments

@kou
Member

kou commented May 31, 2024

Describe the enhancement requested

An Arrow array doesn't have statistics, but the source of an Arrow array, such as a Parquet column, may have statistics.
We can get the source statistics via the source reader, such as parquet::ColumnChunkMetaData::statistics() (parquet::ParquetFileReader::metadata()->RowGroup(X)->ColumnChunk(Y)->statistics()), but we can't get them from the Arrow array that was read (e.g. by parquet::arrow::FileReader::ReadColumn()).

How about adding arrow::ArrayStatistics or something and attaching source statistics to arrow::Array?
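
To make the idea concrete, here is a minimal, self-contained sketch of what attaching source statistics to an array could look like. It does not use Arrow or Parquet, and it is not the API that was eventually merged: the `ArrayStatistics` fields, the `Int64Array` stand-in, and the `ReadColumnWithStatistics` and `NullCount` helpers are all hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for the proposed arrow::ArrayStatistics.
// Every field is optional because a source may record only some statistics.
struct ArrayStatistics {
  using ValueType = std::variant<bool, std::int64_t, double, std::string>;
  std::optional<std::int64_t> null_count;
  std::optional<std::int64_t> distinct_count;
  std::optional<ValueType> min;
  std::optional<ValueType> max;
};

// Hypothetical stand-in for an arrow::Array of nullable int64 values.
struct Int64Array {
  std::vector<std::optional<std::int64_t>> values;
  // Null when the source provided no statistics.
  std::shared_ptr<const ArrayStatistics> statistics;
};

// Simulates a reader that copies statistics the source already stores
// (for example in Parquet column chunk metadata) onto the array it returns.
Int64Array ReadColumnWithStatistics(
    std::vector<std::optional<std::int64_t>> values,
    const ArrayStatistics* source_statistics) {
  Int64Array array;
  array.values = std::move(values);
  if (source_statistics != nullptr) {
    array.statistics =
        std::make_shared<const ArrayStatistics>(*source_statistics);
  }
  return array;
}

// A consumer can use the attached null count when it is present and
// fall back to scanning the values when it is not.
std::int64_t NullCount(const Int64Array& array) {
  if (array.statistics && array.statistics->null_count) {
    return *array.statistics->null_count;
  }
  std::int64_t count = 0;
  for (const auto& value : array.values) {
    if (!value.has_value()) {
      ++count;
    }
  }
  return count;
}
```

With something like this, a consumer that only needs the null count or the min/max of a column would not have to go back to the source reader or scan the values again.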

Component(s)

C++

@llama90
Contributor

llama90 commented Jun 1, 2024

I think this is a great idea.

Adding arrow::ArrayStatistics and connecting source statistics to arrow::Array will help users get and use statistics directly from Arrow arrays.

This can make Arrow easier to use, especially for people working with data sources like Parquet that have statistics.

@kou
Member Author

kou commented Jun 3, 2024

The mailing list discussion: https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh

kou added a commit to kou/arrow that referenced this issue Jun 13, 2024
kou added a commit to kou/arrow that referenced this issue Jul 11, 2024
kou added a commit to kou/arrow that referenced this issue Jul 12, 2024
kou added a commit to kou/arrow that referenced this issue Jul 13, 2024
kou added a commit to kou/arrow that referenced this issue Jul 16, 2024
See apacheGH-42133 for how to use this for Apache Parquet statistics.
kou added a commit to kou/arrow that referenced this issue Aug 2, 2024
See apacheGH-42133 for how to use this for Apache Parquet statistics.
kou added a commit that referenced this issue Aug 4, 2024
### Rationale for this change

We're discussing the API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and in GH-41909.

If we have `arrow::ArrayStatistics`, we can attach statistics read from Apache Parquet to `arrow::Array`s.

This only includes `arrow::ArrayStatistics`. See GH-42133 for how to use `arrow::ArrayStatistics` for Apache Parquet's statistics.

### What changes are included in this PR?

This only adds `arrow::ArrayStatistics` and its tests.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #41909

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
kou added this to the 18.0.0 milestone Aug 4, 2024
@kou
Member Author

kou commented Aug 4, 2024

Issue resolved by pull request 43273
#43273
