GH-38837: [Format] Add the specification for statistics schema #45058
Conversation
@github-actions crossbow submit preview-docs
Revision: 5308f9f Submitted crossbow builds: ursacomputing/crossbow @ actions-a2d5a054aa
Thank you for continuing to work on this! A few optional thoughts on TODOs.
can access the proper field by type code, not name. So we can use
any valid name for fields.

TODO: Should we standardize field names?
I don't see any reason to standardize the names, but a reason I could see to use explicit type IDs for at least a few commonly used statistic types would be to ensure that an ArrowArray (or standalone RecordBatch message) could be interpreted without an ArrowSchema. (Completely optional!)
Related discussion: https://github.com/apache/arrow/pull/43553/files#r1871755575
There is no objection to not standardizing field names. So I don't standardize field names.
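To make the type-code point concrete, here is a minimal pyarrow sketch. It is not the schema proposed in this PR; the child field names and type codes are made up for illustration. It shows that a consumer of a dense-union value can select the right child by type code, so the child names never need to be standardized.

```python
import pyarrow as pa

# Hypothetical dense-union "value" field; the field names are arbitrary labels.
children = [
    pa.array([120], type=pa.int64()),      # e.g. an exact statistic value
    pa.array([118.5], type=pa.float64()),  # e.g. an approximate statistic value
]
values = pa.UnionArray.from_dense(
    pa.array([0, 1], type=pa.int8()),      # type code chosen for each value
    pa.array([0, 0], type=pa.int32()),     # offset into the selected child
    children,
    field_names=["whatever_int64", "whatever_float64"],
    type_codes=[0, 1],
)

# A consumer resolves the child by type code, ignoring the field names.
child_index_by_code = dict(zip(values.type.type_codes,
                               range(values.type.num_fields)))
print(values.field(child_index_by_code[1]))  # the float64 child: [118.5]
```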
- Nullable
- Notes
* - key
- ``dictionary<indices: int32, dictionary: utf8>``
The only reason I can see why this would be problematic is that the statistics values would require more than one IPC message to represent. (Completely optional: this may not be an important consideration!)
Ah, I didn't notice that point. Thanks.
The current proposed specification doesn't focus on any particular transports/protocols/APIs/.... So this may not be a problem. If this representation doesn't fit our IPC format, users just won't use it.
@paleolimbot it's hard to shoehorn auxiliary data into the existing protocols. By using Arrow Arrays themselves to push statistics, we at least don't have to introduce another serialization format (like protobuf, JSON [how this proposal started], more flatbuffers...).
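As a concrete illustration of the dictionary-encoded key column quoted above, here is a minimal pyarrow sketch. Only ARROW:max_byte_width:approximate appears in this PR; the other key string is a made-up placeholder.

```python
import pyarrow as pa

# Minimal sketch of the "key" column type discussed above. Statistics names
# repeat across target columns, so dictionary encoding keeps the array small.
# "ARROW:row_count:exact" is a hypothetical placeholder, not taken from this PR.
keys = pa.array([
    "ARROW:max_byte_width:approximate",
    "ARROW:max_byte_width:approximate",
    "ARROW:row_count:exact",
]).dictionary_encode()

print(keys.type)  # e.g. dictionary<values=string, indices=int32, ordered=0>
```

Incidentally, pyarrow prints the dictionary type with values before indices, which is close to the dictionary<values=..., indices=...> spelling suggested later in this review.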
- The maximum size in bytes of a row in the target
  column. (exact)
* - ``ARROW:max_byte_width:approximate``
- ``float64``: TODO: Should we use ``int64`` instead?
The float64-ness makes sense to me here because the calculation providing this approximate value almost certainly returns a non-exact value (i.e., not an integer, even though the exact value is definitely an integer).
Related discussion: https://github.com/apache/arrow/pull/43553/files#r1871759234
There is no objection to using float64. So I use float64.
Thanks for sharing your opinions!
@github-actions crossbow submit preview-docs
Revision: 195c374 Submitted crossbow builds: ursacomputing/crossbow @ actions-f825363132
The vote thread: https://lists.apache.org/thread/3b3gfsxmvl0jxdnbnr89lvjmlvjk2nw5
Some suggestions. Please double-check them.
I'm not sure if my suggestion to change the way we describe the dictionary type makes sense, but I find dictionary<values=.., indices=..> much less confusing than dictionary<indices=.., dictionary=...>.
I think "statistics key" should be referred to as "statistics name".
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions crossbow submit preview-docs
Thanks for your suggestions!
Revision: 3b9c918 Submitted crossbow builds: ursacomputing/crossbow @ actions-ce837e4f33
Rationale for this change
Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plan.
The Apache Arrow format doesn't have statistics, but other formats that can
be read as Apache Arrow data may have them. For example, Apache
Parquet C++ can read an Apache Parquet file as Apache Arrow data, and an
Apache Parquet file may have statistics.
One of the Apache Arrow C streaming interface use cases is the following:
module A reads an Apache Parquet file as Apache Arrow data and passes it to
module B through the Arrow C data interface.
If module A can also pass the statistics associated with the Apache Parquet
file to module B, module B can use the statistics to optimize its
query plan.
What changes are included in this PR?
We standardize how to represent statistics as an Apache Arrow array
so that they are easy to exchange.
We don't standardize how to pass the statistics array. You can use any
interface for it. For example, you can use the Apache Arrow C data interface.
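As a rough illustration of the idea, statistics could be materialized and handed over like any other Arrow data. This is a deliberately simplified sketch, not the schema specified in this PR: the flat column/key/value layout and the float64-only value column are assumptions made to keep the example short.

```python
import pyarrow as pa

# Simplified sketch: one row per (target column, statistic) pair.
statistics = pa.record_batch(
    [
        pa.array([0, 1], type=pa.int32()),              # target column index
        pa.array(
            ["ARROW:max_byte_width:approximate"] * 2
        ).dictionary_encode(),                          # statistic name
        pa.array([8.0, 120.5], type=pa.float64()),      # statistic value
    ],
    names=["column", "key", "value"],
)

# A producer such as a Parquet reader can pass this batch to a consumer such
# as a query engine over the Arrow C data interface, Arrow IPC, or any other
# transport, since it is ordinary Arrow data.
print(statistics.schema)
```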
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.