
GH-38837: [Format] Add the specification to pass statistics through the Arrow C data interface #43553

Closed
wants to merge 19 commits

Conversation

kou
Member

@kou kou commented Aug 5, 2024

Rationale for this change

Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plans.

The Apache Arrow format doesn't have statistics, but other formats that
can be read as Apache Arrow data may have them. For example, Apache
Parquet C++ can read an Apache Parquet file as Apache Arrow data, and
the Apache Parquet file may have statistics.

One of the Arrow C data interface use cases is the following:

  1. Module A reads an Apache Parquet file as Apache Arrow data.
  2. Module A passes the Apache Arrow data to module B through the
     Arrow C data interface.
  3. Module B processes the Apache Arrow data.

If module A can pass the statistics associated with the Apache Parquet
file to module B through the Arrow C data interface, module B can use
the statistics to optimize its query plan.
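
As an illustration of this flow (not part of the PR), here is a minimal pyarrow sketch of steps 1-3. It uses pyarrow's private _export_to_c/_import_from_c helpers for the Arrow C stream interface; the module_a/module_b function names and the data.parquet path are hypothetical.

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.cffi import ffi  # C definitions for ArrowArrayStream and friends

# Module A: read a Parquet file as Arrow data and export it as an ArrowArrayStream.
def module_a(parquet_path, stream_ptr):
    schema = pq.read_schema(parquet_path)
    batches = pq.ParquetFile(parquet_path).iter_batches()
    pa.RecordBatchReader.from_batches(schema, batches)._export_to_c(stream_ptr)

# Module B: import the stream and process the batches. Only the schema and data
# cross this boundary; the Parquet statistics are lost, which is the gap this
# proposal addresses.
def module_b(stream_ptr):
    reader = pa.RecordBatchReader._import_from_c(stream_ptr)
    for batch in reader:
        print(batch.num_rows)

c_stream = ffi.new("struct ArrowArrayStream*")
stream_ptr = int(ffi.cast("uintptr_t", c_stream))
module_a("data.parquet", stream_ptr)
module_b(stream_ptr)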

What changes are included in this PR?

Add the specification to pass statistics through the Arrow C data interface based on the discussion on the dev@ mailing list: https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou marked this pull request as draft August 5, 2024 09:28
@kou
Member Author

kou commented Aug 5, 2024

@github-actions crossbow submit preview-docs


github-actions bot commented Aug 5, 2024

⚠️ GitHub issue #38837 has been automatically assigned in GitHub to PR creator.

@kou
Member Author

kou commented Aug 5, 2024

I'm not a native English speaker. Wording suggestions are very welcome.

I'll add examples after I implement a convenient API in C++.


github-actions bot commented Aug 5, 2024

Revision: 22336f4

Submitted crossbow builds: ursacomputing/crossbow @ actions-28c2a45b3d

Task | Status
preview-docs | GitHub Actions

@github-actions github-actions bot added the 'awaiting changes' label and removed the 'awaiting committer review' label Aug 5, 2024
Comment on lines +56 to +76
* Provide a common way to pass statistics that can be used for
other interfaces such as Arrow Flight too.
Member

What about the Arrow IPC format? Can you add a sentence here that explains why we do not recommend using this to pass statistics over Arrow IPC?

Member Author

Good point. This may fit the Arrow IPC format. (A producer could send data and statistics as two separate pieces of Arrow IPC format data.)

But the Arrow IPC format has more options available. For example, the Arrow IPC format can attach metadata to each record batch message: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format (The Arrow C data interface can't attach metadata to an ArrowArray.)
The Arrow IPC format can also be used with other mechanisms such as Arrow Flight and ADBC.

So this may not be the best approach for the Arrow IPC format. We should discuss the Arrow IPC format use case separately.

I'll add something here.

Comment on lines 59 to 60
For example, ADBC has statistics-related APIs. This specification
doesn't replace them.
Member

Member Author

Thanks. I should have done it.

@ianmcook
Member

ianmcook commented Aug 5, 2024

@pdet please take a look and add comments if you have any, thanks!

@pdet

pdet commented Aug 6, 2024

The format and contents LGTM! I was just slightly confused for one second that the second mapping is the value items in the first mapping.

@Tmonster I think the proposed Statistics keys already cover the table statistics we use/produce. Could you double-check?

Member Author

@kou kou left a comment

@ianmcook Thanks for your suggestions! I've merged all of them!


docs/source/format/CDataInterfaceStatistics.rst (outdated; resolved)
The ``ARROW`` pattern is a reserved namespace for pre-defined
statistics keys. User-defined statistics must not use it.

Here are pre-defined statistics keys:

@github-actions github-actions bot added the 'awaiting change review' and 'awaiting changes' labels and removed the 'awaiting changes' and 'awaiting change review' labels Aug 7, 2024
@kou
Member Author

kou commented Aug 7, 2024

I was just slightly confused for one second that the second mapping is the value items in the first mapping.

Ah, it makes sense. I'll improve it. Thanks.

@kou
Member Author

kou commented Aug 7, 2024

@github-actions crossbow submit preview-docs

@github-actions github-actions bot removed the 'awaiting changes' label Aug 7, 2024
kou and others added 6 commits December 11, 2024 15:20
* Add the original DuckDB use case
* Add TODOs
* Clarify "column index": It uses the flattened index
* Clarify statistics target
* Use concrete data not C++ for examples
@kou kou force-pushed the docs-statistics-c-data-interface branch from 6cf71e0 to ed0cbe2 on December 11, 2024 07:58
@github-actions github-actions bot updated the 'awaiting changes' and 'awaiting change review' labels Dec 11, 2024
@kou
Member Author

kou commented Dec 11, 2024

I think the C Data Interface framing came from the original use case (which I would like to not lose sight of); allowing engines like DuckDB to have a way to get statistics when given a C Data Interface stream, so that they can properly do query planning. Even if we reframe this as just a definition for the format of the statistics it might be good to mention the original use case so there's proper context, to help commenters who aren't familiar with query planning.

I've added the original DuckDB use case.

DuckDB may be able to get statistics without ArrowArrayStream (DuckDB may be able to call a separate API to get statistics) because duckdb::TableFunction has table_function_cardinality_t cardinality and table_statistics_t statistics members.
See also: https://github.com/duckdb/duckdb/blob/v1.1.3/src/function/table/arrow.cpp#L525-L527

I'll start a discussion on the mailing list tomorrow.
(I wanted to do it today...)

@github-actions github-actions bot added the 'awaiting change review' label and removed the 'awaiting changes' label Dec 11, 2024
@kou
Member Author

kou commented Dec 12, 2024

@kou
Member Author

kou commented Dec 18, 2024

Based on the mailing list discussion, we won't limit this specification to only the Arrow C data interface.
We'll standardize only the statistics schema.

I'm closing this in favor of GH-45058.

@kou kou closed this Dec 18, 2024
Contributor

@alamb alamb left a comment

Sorry for being (very) late to respond / review this proposal


We don't standardize field names for the dense union because
consumers can access the proper field by index, not by name. So
producers can use any valid name for the fields.
Contributor

We have had trouble in the arrow-rs implementation and other places where schemas contain field names that aren't standardized.

For example, we have an embedded field name for Lists, and some implementations use "item" and some "element", which causes spurious schema mismatch errors.

Therefore I also recommend removing field names unless there is some good reason for them (it isn't clear to me from the text why arbitrary field names are allowed).
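
For illustration (not from the PR), a small pyarrow sketch of the kind of mismatch described above: two list types that differ only in the embedded child field name compare as unequal.

import pyarrow as pa

# Two list types whose only difference is the name of the embedded child field.
list_item = pa.list_(pa.field("item", pa.int32()))
list_element = pa.list_(pa.field("element", pa.int32()))

print(list_item.equals(list_element))  # False: the embedded field names differ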

* - key
- ``dictionary<indices: int32, dictionary: utf8>``
- ``false``
- Statistics key is string. Dictionary is used for
Contributor

Maybe we could be more explicit about what the key means (it seems like it is the "statistics type")?
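
As a small illustration (not from the PR) of why the key column is dictionary-encoded: the same statistics key string repeats for many columns, and dictionary encoding stores each distinct key only once. A pyarrow sketch:

import pyarrow as pa

keys = pa.array([
    "ARROW:row_count:exact",
    "ARROW:null_count:exact",
    "ARROW:null_count:exact",  # the same key repeats for another column
    "ARROW:max_value:exact",
])
dict_keys = keys.dictionary_encode()  # dictionary<indices: int32, dictionary: utf8>
print(dict_keys.type)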


.. _c-data-interface-statistics-key:

Statistics key
Contributor

Nit: I would find the term "statistics type" rather than "statistics key" easier to understand.


Statistics array::
Contributor

I didn't understand this example. I thought the statistics were structs, so I would have expected the data to look something like this (perhaps we could give the "logical contents" and then the specific array encoding):

[
  // first struct element
  {
    column: null,  // record batch
    statistics: {
      "ARROW:row_count:exact": 0
    }
  },
  {
    column: 0,  // vendor_id
    statistics: {
      "ARROW:null_count:exact": 0,
      "ARROW:distinct_count:exact": 2,
      "ARROW:max_value:exact": 5,
      "ARROW:min_value:exact": 1,
    }
  },
  ...
]

I can help work out the example if people think this is a good idea
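
To make the logical contents above concrete, here is a minimal pyarrow sketch (not from the PR) that builds such an array for the five-row sample data quoted below. It simplifies the proposal: plain utf8 keys instead of dictionary-encoded ones, and int64 values instead of the dense union, so that every value fits a single type.

import pyarrow as pa

# Simplified statistics schema; the proposal dictionary-encodes the key column
# and uses a dense union for the values.
statistics_type = pa.struct([
    ("column", pa.int32()),  # null means the statistic applies to the whole record batch
    ("statistics", pa.map_(pa.utf8(), pa.int64())),
])

statistics = pa.array(
    [
        {"column": None, "statistics": [("ARROW:row_count:exact", 5)]},
        {"column": 0, "statistics": [  # vendor_id
            ("ARROW:null_count:exact", 0),
            ("ARROW:distinct_count:exact", 2),
            ("ARROW:max_value:exact", 5),
            ("ARROW:min_value:exact", 1),
        ]},
    ],
    type=statistics_type,
)
print(statistics)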

vendor_id: [5, 1, 5, 1, 5]
passenger_count: [1, 1, 2, 0, null]

Statistics schema::
Contributor

It is probably too late, but I figured I would point out that another format we use for statistics in DataFusion is a columnar form rather than a nested structure:

https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html

For example, to represent this data in DataFusion it would be:

vendor_id::null_count | vendor_id::min | vendor_id::max | ... | passenger_count::max
0                     | 1              | 5              | ... | 2

The benefit of this encoding is that it can be used to quickly evaluate predicates on ranges (e.g. figure out whether vendor_id = 6 could ever be true).

We use this format to read statistics from the parquet files (see ParquetStatisticsConverter to rule out row groups).

It does result in potentially very wide schemas, however.
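
To illustrate that columnar shape (this is not the DataFusion API, just the layout), a pyarrow sketch with one row of statistics and one column per (data column, statistic) pair for the sample data above; the "::" naming convention is only for illustration:

import pyarrow as pa

pruning_statistics = pa.table({
    "vendor_id::null_count": [0],
    "vendor_id::min": [1],
    "vendor_id::max": [5],
    "passenger_count::null_count": [1],
    "passenger_count::min": [0],
    "passenger_count::max": [2],
})
print(pruning_statistics)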
