GH-38837: [Format] Add the specification for statistics schema #45058
Conversation
@github-actions crossbow submit preview-docs
Revision: 5308f9f Submitted crossbow builds: ursacomputing/crossbow @ actions-a2d5a054aa
Thank you for continuing to work on this! A few optional thoughts on TODOs.
can access the proper field by type code, not name. So we can use
any valid name for fields.

TODO: Should we standardize field names?
I don't see any reason to standardize the names, but a reason I could see to use explicit type IDs for at least a few commonly used statistic types would be to ensure that an ArrowArray (or standalone RecordBatch message) could be interpreted without an ArrowSchema. (Completely optional!)
Related discussion: https://github.com/apache/arrow/pull/43553/files#r1871755575
There is no objection to not standardizing field names. So I don't standardize field names.
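To make the type-code point concrete, here is a minimal pyarrow sketch. It is not the schema proposed in this PR; the child field names and type codes are made up for illustration. It shows that a consumer of a dense-union value can select the right child by type code, so the child names never need to be standardized.

```python
import pyarrow as pa

# Hypothetical dense-union "value" field; the field names are arbitrary labels.
children = [
    pa.array([120], type=pa.int64()),      # e.g. an exact statistic value
    pa.array([118.5], type=pa.float64()),  # e.g. an approximate statistic value
]
values = pa.UnionArray.from_dense(
    pa.array([0, 1], type=pa.int8()),      # type code chosen for each value
    pa.array([0, 0], type=pa.int32()),     # offset into the selected child
    children,
    field_names=["whatever_int64", "whatever_float64"],
    type_codes=[0, 1],
)

# A consumer resolves the child by type code, ignoring the field names.
child_index_by_code = dict(zip(values.type.type_codes,
                               range(values.type.num_fields)))
print(values.field(child_index_by_code[1]))  # the float64 child: [118.5]
```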
- Nullable
- Notes
* - key
- ``dictionary<indices: int32, dictionary: utf8>``
The only reason I can see why this would be problematic is that the statistics values would require more than one IPC message to represent. (Completely optional: this may not be an important consideration!)
Ah, I didn't notice that point. Thanks.
The current proposed specification doesn't focus on any particular transports/protocols/APIs/.... So this may not be a problem. If this representation doesn't fit our IPC format, users just won't use it.
@paleolimbot it's hard to shoehorn auxiliary data into the existing protocols. By using Arrow Arrays themselves to push statistics, we at least don't have to introduce another serialization format (like protobuf, JSON [how this proposal started], more flatbuffers...).
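As a concrete illustration of the dictionary-encoded key column quoted above, here is a minimal pyarrow sketch. Only ARROW:max_byte_width:approximate appears in this PR; the other key string is a made-up placeholder.

```python
import pyarrow as pa

# Minimal sketch of the "key" column type discussed above. Statistics names
# repeat across target columns, so dictionary encoding keeps the array small.
# "ARROW:row_count:exact" is a hypothetical placeholder, not taken from this PR.
keys = pa.array([
    "ARROW:max_byte_width:approximate",
    "ARROW:max_byte_width:approximate",
    "ARROW:row_count:exact",
]).dictionary_encode()

print(keys.type)  # e.g. dictionary<values=string, indices=int32, ordered=0>
```

Incidentally, pyarrow prints the dictionary type with values before indices, which is close to the dictionary<values=..., indices=...> spelling suggested later in this review.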
- The maximum size in bytes of a row in the target
  column. (exact)
* - ``ARROW:max_byte_width:approximate``
- ``float64``: TODO: Should we use ``int64`` instead?
The float64-ness makes sense to me here because the calculation providing this approximate value almost certainly returns a non-exact value (i.e., not an integer, even though the exact value is definitely an integer).
Related discussion: https://github.com/apache/arrow/pull/43553/files#r1871759234
There is no objection to using float64. So I use float64.
Thanks for sharing your opinions!
@github-actions crossbow submit preview-docs
Revision: 195c374 Submitted crossbow builds: ursacomputing/crossbow @ actions-f825363132
The vote thread: https://lists.apache.org/thread/3b3gfsxmvl0jxdnbnr89lvjmlvjk2nw5
Some suggestions. Please double-check them.
I'm not sure if my suggestion to change the way we describe the dictionary type makes sense, but I find dictionary<values=.., indices=..> much less confusing than dictionary<indices=.., dictionary=...>.
I think "statistics key" should be referred to as "statistics name".
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions crossbow submit preview-docs
Thanks for your suggestions!
Revision: 3b9c918 Submitted crossbow builds: ursacomputing/crossbow @ actions-ce837e4f33
Rationale for this change
Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plan.
The Apache Arrow format doesn't have statistics, but other formats that can
be read as Apache Arrow data may have them. For example, Apache
Parquet C++ can read an Apache Parquet file as Apache Arrow data, and an
Apache Parquet file may have statistics.
One of the Apache Arrow C streaming interface use cases is the following:
module A reads an Apache Parquet file as Apache Arrow data and passes it to
module B through the Arrow C data interface.
If module A can also pass the statistics associated with the Apache Parquet
file to module B, module B can use the statistics to optimize its
query plan.
What changes are included in this PR?
We standardize how to represent statistics as an Apache Arrow array
so that they are easy to exchange.
We don't standardize how to pass the statistics array. You can use any
interface for it. For example, you can use the Apache Arrow C data interface.
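As a rough illustration of the idea, statistics could be materialized and handed over like any other Arrow data. This is a deliberately simplified sketch, not the schema specified in this PR: the flat column/key/value layout and the float64-only value column are assumptions made to keep the example short.

```python
import pyarrow as pa

# Simplified sketch: one row per (target column, statistic) pair.
statistics = pa.record_batch(
    [
        pa.array([0, 1], type=pa.int32()),              # target column index
        pa.array(
            ["ARROW:max_byte_width:approximate"] * 2
        ).dictionary_encode(),                          # statistic name
        pa.array([8.0, 120.5], type=pa.float64()),      # statistic value
    ],
    names=["column", "key", "value"],
)

# A producer such as a Parquet reader can pass this batch to a consumer such
# as a query engine over the Arrow C data interface, Arrow IPC, or any other
# transport, since it is ordinary Arrow data.
print(statistics.schema)
```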
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.