-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41909: [C++] Add arrow::ArrayStatistics #43273
Conversation
Unity builds seems to be failing to link on the
|
/// | ||
/// Apache Arrow format doesn't have statistics but data source such | ||
/// as Apache Parquet may have statistics. Statistics associate with | ||
/// data source can be read unified API via this class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's meant to be usable with Parquet, then we also need to represent string/binary values (Parquet can store max/min stats of BYTE_ARRAY columns).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
It's a discussion point. See also: #42133 (comment)
We didn't have a good idea for string/binary values. But I have an idea now. How about adding both of std::string
and std::string_view
for them? If we can store min/max values, we can use std::string_view
to avoid copy of them. Otherwise, we can use std::string
for them.
cpp/src/arrow/array/statistics.h
Outdated
/// data source can be read unified API via this class. | ||
struct ARROW_EXPORT ArrayStatistics { | ||
public: | ||
using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "Buffer"? I would call it ValueType
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first PoC implementation, we have TypedArrayStatistics<T>::{min,max}()
. I wanted to use min
/max
for them instead of instance variables. So I added Buffer
/_buffer
suffix for instance variables and their type.
But TypedArrayStatistics<T>
from the first implementation. See also: #42133 (comment)
So we can remove Buffer
/_buffer
suffix for instance variables and their type.
I've renamed ElementBufferType
to ValueType
and {min,max}_buffer
to {min,max}
.
cpp/src/arrow/array/statistics.h
Outdated
/// as Apache Parquet may have statistics. Statistics associate with | ||
/// data source can be read unified API via this class. | ||
struct ARROW_EXPORT ArrayStatistics { | ||
public: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is superfluous as the default member visibility is public
in structs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I've removed it.
It's caused by not inlining some symbols in |
If nobody objects this, I'll merge this in this week. |
I didn't fully catch the things before, how does this able to react with C interface? Does |
The "react with C interface" refers the "[DISCUSS] Statistics through the C data interface" thread https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx , right? We will not use metadata for it. We will send an array that uses the following schema separately for it: https://lists.apache.org/thread/8j7sc0lwl3f9c77srpvkt83r43ktwwfr
So we don't store |
No more comments? |
/// \brief The minimum value, may not be set | ||
std::optional<ValueType> min = std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A naive question: how does this ordering being defined?
Parquet has this definition https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L953-L1001 . The current definition would be great but problem would raise when float is in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I missed it...
It seems that enum class Order {kUndefined, kTypeDefinedOrder};
is suitable for Parquet but it's not enough generic...
Anyway, I'll add Float16
, float
and double
to ValueType
. I missed them too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think there're not some differences on int/unsigned int. Float might able to define a IEEE754 total ordered float like NAN - -inf - numbers - inf like: https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for additional information.
Can we defer the ordering as a separated issue and discuss it on the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me since it's still unstable 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened an issue for it: #43533
/// \brief Whether the minimum value is exact or not, may not be set | ||
std::optional<bool> is_min_exact = std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not set, it means "might not exact"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It means "unknown".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if it's unknown, then it's not exact. The third state is not useful IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It may be exact.
is_max_value_exact
in the Parquet's definition is optional:
arrow/cpp/src/parquet/parquet.thrift
Lines 279 to 282 in d078d5c
/** If true, max_value is the actual maximum value for a column */ | |
7: optional bool is_max_value_exact; | |
/** If true, min_value is the actual minimum value for a column */ | |
8: optional bool is_min_value_exact; |
We may need std::optional
to support Parquet statistics. Can we keep this std::optional
for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason it's optional in Parquet is simply for compatibility with older files. @wgtmac Is it right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if it's unknown, then it's not exact. The third state is not useful IMHO.
I agree with @pitrou. We can simply mark it as not exact if this field is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Let's remove std::optional
: #43594
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for current types
See apacheGH-42133 how to use this for Apache Parquet statistics.
2c2e697
to
2c60782
Compare
LGTM |
I'll merge this. |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 39af73f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, | ||
uint64_t, util::Float16, float, double, std::string, std::string_view>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too much IMO. uint64_t
and int64_t
is enough to cover all integers. And double
is enough to cover all floating pointing statistics. string_view
confuses things a lot because this is supposed to have value semantics and not pointer semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Let's simplify this: #43578
ArrayStatistics() = default; | ||
~ArrayStatistics() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these required at all? Perhaps for ARROW_EXPORT
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, they may not be needed. I didn't care about them.
Let's try it: #43579
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rationale for this change
We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909.
If we have
arrow::ArrayStatistics
, we can attach statistics read from Apache Parquet toarrow::Array
s.This only includes
arrow::ArrayStatistics
. See GH-42133 how to usearrow::ArrayStatitics
for Apache Parquet's statistics.What changes are included in this PR?
This only adds
arrow::ArrayStatistics
and its tests.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.