-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41909: [C++] Add arrow::ArrayStatistics #43273
Changes from all commits
bed8e3d
f7857b1
1ee4271
97a013c
0fbe9ef
b6b4754
2af686c
4639d58
e935d66
2c60782
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
// This empty .cc file is for embedding not inlined symbols in | ||
// arrow::ArrayStatistics into libarrow. | ||
|
||
#include "arrow/array/statistics.h" |
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,76 @@ | ||||||||||
// Licensed to the Apache Software Foundation (ASF) under one | ||||||||||
// or more contributor license agreements. See the NOTICE file | ||||||||||
// distributed with this work for additional information | ||||||||||
// regarding copyright ownership. The ASF licenses this file | ||||||||||
// to you under the Apache License, Version 2.0 (the | ||||||||||
// "License"); you may not use this file except in compliance | ||||||||||
// with the License. You may obtain a copy of the License at | ||||||||||
// | ||||||||||
// http://www.apache.org/licenses/LICENSE-2.0 | ||||||||||
// | ||||||||||
// Unless required by applicable law or agreed to in writing, | ||||||||||
// software distributed under the License is distributed on an | ||||||||||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||||||||||
// KIND, either express or implied. See the License for the | ||||||||||
// specific language governing permissions and limitations | ||||||||||
// under the License. | ||||||||||
|
||||||||||
#pragma once | ||||||||||
|
||||||||||
#include <cstdint> | ||||||||||
#include <optional> | ||||||||||
#include <string> | ||||||||||
#include <string_view> | ||||||||||
#include <variant> | ||||||||||
|
||||||||||
#include "arrow/util/float16.h" | ||||||||||
#include "arrow/util/visibility.h" | ||||||||||
|
||||||||||
namespace arrow { | ||||||||||
|
||||||||||
/// \brief Statistics for an Array | ||||||||||
/// | ||||||||||
/// Apache Arrow format doesn't have statistics but data source such | ||||||||||
/// as Apache Parquet may have statistics. Statistics associated with | ||||||||||
/// data source can be read unified API via this class. | ||||||||||
struct ARROW_EXPORT ArrayStatistics { | ||||||||||
using ValueType = | ||||||||||
std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, | ||||||||||
uint64_t, util::Float16, float, double, std::string, std::string_view>; | ||||||||||
Comment on lines
+38
to
+39
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is too much IMO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed with the above. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK. Let's simplify this: #43578 |
||||||||||
|
||||||||||
ArrayStatistics() = default; | ||||||||||
~ArrayStatistics() = default; | ||||||||||
Comment on lines
+41
to
+42
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are these required at all? Perhaps for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, they may not be needed. I didn't care about them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||||||||||
|
||||||||||
/// \brief The number of null values, may not be set | ||||||||||
std::optional<int64_t> null_count = std::nullopt; | ||||||||||
|
||||||||||
/// \brief The number of distinct values, may not be set | ||||||||||
std::optional<int64_t> distinct_count = std::nullopt; | ||||||||||
|
||||||||||
/// \brief The minimum value, may not be set | ||||||||||
std::optional<ValueType> min = std::nullopt; | ||||||||||
Comment on lines
+50
to
+51
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A naive question: how does this ordering being defined? Parquet has this definition https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L953-L1001 . The current definition would be great but problem would raise when float is in? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I missed it... It seems that Anyway, I'll add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I think there're not some differences on int/unsigned int. Float might able to define a IEEE754 total ordered float like NAN - -inf - numbers - inf like: https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for additional information. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks good to me since it's still unstable 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've opened an issue for it: #43533 |
||||||||||
|
||||||||||
/// \brief Whether the minimum value is exact or not, may not be set | ||||||||||
std::optional<bool> is_min_exact = std::nullopt; | ||||||||||
Comment on lines
+53
to
+54
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If not set, it means "might not exact"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. It means "unknown". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, if it's unknown, then it's not exact. The third state is not useful IMHO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm. It may be exact.
arrow/cpp/src/parquet/parquet.thrift Lines 279 to 282 in d078d5c
We may need There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the reason it's optional in Parquet is simply for compatibility with older files. @wgtmac Is it right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I agree with @pitrou. We can simply mark it as not exact if this field is missing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK. |
||||||||||
|
||||||||||
/// \brief The maximum value, may not be set | ||||||||||
std::optional<ValueType> max = std::nullopt; | ||||||||||
|
||||||||||
/// \brief Whether the maximum value is exact or not, may not be set | ||||||||||
std::optional<bool> is_max_exact = std::nullopt; | ||||||||||
|
||||||||||
/// \brief Check two statistics for equality | ||||||||||
bool Equals(const ArrayStatistics& other) const { | ||||||||||
return null_count == other.null_count && distinct_count == other.distinct_count && | ||||||||||
min == other.min && is_min_exact == other.is_min_exact && max == other.max && | ||||||||||
is_max_exact == other.is_max_exact; | ||||||||||
} | ||||||||||
|
||||||||||
/// \brief Check two statistics for equality | ||||||||||
bool operator==(const ArrayStatistics& other) const { return Equals(other); } | ||||||||||
|
||||||||||
/// \brief Check two statistics for not equality | ||||||||||
bool operator!=(const ArrayStatistics& other) const { return !Equals(other); } | ||||||||||
}; | ||||||||||
|
||||||||||
} // namespace arrow |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include <gtest/gtest.h> | ||
|
||
#include "arrow/array/statistics.h" | ||
|
||
namespace arrow { | ||
|
||
TEST(ArrayStatisticsTest, TestNullCount) { | ||
ArrayStatistics statistics; | ||
ASSERT_FALSE(statistics.null_count.has_value()); | ||
statistics.null_count = 29; | ||
ASSERT_TRUE(statistics.null_count.has_value()); | ||
ASSERT_EQ(29, statistics.null_count.value()); | ||
} | ||
|
||
TEST(ArrayStatisticsTest, TestDistinctCount) { | ||
ArrayStatistics statistics; | ||
ASSERT_FALSE(statistics.distinct_count.has_value()); | ||
statistics.distinct_count = 29; | ||
ASSERT_TRUE(statistics.distinct_count.has_value()); | ||
ASSERT_EQ(29, statistics.distinct_count.value()); | ||
} | ||
|
||
TEST(ArrayStatisticsTest, TestMin) { | ||
ArrayStatistics statistics; | ||
ASSERT_FALSE(statistics.min.has_value()); | ||
ASSERT_FALSE(statistics.is_min_exact.has_value()); | ||
statistics.min = static_cast<int32_t>(29); | ||
statistics.is_min_exact = true; | ||
ASSERT_TRUE(statistics.min.has_value()); | ||
ASSERT_TRUE(std::holds_alternative<int32_t>(statistics.min.value())); | ||
ASSERT_EQ(29, std::get<int32_t>(statistics.min.value())); | ||
ASSERT_TRUE(statistics.is_min_exact.has_value()); | ||
ASSERT_TRUE(statistics.is_min_exact.value()); | ||
} | ||
|
||
TEST(ArrayStatisticsTest, TestMax) { | ||
ArrayStatistics statistics; | ||
ASSERT_FALSE(statistics.max.has_value()); | ||
ASSERT_FALSE(statistics.is_max_exact.has_value()); | ||
statistics.max = std::string("hello"); | ||
statistics.is_max_exact = false; | ||
ASSERT_TRUE(statistics.max.has_value()); | ||
ASSERT_TRUE(std::holds_alternative<std::string>(statistics.max.value())); | ||
ASSERT_EQ("hello", std::get<std::string>(statistics.max.value())); | ||
ASSERT_TRUE(statistics.is_max_exact.has_value()); | ||
ASSERT_FALSE(statistics.is_max_exact.value()); | ||
} | ||
|
||
TEST(ArrayStatisticsTest, TestEquality) { | ||
ArrayStatistics statistics1; | ||
ArrayStatistics statistics2; | ||
|
||
ASSERT_EQ(statistics1, statistics2); | ||
|
||
statistics1.null_count = 29; | ||
ASSERT_NE(statistics1, statistics2); | ||
statistics2.null_count = 29; | ||
ASSERT_EQ(statistics1, statistics2); | ||
|
||
statistics1.distinct_count = 2929; | ||
ASSERT_NE(statistics1, statistics2); | ||
statistics2.distinct_count = 2929; | ||
ASSERT_EQ(statistics1, statistics2); | ||
|
||
statistics1.min = std::string_view("world"); | ||
ASSERT_NE(statistics1, statistics2); | ||
statistics2.min = std::string_view("world"); | ||
ASSERT_EQ(statistics1, statistics2); | ||
|
||
statistics1.is_min_exact = false; | ||
ASSERT_NE(statistics1, statistics2); | ||
statistics2.is_min_exact = false; | ||
ASSERT_EQ(statistics1, statistics2); | ||
|
||
statistics1.max = arrow::util::Float16(-29); | ||
ASSERT_NE(statistics1, statistics2); | ||
statistics2.max = arrow::util::Float16(-29); | ||
ASSERT_EQ(statistics1, statistics2); | ||
|
||
statistics1.is_max_exact = true; | ||
ASSERT_NE(statistics1, statistics2); | ||
statistics2.is_max_exact = true; | ||
ASSERT_EQ(statistics1, statistics2); | ||
} | ||
|
||
} // namespace arrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's meant to be usable with Parquet, then we also need to represent string/binary values (Parquet can store max/min stats of BYTE_ARRAY columns).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
It's a discussion point. See also: #42133 (comment)
We didn't have a good idea for string/binary values. But I have an idea now. How about adding both of
std::string
andstd::string_view
for them? If we can store min/max values, we can usestd::string_view
to avoid copy of them. Otherwise, we can usestd::string
for them.