Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41909: [C++] Add arrow::ArrayStatistics #43273

Merged
merged 10 commits into from
Aug 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions cpp/src/arrow/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,7 @@ arrow_add_object_library(ARROW_ARRAY
array/concatenate.cc
array/data.cc
array/diff.cc
array/statistics.cc
array/util.cc
array/validate.cc)

Expand Down Expand Up @@ -1168,6 +1169,7 @@ add_arrow_test(array_test
array/array_struct_test.cc
array/array_union_test.cc
array/array_view_test.cc
array/statistics_test.cc
PRECOMPILED_HEADERS
"$<$<COMPILE_LANGUAGE:CXX>:arrow/testing/pch.h>")

Expand Down
21 changes: 21 additions & 0 deletions cpp/src/arrow/array/statistics.cc
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

// This empty .cc file is for embedding not inlined symbols in
// arrow::ArrayStatistics into libarrow.

#include "arrow/array/statistics.h"
76 changes: 76 additions & 0 deletions cpp/src/arrow/array/statistics.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <variant>

#include "arrow/util/float16.h"
#include "arrow/util/visibility.h"

namespace arrow {

/// \brief Statistics for an Array
///
/// Apache Arrow format doesn't have statistics but data source such
/// as Apache Parquet may have statistics. Statistics associated with
/// data source can be read unified API via this class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's meant to be usable with Parquet, then we also need to represent string/binary values (Parquet can store max/min stats of BYTE_ARRAY columns).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.
It's a discussion point. See also: #42133 (comment)

We didn't have a good idea for string/binary values. But I have an idea now. How about adding both of std::string and std::string_view for them? If we can store min/max values, we can use std::string_view to avoid copy of them. Otherwise, we can use std::string for them.

struct ARROW_EXPORT ArrayStatistics {
using ValueType =
std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t,
uint64_t, util::Float16, float, double, std::string, std::string_view>;
Comment on lines +38 to +39
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too much IMO. uint64_t and int64_t is enough to cover all integers. And double is enough to cover all floating pointing statistics. string_view confuses things a lot because this is supposed to have value semantics and not pointer semantics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with the above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's simplify this: #43578


ArrayStatistics() = default;
~ArrayStatistics() = default;
Comment on lines +41 to +42
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these required at all? Perhaps for ARROW_EXPORT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, they may not be needed. I didn't care about them.
Let's try it: #43579

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need them based on the CI result of #43579 .
Let's remove them: #43592


/// \brief The number of null values, may not be set
std::optional<int64_t> null_count = std::nullopt;

/// \brief The number of distinct values, may not be set
std::optional<int64_t> distinct_count = std::nullopt;

/// \brief The minimum value, may not be set
std::optional<ValueType> min = std::nullopt;
Comment on lines +50 to +51
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A naive question: how does this ordering being defined?

Parquet has this definition https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L953-L1001 . The current definition would be great but problem would raise when float is in?

Copy link
Member Author

@kou kou Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed it...

It seems that enum class Order {kUndefined, kTypeDefinedOrder}; is suitable for Parquet but it's not enough generic...

Anyway, I'll add Float16, float and double to ValueType. I missed them too.

Copy link
Member

@mapleFU mapleFU Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there're not some differences on int/unsigned int. Float might able to define a IEEE754 total ordered float like NAN - -inf - numbers - inf like: https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for additional information.
Can we defer the ordering as a separated issue and discuss it on the issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me since it's still unstable 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened an issue for it: #43533


/// \brief Whether the minimum value is exact or not, may not be set
std::optional<bool> is_min_exact = std::nullopt;
Comment on lines +53 to +54
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not set, it means "might not exact"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It means "unknown".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it's unknown, then it's not exact. The third state is not useful IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It may be exact.

is_max_value_exact in the Parquet's definition is optional:

/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

We may need std::optional to support Parquet statistics. Can we keep this std::optional for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason it's optional in Parquet is simply for compatibility with older files. @wgtmac Is it right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it's unknown, then it's not exact. The third state is not useful IMHO.

I agree with @pitrou. We can simply mark it as not exact if this field is missing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
Let's remove std::optional: #43594


/// \brief The maximum value, may not be set
std::optional<ValueType> max = std::nullopt;

/// \brief Whether the maximum value is exact or not, may not be set
std::optional<bool> is_max_exact = std::nullopt;

/// \brief Check two statistics for equality
bool Equals(const ArrayStatistics& other) const {
return null_count == other.null_count && distinct_count == other.distinct_count &&
min == other.min && is_min_exact == other.is_min_exact && max == other.max &&
is_max_exact == other.is_max_exact;
}

/// \brief Check two statistics for equality
bool operator==(const ArrayStatistics& other) const { return Equals(other); }

/// \brief Check two statistics for not equality
bool operator!=(const ArrayStatistics& other) const { return !Equals(other); }
};

} // namespace arrow
103 changes: 103 additions & 0 deletions cpp/src/arrow/array/statistics_test.cc
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>

#include "arrow/array/statistics.h"

namespace arrow {

TEST(ArrayStatisticsTest, TestNullCount) {
ArrayStatistics statistics;
ASSERT_FALSE(statistics.null_count.has_value());
statistics.null_count = 29;
ASSERT_TRUE(statistics.null_count.has_value());
ASSERT_EQ(29, statistics.null_count.value());
}

TEST(ArrayStatisticsTest, TestDistinctCount) {
ArrayStatistics statistics;
ASSERT_FALSE(statistics.distinct_count.has_value());
statistics.distinct_count = 29;
ASSERT_TRUE(statistics.distinct_count.has_value());
ASSERT_EQ(29, statistics.distinct_count.value());
}

TEST(ArrayStatisticsTest, TestMin) {
ArrayStatistics statistics;
ASSERT_FALSE(statistics.min.has_value());
ASSERT_FALSE(statistics.is_min_exact.has_value());
statistics.min = static_cast<int32_t>(29);
statistics.is_min_exact = true;
ASSERT_TRUE(statistics.min.has_value());
ASSERT_TRUE(std::holds_alternative<int32_t>(statistics.min.value()));
ASSERT_EQ(29, std::get<int32_t>(statistics.min.value()));
ASSERT_TRUE(statistics.is_min_exact.has_value());
ASSERT_TRUE(statistics.is_min_exact.value());
}

TEST(ArrayStatisticsTest, TestMax) {
ArrayStatistics statistics;
ASSERT_FALSE(statistics.max.has_value());
ASSERT_FALSE(statistics.is_max_exact.has_value());
statistics.max = std::string("hello");
statistics.is_max_exact = false;
ASSERT_TRUE(statistics.max.has_value());
ASSERT_TRUE(std::holds_alternative<std::string>(statistics.max.value()));
ASSERT_EQ("hello", std::get<std::string>(statistics.max.value()));
ASSERT_TRUE(statistics.is_max_exact.has_value());
ASSERT_FALSE(statistics.is_max_exact.value());
}

TEST(ArrayStatisticsTest, TestEquality) {
ArrayStatistics statistics1;
ArrayStatistics statistics2;

ASSERT_EQ(statistics1, statistics2);

statistics1.null_count = 29;
ASSERT_NE(statistics1, statistics2);
statistics2.null_count = 29;
ASSERT_EQ(statistics1, statistics2);

statistics1.distinct_count = 2929;
ASSERT_NE(statistics1, statistics2);
statistics2.distinct_count = 2929;
ASSERT_EQ(statistics1, statistics2);

statistics1.min = std::string_view("world");
ASSERT_NE(statistics1, statistics2);
statistics2.min = std::string_view("world");
ASSERT_EQ(statistics1, statistics2);

statistics1.is_min_exact = false;
ASSERT_NE(statistics1, statistics2);
statistics2.is_min_exact = false;
ASSERT_EQ(statistics1, statistics2);

statistics1.max = arrow::util::Float16(-29);
ASSERT_NE(statistics1, statistics2);
statistics2.max = arrow::util::Float16(-29);
ASSERT_EQ(statistics1, statistics2);

statistics1.is_max_exact = true;
ASSERT_NE(statistics1, statistics2);
statistics2.is_max_exact = true;
ASSERT_EQ(statistics1, statistics2);
}

} // namespace arrow
Loading