Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41909: [C++] Add arrow::ArrayStatistics #43273

Merged
merged 10 commits into from
Aug 4, 2024

Conversation

kou
Copy link
Member

@kou kou commented Jul 16, 2024

Rationale for this change

We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909.

If we have arrow::ArrayStatistics, we can attach statistics read from Apache Parquet to arrow::Arrays.

This only includes arrow::ArrayStatistics. See GH-42133 how to use arrow::ArrayStatitics for Apache Parquet's statistics.

What changes are included in this PR?

This only adds arrow::ArrayStatistics and its tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@raulcd
Copy link
Member

raulcd commented Jul 16, 2024

Unity builds seems to be failing to link on the array-test:

[530/602] Linking CXX executable debug\arrow-array-test.exe
FAILED: debug/arrow-array-test.exe 
C:\Windows\system32\cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-array-test.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~2\2019\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\Hostx64\x64\link.exe /nologo src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_1_cxx.cxx.obj src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_0_cxx.cxx.obj  /out:debug\arrow-array-test.exe /implib:debug\arrow-array-test.lib /pdb:debug\arrow-array-test.pdb /version:0.0 /machine:x64  /NODEFAULTLIB:LIBCMT /debug /INCREMENTAL /subsystem:console  debug\arrow_testing.lib  debug\arrow_gmockd.lib  debug\arrow_gtest_maind.lib  debug\arrow.lib  ws2_32.lib  debug\arrow_gtestd.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
LINK Pass 1: command "C:\PROGRA~2\MICROS~2\2019\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\Hostx64\x64\link.exe /nologo src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_1_cxx.cxx.obj src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_0_cxx.cxx.obj /out:debug\arrow-array-test.exe /implib:debug\arrow-array-test.lib /pdb:debug\arrow-array-test.pdb /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT /debug /INCREMENTAL /subsystem:console debug\arrow_testing.lib debug\arrow_gmockd.lib debug\arrow_gtest_maind.lib debug\arrow.lib ws2_32.lib debug\arrow_gtestd.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:src\arrow\CMakeFiles\arrow-array-test.dir/intermediate.manifest src\arrow\CMakeFiles\arrow-array-test.dir/manifest.res" failed (exit code 1120) with the following output:
unity_1_cxx.cxx.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl arrow::ArrayStatistics::ArrayStatistics(void)" (__imp_??0ArrayStatistics@arrow@@QEAA@XZ) referenced in function "private: virtual void __cdecl arrow::ArrayStatisticsTest_TestNullCount_Test::TestBody(void)" (?TestBody@ArrayStatisticsTest_TestNullCount_Test@arrow@@EEAAXXZ)
unity_1_cxx.cxx.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: bool __cdecl arrow::ArrayStatistics::operator==(struct arrow::ArrayStatistics const &)const " (__imp_??8ArrayStatistics@arrow@@QEBA_NAEBU01@@Z) referenced in function "class testing::AssertionResult __cdecl testing::internal::CmpHelperEQ<struct arrow::ArrayStatistics,struct arrow::ArrayStatistics>(char const *,char const *,struct arrow::ArrayStatistics const &,struct arrow::ArrayStatistics const &)" (??$CmpHelperEQ@UArrayStatistics@arrow@@U12@@internal@testing@@YA?AVAssertionResult@1@PEBD0AEBUArrayStatistics@arrow@@1@Z)
unity_1_cxx.cxx.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: bool __cdecl arrow::ArrayStatistics::operator!=(struct arrow::ArrayStatistics const &)const " (__imp_??9ArrayStatistics@arrow@@QEBA_NAEBU01@@Z) referenced in function "class testing::AssertionResult __cdecl testing::internal::CmpHelperNE<struct arrow::ArrayStatistics,struct arrow::ArrayStatistics>(char const *,char const *,struct arrow::ArrayStatistics const &,struct arrow::ArrayStatistics const &)" (??$CmpHelperNE@UArrayStatistics@arrow@@U12@@internal@testing@@YA?AVAssertionResult@1@PEBD0AEBUArrayStatistics@arrow@@1@Z)
debug\arrow-array-test.exe : fatal error LNK1120: 3 unresolved externals

///
/// Apache Arrow format doesn't have statistics but data source such
/// as Apache Parquet may have statistics. Statistics associate with
/// data source can be read unified API via this class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's meant to be usable with Parquet, then we also need to represent string/binary values (Parquet can store max/min stats of BYTE_ARRAY columns).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.
It's a discussion point. See also: #42133 (comment)

We didn't have a good idea for string/binary values. But I have an idea now. How about adding both of std::string and std::string_view for them? If we can store min/max values, we can use std::string_view to avoid copy of them. Otherwise, we can use std::string for them.

/// data source can be read unified API via this class.
struct ARROW_EXPORT ArrayStatistics {
public:
using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, uint16_t,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "Buffer"? I would call it ValueType or similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first PoC implementation, we have TypedArrayStatistics<T>::{min,max}(). I wanted to use min/max for them instead of instance variables. So I added Buffer/_buffer suffix for instance variables and their type.

But TypedArrayStatistics<T> from the first implementation. See also: #42133 (comment)

So we can remove Buffer/_buffer suffix for instance variables and their type.
I've renamed ElementBufferType to ValueType and {min,max}_buffer to {min,max}.

/// as Apache Parquet may have statistics. Statistics associate with
/// data source can be read unified API via this class.
struct ARROW_EXPORT ArrayStatistics {
public:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is superfluous as the default member visibility is public in structs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I've removed it.

@pitrou pitrou requested a review from felipecrv July 16, 2024 15:05
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Jul 17, 2024
@kou
Copy link
Member Author

kou commented Jul 17, 2024

Unity builds seems to be failing to link on the array-test:

[530/602] Linking CXX executable debug\arrow-array-test.exe
FAILED: debug/arrow-array-test.exe 
C:\Windows\system32\cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-array-test.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~2\2019\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\Hostx64\x64\link.exe /nologo src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_1_cxx.cxx.obj src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_0_cxx.cxx.obj  /out:debug\arrow-array-test.exe /implib:debug\arrow-array-test.lib /pdb:debug\arrow-array-test.pdb /version:0.0 /machine:x64  /NODEFAULTLIB:LIBCMT /debug /INCREMENTAL /subsystem:console  debug\arrow_testing.lib  debug\arrow_gmockd.lib  debug\arrow_gtest_maind.lib  debug\arrow.lib  ws2_32.lib  debug\arrow_gtestd.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
LINK Pass 1: command "C:\PROGRA~2\MICROS~2\2019\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\Hostx64\x64\link.exe /nologo src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_1_cxx.cxx.obj src\arrow\CMakeFiles\arrow-array-test.dir\Unity\unity_0_cxx.cxx.obj /out:debug\arrow-array-test.exe /implib:debug\arrow-array-test.lib /pdb:debug\arrow-array-test.pdb /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT /debug /INCREMENTAL /subsystem:console debug\arrow_testing.lib debug\arrow_gmockd.lib debug\arrow_gtest_maind.lib debug\arrow.lib ws2_32.lib debug\arrow_gtestd.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:src\arrow\CMakeFiles\arrow-array-test.dir/intermediate.manifest src\arrow\CMakeFiles\arrow-array-test.dir/manifest.res" failed (exit code 1120) with the following output:
unity_1_cxx.cxx.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl arrow::ArrayStatistics::ArrayStatistics(void)" (__imp_??0ArrayStatistics@arrow@@QEAA@XZ) referenced in function "private: virtual void __cdecl arrow::ArrayStatisticsTest_TestNullCount_Test::TestBody(void)" (?TestBody@ArrayStatisticsTest_TestNullCount_Test@arrow@@EEAAXXZ)
unity_1_cxx.cxx.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: bool __cdecl arrow::ArrayStatistics::operator==(struct arrow::ArrayStatistics const &)const " (__imp_??8ArrayStatistics@arrow@@QEBA_NAEBU01@@Z) referenced in function "class testing::AssertionResult __cdecl testing::internal::CmpHelperEQ<struct arrow::ArrayStatistics,struct arrow::ArrayStatistics>(char const *,char const *,struct arrow::ArrayStatistics const &,struct arrow::ArrayStatistics const &)" (??$CmpHelperEQ@UArrayStatistics@arrow@@U12@@internal@testing@@YA?AVAssertionResult@1@PEBD0AEBUArrayStatistics@arrow@@1@Z)
unity_1_cxx.cxx.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: bool __cdecl arrow::ArrayStatistics::operator!=(struct arrow::ArrayStatistics const &)const " (__imp_??9ArrayStatistics@arrow@@QEBA_NAEBU01@@Z) referenced in function "class testing::AssertionResult __cdecl testing::internal::CmpHelperNE<struct arrow::ArrayStatistics,struct arrow::ArrayStatistics>(char const *,char const *,struct arrow::ArrayStatistics const &,struct arrow::ArrayStatistics const &)" (??$CmpHelperNE@UArrayStatistics@arrow@@U12@@internal@testing@@YA?AVAssertionResult@1@PEBD0AEBUArrayStatistics@arrow@@1@Z)
debug\arrow-array-test.exe : fatal error LNK1120: 3 unresolved externals

It's caused by not inlining some symbols in arrow::ArrayStatistics. I've added an empty statistics.cc to embed non-inlined symbols in arrow::ArrayStatistics to libarrow.

@kou
Copy link
Member Author

kou commented Jul 31, 2024

If nobody objects this, I'll merge this in this week.

@mapleFU
Copy link
Member

mapleFU commented Jul 31, 2024

I didn't fully catch the things before, how does this able to react with C interface? Does Statistics still need to be stored in key-value-metadata?

@kou
Copy link
Member Author

kou commented Jul 31, 2024

I didn't fully catch the things before, how does this able to react with C interface? Does Statistics still need to be stored in key-value-metadata?

The "react with C interface" refers the "[DISCUSS] Statistics through the C data interface" thread https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx , right?

We will not use metadata for it. We will send an array that uses the following schema separately for it:

https://lists.apache.org/thread/8j7sc0lwl3f9c77srpvkt83r43ktwwfr

map<
   // The column index or null if the statistics refer to whole table or batch.
  column: int32,
  // Statistics key is dictionary<int32, utf8>. Will will provide pre-defined key names. Custom key names
  // are also supported.
  // Different keys are assigned for exact value and approximate value.
  map<dictionary<int32, utf8>, dense_union<...needed types based on stat kinds in the keys...>
>

So we don't store arrow::Statistics to metadata. We'll provide a convenient function that converts arrow::Statisticss to a statistics array that uses the schema.

@kou
Copy link
Member Author

kou commented Aug 2, 2024

No more comments?

Comment on lines +49 to +51
/// \brief The minimum value, may not be set
std::optional<ValueType> min = std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A naive question: how does this ordering being defined?

Parquet has this definition https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L953-L1001 . The current definition would be great but problem would raise when float is in?

Copy link
Member Author

@kou kou Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed it...

It seems that enum class Order {kUndefined, kTypeDefinedOrder}; is suitable for Parquet but it's not enough generic...

Anyway, I'll add Float16, float and double to ValueType. I missed them too.

Copy link
Member

@mapleFU mapleFU Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there're not some differences on int/unsigned int. Float might able to define a IEEE754 total ordered float like NAN - -inf - numbers - inf like: https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for additional information.
Can we defer the ordering as a separated issue and discuss it on the issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me since it's still unstable 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened an issue for it: #43533

Comment on lines +52 to +54
/// \brief Whether the minimum value is exact or not, may not be set
std::optional<bool> is_min_exact = std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not set, it means "might not exact"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It means "unknown".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it's unknown, then it's not exact. The third state is not useful IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It may be exact.

is_max_value_exact in the Parquet's definition is optional:

/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

We may need std::optional to support Parquet statistics. Can we keep this std::optional for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason it's optional in Parquet is simply for compatibility with older files. @wgtmac Is it right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it's unknown, then it's not exact. The third state is not useful IMHO.

I agree with @pitrou. We can simply mark it as not exact if this field is missing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
Let's remove std::optional: #43594

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for current types

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 2, 2024
@kou kou force-pushed the cpp-array-statistics-only branch from 2c2e697 to 2c60782 Compare August 2, 2024 07:32
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 2, 2024
@mapleFU
Copy link
Member

mapleFU commented Aug 2, 2024

LGTM

@kou
Copy link
Member Author

kou commented Aug 4, 2024

I'll merge this.

@kou kou merged commit 39af73f into apache:main Aug 4, 2024
41 checks passed
@kou kou removed the awaiting changes Awaiting changes label Aug 4, 2024
@kou kou deleted the cpp-array-statistics-only branch August 4, 2024 07:39
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 39af73f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Comment on lines +38 to +39
std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t,
uint64_t, util::Float16, float, double, std::string, std::string_view>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too much IMO. uint64_t and int64_t is enough to cover all integers. And double is enough to cover all floating pointing statistics. string_view confuses things a lot because this is supposed to have value semantics and not pointer semantics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with the above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's simplify this: #43578

@github-actions github-actions bot added the awaiting changes Awaiting changes label Aug 5, 2024
Comment on lines +41 to +42
ArrayStatistics() = default;
~ArrayStatistics() = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these required at all? Perhaps for ARROW_EXPORT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, they may not be needed. I didn't care about them.
Let's try it: #43579

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need them based on the CI result of #43579 .
Let's remove them: #43592

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants