Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-17211: [C++] Add hash32 and hash64 scalar compute functions #45001

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Dec 11, 2024

Rationale for this change

Support for calculating elementwise hashes.

The PR adds to scalar functions hash32() and hash64() using the existing internal hashing machinery.

What changes are included in this PR?

Continuation of #39836 with the following changes:

  • Use column oriented hash-combine rather than flattening nested elements
  • Support arbitrary nesting levels with an optimization that only hash child arrays if they are also nested

Are these changes tested?

Partially, working on a proper testing suite.

Are there any user-facing changes?

There is a new compute kernel hash_64 available.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 11, 2024
@kszucs
Copy link
Member Author

kszucs commented Dec 11, 2024

Seems like we generate the same hash for both NULL and 0 which is not ideal.

In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: pc.hash_64([None])
Out[3]:
<pyarrow.lib.UInt64Array object at 0x124247be0>
[
  0
]

In [4]: pc.hash_64([0])
Out[4]:
<pyarrow.lib.UInt64Array object at 0x1033027a0>
[
  0
]

@github-actions github-actions bot added Component: Python awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 11, 2024
@kszucs kszucs marked this pull request as ready for review December 11, 2024 17:41
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 11, 2024
@kszucs kszucs changed the title GH-17211: [C++] Add hash_64 scalar compute function GH-17211: [C++] Add hash_64 scalar compute function Dec 11, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Dec 11, 2024
Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first glance comments. I'll look into more details later.

cpp/src/arrow/compute/api_scalar.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_hash.cc Outdated Show resolved Hide resolved
docs/source/cpp/compute.rst Outdated Show resolved Hide resolved
cpp/src/arrow/compute/api_scalar.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 13, 2024
@kszucs kszucs changed the title GH-17211: [C++] Add hash_64 scalar compute function GH-17211: [C++] Add hash32 and hash64 scalar compute functions Dec 13, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 13, 2024
@kszucs
Copy link
Member Author

kszucs commented Dec 20, 2024

Just asking - any idea if there is line of sight to this issue being closed? I am interested because, among other things, I believe this Arrow functionality is the underlying reason why pandas cannot currently group by a pyarrow-backed struct series.

The PR should be ready now. Do you have a reference to the relevant pandas issue perhaps?

@a-reich
Copy link
Contributor

a-reich commented Dec 21, 2024

@kszucs I couldn’t find an existing pandas issue so I created one, see pandas-dev/pandas#60594 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants