-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-17211: [C++] Add hash_64 scalar compute function #39836
base: main
Are you sure you want to change the base?
Conversation
sure! next week I'll have more time and I'll try to squeeze in a bit of time this week. |
@judahrand sorry, I ended up taking a break last week, but I happened to remember this today. I will work on this issue this week and next week and my goal will be to have it finished (or close to finished) by end of next week. I'll try to stage changes so that I can validate on a single nested structure like what is described in #32504 (list<>) and once I have some tests for that working I will try to accommodate arbitrary nesting (list<list<...>>). |
255d0be
to
23558bf
Compare
rebased on main branch |
@judahrand, I wanted to verify something with you since I assume you have an immediate use case in mind for #32504 . The naive implementation of hashing for a "shallow" nested type (e.g. What held up this PR was logic for "deep" nested types (e.g. # A "shallow" nested type that we should be able to reproduce (I think)
Reference Test data [length: 3]:
[
[ 1, 1, 2, 3, 4 ],
[ 2, 2, 4, 6, 8 ],
[ 3, 3, 6, 9, 12 ]
]
# A "deep" nested type that we may receive as input
Sample Test data (list<item: list<item: int64>>| [length: 3]):
[
[ [ 1 ], [ 1, 2, 3, 4 ] ],
[ [ 2 ], [ 2, 4, 6, 8 ] ],
[ [ 3 ], [ 3, 6, 9, 12 ] ]
] There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example). In the case that a different approach is preferred, other approaches can either be encoded as future options or a UDF provided instead of the existing hashing functions. I have prototyped a naive version of the flattening logic in a separate repo: recipe_convert.cpp#L150-L179. For context, this type of flattening would be used in the FastHashScalar::Exec function as a preprocessing step before calling the Hashing32::HashMultiColumn function. Thanks for any feedback you can provide! Particularly on expected behavior for nested array types. |
An alternate approach is to return a Status::NotImplemented("Cannot hash nested data types with many levels of nesting") if any of the columns to hash are "deep" nested types. |
I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing If that is the case then the 'value-based' behaviour would be surprising to me. I'd have thought that two rows with what is definitely different data should have different hashes (not considering the small chance of a hash collision). For the case of nested listed I wonder if this issue could be avoided by including the offsets of the parent list in the hash? |
Oh yeah, I guess I didn't make that part clear. But yes, you understood correctly. I was in favor of If you want those to have different hashes, then it may be possible to transform the list array into a struct array and one struct column can be all of the offsets coalesced and a second struct column can be all of the values (referenced). To do it any other way would require extending the existing hashing functions in non-trivial ways or adding new hashing functions entirely. So, maybe for this PR I can simply do shallow nested arrays and someone can work on the deep nested arrays in aa follow-up PR. |
This commit ports the latest state of scalar_hash kernels without pulling a long development history. This kernel is an element-wise function that uses an xxHash-like algorithm, prioritizing speed and not suitable for cryptographic purposes. The function is implemented by the `FastHashScalar` struct which is templated by the output type (which is assumed to be either UInt32 or UInt64, but there is no validation of that at the moment). The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc file as a reference (in cpp/src/arrow/util/), but only covers various input types and the key hashing functions (from key_hash.h). The tests in scalar_hash_test.cc use a simplified version of hashing based on what is implemented in the key_hash.cc. The idea being that the high-level entry points for high-level types should eventually reach an expected application of the low-level hash functions on simple data types; the tests do this exact comparison. At the moment, the tests pass for simple cases, but they do not work for nested types with non-trivial row layouts (e.g. ListTypes). Issue: ARROW-8991 Issue: apacheGH-17211
This commit pulls the latest changes to key_hash.h and implementations in light_array without the burden of a long development history. The only change in key_hash.h is the addition of a friend function which is used in scalar_hash_test.cc. Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the use of ArraySpan, which was introduced after light_array was written; and (2) the need for a KeyColumnArray to allocate data for the purposes of interpreting (or decoding) the structure of a nested type. The main reason for the 2nd scenario is that a ListArray may have many values represented in a single row which should be hashed together; however, if the ListArray has a nested ListArray or other type, the row may have further structure. In the simplest interpretation, only the highest-level structure (the "outer" ListArray) needs to be preserved, and any further nested structures must be explicitly handled by custom kernels (or any future options, etc. that are upstreamed into Arrow). In trying to efficiently interpret complex nested types, ArraySpan can be useful because it is non-owning, thus the main reason for the 1st aforementioned scenario. Although unfinished, any tests added to light_array_test.cc should accommodate the 2 scenarios above.
This commit includes changes to register a new compute function without the burden of a long development history. The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in compilation as it is used by the new Hash64 function defined in api_scalar.[h,cc]. The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes scalar_hash_test.cc in compilation for tests and it also adds a new benchmark binary that is implemented by scalar_hash_benchmark.cc. The registry files are updated to register the kernel implementations in scalar_hash.cc with the function definitions in api_scalar.[h,cc]. Finally, docs/source/cpp/compute.rst adds documentation for the Hash64 function. Issue: apacheGH-17211 Issue: ARROW-8991
This commit includes additions to the general hashing benchmarks that cover the use of hashing functions in key_hash.h without carrying the burden of a long dev history. Some existing benchmark names were changed to distinguish between the use of Int32 and Int64 types, new benchmarks were added that use the functions declared in key_hash.h. The reason the new benchmarks are added is because it is claimed they prioritize speed over cryptography as they're primarily used for join algorithms and other processing tasks, which the hashing benchmark can now provide observability for. Issue: apacheGH-17211 Issue: ARROW-8991
23558bf
to
b2af8c9
Compare
This is primarily to make the next commit a bit more focused.
Previous commits tried to add allocation mechanisms to KeyColumnArray to accommodate arbitrarily nested types. While some allocation is necessary to accommodate arbitrarily nested types, the work is being split into incremental steps and scope is being reduced to the straightforward nested types. Specifically, we want to support: ListArray, StructArray, and MapArray. We **only** support these types if they contain primitive types (e.g. List<int8> not List<List<int8>>, or Map<int64, float32> not Map<string, int32>). The difficulty with nested types beyond the first level is that multiple buffers of offsets must be accommodated and that support requires changes that either: (1) will propagate through key_hash code or (2) requires a (temporary) allocation of an array that flattens offsets into a single offset buffer.
Needed to add a check when constructing a KeyColumnArray for a ListArray that the element type of the ListArray is primitive.
This PR is a fresh continuation of #13487 which was closed purely in favor of this one. The reason for this is that the other PR carried a lot of "development burden" in the form of a long commit history. This fresh version maintains the current state of the code in much fewer commits with a cleaner history. Hopefully this PR will allow others to finish what I started (if I don't finish this first).
The changes still required is to add functional mechanisms for the following:
A TempVectorStack for local allocation within the kernel
Functional testing of a kernel that can handle a nested structure (e.g. a List[List[utf-8]]).
Closes: [C++][Compute] Add scalar_hash function #17211