Insertion with duplicate keys #7
I am wondering what kind of behavior this hash table has when duplicate keys exist in the inserted set of pairs. The kind of behavior I am looking for is to sum up the values of the duplicate keys after insertion.

Comments
Hi @mfbalin, the current implementation simply stores duplicate key-value pairs as separate entries. The reduction behavior that you are looking for would be inefficient to implement during insertion using the cuckoo hashing probing scheme, but instead, perhaps we can have a find-all with a reduction operator (e.g., sum). Reduction during find would work similarly to a find-all query, applying the operator to every match. If you are interested in any of these solutions, let me know, and I'll create another issue for it. You are welcome to implement and propose any addition and do a PR if you are interested.
Hi @maawad, thanks for your reply. Is there a way to create a reduced hash table after the insertion with duplicate keys is done? It could potentially be implemented with a pass over the whole hash table, inserting the unique keys into either a new table or the current one. For my use case, if each query behaves like an equal_range query, I am afraid it will slow down a lot.
If we have already inserted the keys with duplicates and then tried to perform the reduction, we will be using something like:

mapped_type find_all_and_reduce(key_type const& key,
                                tile_type const& tile,
                                BinaryOperation op = std::plus<>());

This would be an efficient solution for specific workloads. If you would like to share some information about the workload (e.g., key and value types, the expected ratio of duplicate keys, how many finds we will perform, space requirements, etc.), then I will be able to suggest more suitable ideas.
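For concreteness, here is a hedged host-side sketch of the semantics such a query would need to reproduce, written against a standard std::unordered_multimap as the reference container; the helper name reduce_for_key and the container choice are purely illustrative and not part of BGHT:

#include <functional>
#include <unordered_map>

// Reference semantics for "find all copies of `key` and reduce their values":
// an equal_range over the duplicates followed by a fold with `op`.
template <typename Key, typename T, typename BinaryOperation = std::plus<>>
T reduce_for_key(const std::unordered_multimap<Key, T>& map,
                 const Key& key,
                 T init = T{},
                 BinaryOperation op = BinaryOperation{}) {
  auto range = map.equal_range(key);
  T result = init;
  for (auto it = range.first; it != range.second; ++it) {
    result = op(result, it->second);  // fold every value stored under `key`
  }
  return result;
}

With std::plus and pairs inserted as {1, 2.0f} and {1, 3.0f}, a query for key 1 would return 5.0f, which is the summing behavior this issue asks for.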
My workload involves the insertion of a lot of these key-value pairs, where a pair can be <int, float> or <int, int>. After that, the hash table stays static and only queries for the reduced value are made. My current approach uses sorting and compacting, but I believe a hash table approach will be much faster, especially when it comes to queries; binary search is quite slow for the queries in my use case.
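As a point of reference, here is a hedged sketch of the sort-and-compact baseline described above using Thrust; the <int, float> pair type, the input values, and the vector names are assumptions for illustration only:

#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <vector>

int main() {
  // Example input with duplicate keys (illustrative values only).
  std::vector<int> h_keys{3, 1, 3, 7, 1};
  std::vector<float> h_values{1.f, 2.f, 4.f, 8.f, 16.f};
  thrust::device_vector<int> keys(h_keys.begin(), h_keys.end());
  thrust::device_vector<float> values(h_values.begin(), h_values.end());

  // Sort pairs by key so that duplicates become adjacent.
  thrust::sort_by_key(keys.begin(), keys.end(), values.begin());

  // Compact: one entry per unique key, duplicate values summed.
  thrust::device_vector<int> unique_keys(keys.size());
  thrust::device_vector<float> reduced_values(values.size());
  auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), values.begin(),
                                    unique_keys.begin(), reduced_values.begin());
  auto num_unique = ends.first - unique_keys.begin();

  // Each query is then a binary search over the compacted keys; this is the
  // step that the hash-table lookups are intended to replace.
  std::vector<int> h_queries{1, 7};
  thrust::device_vector<int> queries(h_queries.begin(), h_queries.end());
  thrust::device_vector<int> positions(queries.size());
  thrust::lower_bound(unique_keys.begin(), unique_keys.begin() + num_unique,
                      queries.begin(), queries.end(), positions.begin());
  return 0;
}

The reduced value for query i is then reduced_values[positions[i]] (after checking the key at that position actually matches); with a hash table, the binary search is replaced by a constant number of bucket probes per query.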
Or I could insert the pairs after the compaction has been done with sorting, but I think an efficiently implemented find_all_and_reduce should be much faster. I believe that a very efficient SpMV kernel can be written with such functionality, where the vertex IDs are not necessarily contiguous.
Yeah, I think if you are not wasting that much space on duplicates, then storing them should be okay. Any optimization will depend on the workload :) The performance of find-and-reduce will be the same as queries for keys that do not exist in the hash table: the cost is at most three bucket reads for any single find. The implementation will look like this:

template <class Key,
          class T,
          class Hash,
          class KeyEqual,
          cuda::thread_scope Scope,
          class Allocator,
          int B>
template <typename tile_type, typename BinaryOperation>
__device__ bght::bcht<Key, T, Hash, KeyEqual, Scope, Allocator, B>::mapped_type
bght::bcht<Key, T, Hash, KeyEqual, Scope, Allocator, B>::find_all_and_reduce(
    key_type const& key,
    tile_type const& tile,
    T init,
    BinaryOperation op) {
  // BCHT probes at most three buckets, one per hash function.
  const int num_hfs = 3;
  auto bucket_id = hf0_(key) % num_buckets_;
  value_type sentinel_pair{sentinel_key_, sentinel_value_};
  using bucket_type = detail::bucket<atomic_pair_type, value_type, tile_type>;
  mapped_type result = init;
  for (int hf = 0; hf < num_hfs; hf++) {
    bucket_type cur_bucket(&d_table_[bucket_id * bucket_size], tile);
    cur_bucket.load(cuda::memory_order_relaxed);
    int key_location = cur_bucket.find_key_location(key, key_equal{});
    if (key_location != -1) {
      // A copy of the key lives in this bucket: fold its value into the result.
      // (This sketch accumulates one match per bucket.)
      auto found_value = cur_bucket.get_value_from_lane(key_location);
      result = op(found_value, result);
    } else if (cur_bucket.compute_load(sentinel_pair) < bucket_size) {
      // The key is absent from a bucket that still has empty slots, so
      // insertion never spilled it into the remaining buckets: stop probing.
      return result;
    }
    // Move on to the bucket of the next hash function.
    bucket_id = hf == 0 ? hf1_(key) % num_buckets_ : hf2_(key) % num_buckets_;
  }
  return result;
}
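To connect the sketch above to the SpMV-style queries mentioned earlier, here is a hedged example of how such a device member might be driven from a kernel; the kernel, the sum_op functor, the one-query-per-tile launch shape, and the assumption that the table object can be passed to the kernel by value are all illustrative and not part of BGHT's documented API:

#include <cooperative_groups.h>
#include <cstddef>

namespace cg = cooperative_groups;

// Simple device-side sum functor used as the reduction operator.
struct sum_op {
  template <typename T>
  __device__ T operator()(T a, T b) const { return a + b; }
};

// Hypothetical kernel: each tile of B threads looks up one query key and
// writes the sum of all values stored under that key.
template <int B, typename HashMap, typename Key, typename T>
__global__ void reduce_queries_kernel(HashMap map,
                                      const Key* queries,
                                      T* results,
                                      std::size_t num_queries) {
  auto block = cg::this_thread_block();
  auto tile = cg::tiled_partition<B>(block);
  std::size_t query_id = (blockIdx.x * blockDim.x + threadIdx.x) / B;
  if (query_id >= num_queries) {
    return;
  }
  Key key = queries[query_id];
  // All lanes of the tile cooperate on the probes; duplicate values are summed.
  T reduced = map.find_all_and_reduce(key, tile, T{0}, sum_op{});
  if (tile.thread_rank() == 0) {
    results[query_id] = reduced;
  }
}

For the SpMV-like use case, the query keys would simply be the non-contiguous vertex IDs, with no remapping to a dense index range required.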