Occasionally, when calling cooperative_insert from my own kernel, the function never returns.
I am running the code on an RTX 4090 with driver version 525.78.01, and CUDA 11.8.
I was able to reproduce this issue multiple times using the following code:
I ran the snippet twice and observed the issue in iterations 61 and 1699, respectively. In both cases, I had to terminate the process forcefully using CTRL+C. My modified_insert_kernel is almost identical to the default insertion kernel; it looks like this:
template <typename key_type, typename size_type, typename btree>
__global__ void modified_insert_kernel(
    const key_type* keys,
    const size_type keys_count,
    btree tree
) {
  auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
  auto block = cg::this_thread_block();
  auto tile = cg::tiled_partition<btree::branching_factor>(block);
  if ((thread_id - tile.thread_rank()) >= keys_count) { return; }
  auto key = btree::invalid_key;
  auto value = btree::invalid_value;
  bool to_insert = false;
  if (thread_id < keys_count) {
    key = keys[thread_id];
    value = thread_id;
    to_insert = true;
  }
  using allocator_type = typename btree::device_allocator_context_type;
  allocator_type allocator{tree.allocator_, tile};
  size_type num_inserted = 1;
  auto work_queue = tile.ballot(to_insert);
  while (work_queue) {
    auto cur_rank = __ffs(work_queue) - 1;
    auto cur_key = tile.shfl(key, cur_rank);
    auto cur_value = tile.shfl(value, cur_rank);
    tree.cooperative_insert(cur_key, cur_value, tile, allocator);
    if (tile.thread_rank() == cur_rank) { to_insert = false; }
    num_inserted++;
    work_queue = tile.ballot(to_insert);
  }
}
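For readers following the work-queue loop in the kernel above, here is a minimal host-side sketch of its control flow in plain C++. It is not the library's code and does not reproduce the hang. The helpers `ballot`, `find_first_set`, and `drain_work_queue` are hypothetical stand-ins for `tile.ballot`, `__ffs`, and the `while (work_queue)` loop, and the tile width is left to the caller (the value of `btree::branching_factor` is not stated in this thread). The sketch only illustrates that the loop clears one bit per iteration, so it should run once per pending key provided each `cooperative_insert` call returns.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for tile.ballot():
// bit i of the result is set when lane i's flag is true.
std::uint32_t ballot(const std::vector<bool>& flags) {
  std::uint32_t mask = 0;
  for (std::size_t lane = 0; lane < flags.size() && lane < 32; ++lane) {
    if (flags[lane]) { mask |= (1u << lane); }
  }
  return mask;
}

// Hypothetical host-side stand-in for __ffs():
// 1-based position of the lowest set bit, or 0 when no bit is set.
int find_first_set(std::uint32_t mask) {
  if (mask == 0) { return 0; }
  int position = 1;
  while ((mask & 1u) == 0) {
    mask >>= 1;
    ++position;
  }
  return position;
}

// Mirrors the kernel's while (work_queue) loop and returns the order in
// which lanes would have their key inserted.
std::vector<int> drain_work_queue(std::vector<bool> to_insert) {
  std::vector<int> service_order;
  std::uint32_t work_queue = ballot(to_insert);
  while (work_queue) {
    int cur_rank = find_first_set(work_queue) - 1;
    // The kernel calls tree.cooperative_insert(...) at this point.
    service_order.push_back(cur_rank);
    to_insert[cur_rank] = false;     // only the serviced lane clears its flag
    work_queue = ballot(to_insert);  // re-vote; one bit fewer each iteration
  }
  return service_order;
}
```

Calling `drain_work_queue` on a 16-lane tile in which only lanes 1, 4, and 5 hold keys yields the service order 1, 4, 5 after exactly three iterations. Because the loop itself always terminates in this model, a kernel that never returns would have to be stuck inside the insert call rather than in the queue logic, under the assumption that the device intrinsics behave like these stand-ins.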
Thanks, Justus. I was hoping to reproduce this on an RTX 2080, but it looks like I can't:
round 9998 starting
tree uses 0.445878 GB
round 9998 done
round 9999 starting
tree uses 0.445914 GB
round 9999 done
Driver Version: 520.61.05 CUDA Version: 11.8
I reduced the memory allocator size from 8 GiB to 4 GiB since I am limited on memory, but it is unlikely that this changes the behavior. I will try on a different modern GPU.