Occasional insertion deadlocks #4

Open

slyphix opened this issue Feb 6, 2023 · 1 comment
slyphix commented Feb 6, 2023

Occasionally, when calling cooperative_insert from my own kernel, the function never returns.
I am running the code on an RTX 4090 with driver version 525.78.01 and CUDA 11.8.
I was able to reproduce this issue multiple times using the following code:

// Includes the original snippet assumed:
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_set>
#include <vector>
#include <cuda_runtime.h>

void investigate_tree_deadlock() {
    using key_type = uint32_t;
    using value_type = uint32_t;

    size_t build_size = size_t{1} << 25;
    key_type min_usable_key = 1;
    key_type max_usable_key = std::numeric_limits<key_type>::max() - 2;

    std::mt19937_64 gen(42);
    std::uniform_int_distribution<key_type> key_dist(min_usable_key, max_usable_key);
    std::vector<key_type> build_keys(build_size);
    std::unordered_set<key_type> build_keys_set;
    while (build_keys_set.size() < build_size) {
        key_type key = key_dist(gen);
        build_keys_set.insert(key);
    }
    std::copy(build_keys_set.begin(), build_keys_set.end(), build_keys.begin());
    std::sort(build_keys.begin(), build_keys.end());

    key_type* keys_on_gpu;
    cudaMalloc(&keys_on_gpu, build_size * sizeof(key_type));
    cudaMemcpy(keys_on_gpu, build_keys.data(), build_size * sizeof(key_type), cudaMemcpyHostToDevice);

    for (size_t i = 0; i < 10000; ++i) {
        std::cout << "round " << i << " starting" << std::endl;

        gpu_blink_tree<key_type, value_type, 16> tree;
        modified_insert_kernel<<<(build_size + 511) / 512, 512>>>(keys_on_gpu, build_size, tree);

        std::cout << "tree uses " << tree.compute_memory_usage() << " GB" << std::endl;
        std::cout << "round " << i << " done" << std::endl;
    }

    cudaFree(keys_on_gpu);
}
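
As an aside, the snippet has no error checking, so a failed allocation or launch could masquerade as a hang. A minimal check macro along these lines (the CUDA_CHECK name is illustrative, not from this repo) helps rule that out:

#include <cstdio>
#include <cstdlib>

// Illustrative error-checking macro: aborts with a message on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Usage (illustrative):
//   CUDA_CHECK(cudaMalloc(&keys_on_gpu, build_size * sizeof(key_type)));
//   modified_insert_kernel<<<(build_size + 511) / 512, 512>>>(keys_on_gpu, build_size, tree);
//   CUDA_CHECK(cudaGetLastError());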

I ran the snippet twice and observed the issue in iterations 61 and 1699, respectively. In both cases, I had to terminate the process forcefully using CTRL+C. My modified_insert_kernel is almost identical to the default insertion kernel; it looks like this:

// Includes the original snippet assumed:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

template <typename key_type, typename size_type, typename btree>
__global__ void modified_insert_kernel(
    const key_type* keys,
    const size_type keys_count,
    btree tree
) {
  auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
  auto block     = cg::this_thread_block();
  auto tile      = cg::tiled_partition<btree::branching_factor>(block);

  // If every thread in this tile is out of range, the whole tile exits together.
  if ((thread_id - tile.thread_rank()) >= keys_count) { return; }

  auto key       = btree::invalid_key;
  auto value     = btree::invalid_value;
  bool to_insert = false;
  if (thread_id < keys_count) {
    key       = keys[thread_id];
    value     = thread_id;
    to_insert = true;
  }
  using allocator_type = typename btree::device_allocator_context_type;
  allocator_type allocator{tree.allocator_, tile};

  // Tile-wide work queue: each pass elects the lowest-ranked thread that still
  // has a key, and the whole tile cooperates on that single insertion.
  size_type num_inserted = 1;
  auto work_queue        = tile.ballot(to_insert);
  while (work_queue) {
    auto cur_rank  = __ffs(work_queue) - 1;
    auto cur_key   = tile.shfl(key, cur_rank);
    auto cur_value = tile.shfl(value, cur_rank);

    tree.cooperative_insert(cur_key, cur_value, tile, allocator);

    if (tile.thread_rank() == cur_rank) { to_insert = false; }
    num_inserted++;
    work_queue = tile.ballot(to_insert);
  }
}
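
Rather than terminating the process with CTRL+C, the hang can also be surfaced as a timeout by polling the stream from the host after each launch. A minimal sketch using only standard CUDA runtime calls (the function name, timeout, and polling interval are illustrative, not from this repo):

#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Returns true once the default stream drains, false if it is still busy
// after `timeout` (i.e. the kernel is likely deadlocked).
bool wait_for_stream(std::chrono::seconds timeout) {
    auto start = std::chrono::steady_clock::now();
    while (cudaStreamQuery(0) == cudaErrorNotReady) {
        if (std::chrono::steady_clock::now() - start > timeout) {
            return false;  // still running: attach cuda-gdb instead of killing the process
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return true;  // stream idle; check cudaGetLastError() for launch errors
}

Calling wait_for_stream(std::chrono::seconds{30}) after the launch in each round would flag the stuck iteration without taking the process down.
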
maawad added a commit that referenced this issue Feb 7, 2023

maawad (Member) commented Feb 7, 2023

Thanks, Justus. I was hoping to reproduce this on an RTX 2080, but it looks like I can't:

round 9998 starting
tree uses 0.445878 GB
round 9998 done
round 9999 starting
tree uses 0.445914 GB
round 9999 done
Driver Version: 520.61.05    CUDA Version: 11.8 

I reduced the memory allocator size from 8 GiB to 4 GiB since I am limited on memory, but it is unlikely that this changes the behavior. Will try on a different modern GPU.

Update: I can reproduce on an A100.
