Occasional insertion deadlocks #4

Open

slyphix opened this issue Feb 6, 2023 · 1 comment
slyphix commented Feb 6, 2023

Occasionally, when calling cooperative_insert from my own kernel, the function never returns.
I am running the code on an RTX 4090 with driver version 525.78.01 and CUDA 11.8.
I was able to reproduce this issue multiple times using the following code:

// Includes the original snippet assumed:
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_set>
#include <vector>
#include <cuda_runtime.h>

void investigate_tree_deadlock() {
    using key_type = uint32_t;
    using value_type = uint32_t;

    size_t build_size = size_t{1} << 25;
    key_type min_usable_key = 1;
    key_type max_usable_key = std::numeric_limits<key_type>::max() - 2;

    std::mt19937_64 gen(42);
    std::uniform_int_distribution<key_type> key_dist(min_usable_key, max_usable_key);
    std::vector<key_type> build_keys(build_size);
    std::unordered_set<key_type> build_keys_set;
    while (build_keys_set.size() < build_size) {
        key_type key = key_dist(gen);
        build_keys_set.insert(key);
    }
    std::copy(build_keys_set.begin(), build_keys_set.end(), build_keys.begin());
    std::sort(build_keys.begin(), build_keys.end());

    key_type* keys_on_gpu;
    cudaMalloc(&keys_on_gpu, build_size * sizeof(key_type));
    cudaMemcpy(keys_on_gpu, build_keys.data(), build_size * sizeof(key_type), cudaMemcpyHostToDevice);

    for (size_t i = 0; i < 10000; ++i) {
        std::cout << "round " << i << " starting" << std::endl;

        gpu_blink_tree<key_type, value_type, 16> tree;
        modified_insert_kernel<<<(build_size + 511) / 512, 512>>>(keys_on_gpu, build_size, tree);

        std::cout << "tree uses " << tree.compute_memory_usage() << " GB" << std::endl;
        std::cout << "round " << i << " done" << std::endl;
    }

    cudaFree(keys_on_gpu);
}
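
As an aside, the snippet has no error checking, so a failed allocation or launch could masquerade as a hang. A minimal check macro along these lines (the CUDA_CHECK name is illustrative, not from this repo) helps rule that out:

#include <cstdio>
#include <cstdlib>

// Illustrative error-checking macro: aborts with a message on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Usage (illustrative):
//   CUDA_CHECK(cudaMalloc(&keys_on_gpu, build_size * sizeof(key_type)));
//   modified_insert_kernel<<<(build_size + 511) / 512, 512>>>(keys_on_gpu, build_size, tree);
//   CUDA_CHECK(cudaGetLastError());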

I ran the snippet twice and observed the issue in iterations 61 and 1699, respectively. In both cases, I had to terminate the process forcefully using CTRL+C. My modified_insert_kernel is almost identical to the default insertion kernel; it looks like this:

// Includes the original snippet assumed:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

template <typename key_type, typename size_type, typename btree>
__global__ void modified_insert_kernel(
    const key_type* keys,
    const size_type keys_count,
    btree tree
) {
  auto thread_id = threadIdx.x + blockIdx.x * blockDim.x;
  auto block     = cg::this_thread_block();
  auto tile      = cg::tiled_partition<btree::branching_factor>(block);

  // If every thread in this tile is out of range, the whole tile exits together.
  if ((thread_id - tile.thread_rank()) >= keys_count) { return; }

  auto key       = btree::invalid_key;
  auto value     = btree::invalid_value;
  bool to_insert = false;
  if (thread_id < keys_count) {
    key       = keys[thread_id];
    value     = thread_id;
    to_insert = true;
  }
  using allocator_type = typename btree::device_allocator_context_type;
  allocator_type allocator{tree.allocator_, tile};

  // Tile-wide work queue: each pass elects the lowest-ranked thread that still
  // has a key, and the whole tile cooperates on that single insertion.
  size_type num_inserted = 1;
  auto work_queue        = tile.ballot(to_insert);
  while (work_queue) {
    auto cur_rank  = __ffs(work_queue) - 1;
    auto cur_key   = tile.shfl(key, cur_rank);
    auto cur_value = tile.shfl(value, cur_rank);

    tree.cooperative_insert(cur_key, cur_value, tile, allocator);

    if (tile.thread_rank() == cur_rank) { to_insert = false; }
    num_inserted++;
    work_queue = tile.ballot(to_insert);
  }
}
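
Rather than terminating the process with CTRL+C, the hang can also be surfaced as a timeout by polling the stream from the host after each launch. A minimal sketch using only standard CUDA runtime calls (the function name, timeout, and polling interval are illustrative, not from this repo):

#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Returns true once the default stream drains, false if it is still busy
// after `timeout` (i.e. the kernel is likely deadlocked).
bool wait_for_stream(std::chrono::seconds timeout) {
    auto start = std::chrono::steady_clock::now();
    while (cudaStreamQuery(0) == cudaErrorNotReady) {
        if (std::chrono::steady_clock::now() - start > timeout) {
            return false;  // still running: attach cuda-gdb instead of killing the process
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return true;  // stream idle; check cudaGetLastError() for launch errors
}

Calling wait_for_stream(std::chrono::seconds{30}) after the launch in each round would flag the stuck iteration without taking the process down.
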
maawad added a commit that referenced this issue Feb 7, 2023

maawad (Member) commented Feb 7, 2023

Thanks, Justus. I was hoping to reproduce this on an RTX 2080, but it looks like I can't:

round 9998 starting
tree uses 0.445878 GB
round 9998 done
round 9999 starting
tree uses 0.445914 GB
round 9999 done
Driver Version: 520.61.05    CUDA Version: 11.8 

I reduced the memory allocator size from 8 GiB to 4 GiB since I am limited on memory, but it is unlikely that this changes the behavior. Will try on a different modern GPU.

Update: I can reproduce on an A100.
