multi-gpu support for Ampere #6

Hi, I saw that in the `multi_gpu` branch the GPU architecture is statically specified to Volta (sm_70). Does multi-GPU support Ampere?

Comments
You can specify additional target architectures, e.g. for Ampere (see the sketch below). CC @Funatiq for visibility, since he is the PIC for gossip.
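A hedged sketch of what that could mean in practice (the real flags live in the project's build files, so the exact invocation below is an assumption): pass one `-gencode` pair per target architecture to nvcc, keeping the existing Volta target and adding an Ampere one.

```
# Hypothetical nvcc invocation; adapt to wherever the multi_gpu branch sets its CUDA flags.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     src/single_value_benchmark.cu -o single_value_benchmark
```

A binary built this way carries native code for both sm_70 and sm_80, so the same benchmark can run on Volta and Ampere machines.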
In fact, if I change the architectures in the build configuration, the benchmark hangs. From cuda-gdb:

```
$ sudo /usr/local/cuda/bin/cuda-gdb -p `nvidia-smi|tail -n2|head -n1|tr -s ' '|cut -d' ' -f5` -ex 'where'
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2
...
Thread 1 "single_value_be" received signal SIGURG, Urgent I/O condition.
0x00007fdef5feebec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#0  0x00007fdef5feebec in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007fdef61ee292 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fdef61eeda9 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fdef632dbe2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
--Type <RET> for more, q to quit, c to continue without paging--
#4  0x00007fdef5f9ea93 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fdef5f9ef81 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fdef5f9fef8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fdef61610c1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00005652212f45d9 in __cudart606 ()
#9  0x00005652212c333d in __cudart743 ()
#10 0x00005652213197b5 in cudaMemcpyAsync ()
#11 0x000056522129d7fb in warpcore::SingleValueHashTable<..., warpcore::hashers::MurmurHash<...>, 8ul>,
    warpcore::storage::key_value::AoSStore<...>, 2048ul>::size (stream=0x0, this=0x565223b18a70)
    at ../include/single_value_hash_table.cuh:584
#12 warpcore::SingleValueHashTable<..., warpcore::hashers::MurmurHash<...>, 8ul>,
    warpcore::storage::key_value::AoSStore<...>, 2048ul>::load_factor (stream=<optimized out>, this=0x565223b18a70)
    at ../include/single_value_hash_table.cuh:603
#13 single_value_benchmark<..., warpcore::hashers::MurmurHash<...>, 8ul>,
    warpcore::storage::key_value::AoSStore<...>, 2048ul> > (multi_split_overhead_factor=1.5, thermal_backoff=...,
    iters=5 '\005', load_factors=..., print_headers=true, transfer_plan=..., dev_ids=..., input_sizes=...)
    at src/single_value_benchmark.cu:222
#14 main (argc=1, argv=0x7ffcd4d1e218) at src/single_value_benchmark.cu:325
```

I will try the generator scripts in gossip later. Thanks for the reply.
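For context, frames #10-#12 suggest the hang happens while `size()` copies a device-side counter back to the host. A minimal sketch of that pattern (a guess at the general shape, not warpcore's actual code; `d_count` and `table_size` are hypothetical names):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// d_count is assumed to have been allocated with cudaMalloc while its owning
// device was current. If the caller's current device differs, the copy below
// can fail, and without error checking the failure goes unnoticed while the
// subsequent synchronization appears to wait forever.
std::size_t table_size(const std::size_t* d_count, cudaStream_t stream)
{
    std::size_t h_count = 0;
    cudaError_t err = cudaMemcpyAsync(&h_count, d_count, sizeof(h_count),
                                      cudaMemcpyDeviceToHost, stream);
    if (err != cudaSuccess)
        std::fprintf(stderr, "cudaMemcpyAsync: %s\n", cudaGetErrorString(err));
    cudaStreamSynchronize(stream);
    return h_count;
}
```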
I found that the infinite waiting of the multi-GPU program is caused by a GPU memory error. In

```cpp
cudaMemsetAsync(tmp, 0, sizeof(index_t), stream);
```

the memory is invalid, and the program does not check the CUDA error here. If I change

```cpp
for(uint32_t i = 0; i < num_gpus; ++i) {
    actual_load.emplace_back(hash_table[i].load_factor());
    status.emplace_back(hash_table[i].pop_status());
}
```

to

```cpp
for(uint32_t i = 0; i < num_gpus; ++i) {
    cudaSetDevice(dev_ids[i]); CUERR
    actual_load.emplace_back(hash_table[i].load_factor());
    status.emplace_back(hash_table[i].pop_status());
}
```

it works for me. It seems that this is the correct relation between memory and GPU? Although I wonder why it works on Volta without the explicit cudaSetDevice.
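The rule the fix relies on is that CUDA runtime calls operate in the context of the current device, so per-GPU resources should be touched with their owning device made current first. A self-contained sketch under that assumption (the CUERR stand-in below is illustrative; warpcore's actual macro may differ):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Stand-in for warpcore's CUERR macro: report and abort on the last CUDA error.
#define CUERR {                                                         \
    cudaError_t err = cudaGetLastError();                               \
    if (err != cudaSuccess) {                                           \
        std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",                \
                     cudaGetErrorString(err), __FILE__, __LINE__);      \
        std::exit(EXIT_FAILURE);                                        \
    }                                                                   \
}

int main()
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus); CUERR

    // Allocate one buffer per GPU, each while its owning device is current.
    std::vector<int*> buffers(num_gpus);
    for (int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i); CUERR
        cudaMalloc(&buffers[i], sizeof(int)); CUERR
    }

    // Operate on each buffer with its owning device current; skipping the
    // cudaSetDevice here is the kind of mismatch the comment above describes.
    for (int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i); CUERR
        cudaMemset(buffers[i], 0, sizeof(int)); CUERR
        cudaFree(buffers[i]); CUERR
    }
    return 0;
}
```

As for why it may still have worked on Volta without the explicit switch: with unified virtual addressing the runtime can sometimes resolve which device owns a pointer, so the mismatch is not guaranteed to fail on every platform, which makes the unchecked error easy to miss.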