use multi gpu test failure #280
It appears that you might be running out of host shared memory. Please check whether anything under https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#shared-memory applies to your case. Also, could you try a more recent NCCL release? 2.13.4 is 2.5 years old at this point...
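For reference, the shared-memory section of that page points out that containers often start with a very small default /dev/shm. A minimal sketch of enlarging it, assuming the Ubuntu containers are launched with Docker (the image name and size below are illustrative only, not taken from this issue):

# Option 1: give the container a larger private /dev/shm (Docker's default is 64 MB)
docker run --gpus all --shm-size=1g -it ubuntu:22.04
# Option 2: share the host's /dev/shm with the container instead
docker run --gpus all --ipc=host -it ubuntu:22.04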
I currently have a WSL2 Ubuntu environment created in Windows Server 2022, and have imported two Ubuntu containers into the environment
dcf3743f216b: Test CUDA failure common.cu:295 'an illegal memory access was encountered'
.. dcf3743f216b pid 1025: Test failure common.cu:405
.. dcf3743f216b pid 1025: Test failure common.cu:592
.. dcf3743f216b pid 1025: Test failure in all_reduce.cu at line 90
.. dcf3743f216b pid 1025: Test failure common.cu:623
.. dcf3743f216b pid 1025: Test failure common.cu:1078
.. dcf3743f216b pid 1025: Test failure common.cu:891
I'm not sure if this is related to my WSL2 environment. This is my graphics card driver version:
I'm looking at the memory usage in your ... You should be able to confirm by running with just 2 GPUs (which would normally succeed, correct?) but with ...
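The exact sizes suggested above were not captured in this comment, but a hypothetical pair of runs that would separate "number of GPUs" from "memory used per GPU" could look like the following (the buffer sizes are illustrative; only the flags already used elsewhere in this issue are assumed):

# 2 GPUs, but with a much larger maximum message size
./build/all_reduce_perf -b 8 -e 512M -f 2 -g 2
# 4 GPUs, but with a small maximum message size
./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4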
I closed the program and retested, reducing the data size to 1 MB. The same problem occurred, but everything was normal when I adjusted the number of GPUs to 2.
nThread 1 nGpus 4 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 309 on 4ca1fd04d760 device 0 [0x16] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 309 on 4ca1fd04d760 device 1 [0x3c] NVIDIA GeForce RTX 4090
Rank 2 Group 0 Pid 309 on 4ca1fd04d760 device 2 [0x49] NVIDIA GeForce RTX 4090
Rank 3 Group 0 Pid 309 on 4ca1fd04d760 device 3 [0x96] NVIDIA GeForce RTX 4090
4ca1fd04d760: Test NCCL failure common.cu:1005 'unhandled cuda error / '
Current graphics card information:
When I use the command ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4, the following error occurs:
c2211f46f318: Test NCCL failure common.cu:1005 'unhandled CUDA error'
.. c2211f46f318 pid 3288: Test failure common.cu:891
In my testing, the run hangs or errors out whenever 4 or more GPUs are used, while communication works normally when fewer than 3 GPUs are used.
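For reproducing that comparison, a small sweep over GPU counts can be scripted; this sketch reuses the same byte range as the command above and assumes nothing beyond the flags already shown in this issue:

# Try each GPU count in turn; the failing count should be the first to error or hang
for g in 2 3 4; do
    echo "=== running with $g GPUs ==="
    ./build/all_reduce_perf -b 8 -e 1M -f 2 -g "$g"
done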
The complete error log is as follows: