
use multi gpu test failure #280

Open
1556900941lizerui opened this issue Jan 13, 2025 · 5 comments
1556900941lizerui commented Jan 13, 2025

When I use the command ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4, the following error occurs:
c2211f46f318: Test NCCL failure common.cu:1005 'unhandled CUDA error'
.. c2211f46f318 pid 3288: Test failure common.cu:891

In my testing, the run hangs or errors out when 4 or more GPUs are used; with fewer GPUs (for example the 3-GPU run below) communication completes normally.

Using devices
#  Rank 0 Group 0 Pid 3590 on c2211f46f318 device 0 [0x16] NVIDIA GeForce RTX 4090
#  Rank 1 Group 0 Pid 3590 on c2211f46f318 device 1 [0x3c] NVIDIA GeForce RTX 4090
#  Rank 2 Group 0 Pid 3590 on c2211f46f318 device 2 [0x49] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    97.98    0.00    0.00      0    190.2    0.00    0.00      0
          16             4     float     sum      -1    135.4    0.00    0.00      0    115.8    0.00    0.00      0
          32             8     float     sum      -1    106.8    0.00    0.00      0    122.4    0.00    0.00      0
          64            16     float     sum      -1    105.3    0.00    0.00      0    101.4    0.00    0.00      0
         128            32     float     sum      -1    99.74    0.00    0.00      0    101.6    0.00    0.00      0
         256            64     float     sum      -1    98.20    0.00    0.00      0    154.2    0.00    0.00      0
         512           128     float     sum      -1    223.5    0.00    0.00      0    122.8    0.00    0.01      0
        1024           256     float     sum      -1    203.9    0.01    0.01      0    188.6    0.01    0.01      0
        2048           512     float     sum      -1    113.2    0.02    0.02      0    122.8    0.02    0.02      0
        4096          1024     float     sum      -1    111.8    0.04    0.05      0    101.9    0.04    0.05      0
        8192          2048     float     sum      -1    97.23    0.08    0.11      0    225.3    0.04    0.05      0
       16384          4096     float     sum      -1    103.7    0.16    0.21      0    159.6    0.10    0.14      0
       32768          8192     float     sum      -1    217.0    0.15    0.20      0    118.5    0.28    0.37      0
       65536         16384     float     sum      -1    114.4    0.57    0.76      0    229.2    0.29    0.38      0
      131072         32768     float     sum      -1    276.7    0.47    0.63      0    128.5    1.02    1.36      0
      262144         65536     float     sum      -1    139.0    1.89    2.51      0    142.9    1.83    2.45      0
      524288        131072     float     sum      -1    170.2    3.08    4.11      0    158.6    3.31    4.41      0
     1048576        262144     float     sum      -1    247.1    4.24    5.66      0    254.1    4.13    5.50      0
# Values out of range: 0 OK
# Avg bus bandwidth: 0.806538

The complete error log is as follows:

Using devices
#  Rank 0 Group 0 Pid 3719 on c2211f46f318 device 0 [0x16] NVIDIA GeForce RTX 4090
#  Rank 1 Group 0 Pid 3719 on c2211f46f318 device 1 [0x3c] NVIDIA GeForce RTX 4090
#  Rank 2 Group 0 Pid 3719 on c2211f46f318 device 2 [0x49] NVIDIA GeForce RTX 4090
#  Rank 3 Group 0 Pid 3719 on c2211f46f318 device 3 [0x54] NVIDIA GeForce RTX 4090
#  Rank 4 Group 0 Pid 3719 on c2211f46f318 device 4 [0x96] NVIDIA GeForce RTX 4090
#  Rank 5 Group 0 Pid 3719 on c2211f46f318 device 5 [0xbc] NVIDIA GeForce RTX 4090
#  Rank 6 Group 0 Pid 3719 on c2211f46f318 device 6 [0xc9] NVIDIA GeForce RTX 4090
#  Rank 7 Group 0 Pid 3719 on c2211f46f318 device 7 [0xd1] NVIDIA GeForce RTX 4090
c2211f46f318:3719:3719 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.3<0>
c2211f46f318:3719:3719 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so), using internal implementation
c2211f46f318:3719:3719 [7] NCCL INFO cudaDriverVersion 12060
NCCL version 2.13.4+cuda11.7
c2211f46f318:3719:3740 [0] NCCL INFO Failed to open libibverbs.so[.1]
c2211f46f318:3719:3740 [0] NCCL INFO NET/Socket: Using [0]eth0:172.17.0.3<0>
c2211f46f318:3719:3740 [0] NCCL INFO Using network Socket
c2211f46f318:3719:3741 [1] NCCL INFO Using network Socket
c2211f46f318:3719:3743 [3] NCCL INFO Using network Socket
c2211f46f318:3719:3742 [2] NCCL INFO Using network Socket
c2211f46f318:3719:3745 [5] NCCL INFO Using network Socket
c2211f46f318:3719:3744 [4] NCCL INFO Using network Socket
c2211f46f318:3719:3746 [6] NCCL INFO Using network Socket
c2211f46f318:3719:3747 [7] NCCL INFO Using network Socket
c2211f46f318:3719:3740 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
c2211f46f318:3719:3746 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
c2211f46f318:3719:3742 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c2211f46f318:3719:3747 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
c2211f46f318:3719:3740 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
c2211f46f318:3719:3745 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
c2211f46f318:3719:3743 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
c2211f46f318:3719:3744 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
c2211f46f318:3719:3740 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
c2211f46f318:3719:3741 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0

c2211f46f318:3719:3742 [2] misc/shmutils.cc:62 NCCL WARN Cuda failure 'out of memory'
c2211f46f318:3719:3742 [2] NCCL INFO transport/shm.cc:106 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO transport.cc:33 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO transport.cc:89 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO init.cc:773 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO init.cc:1045 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO group.cc:43 -> 1 [Async thread]
kiskra-nvidia (Member) commented:

It looks like you might be running out of host shared memory. Please see if anything under https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#shared-memory applies to your case.

Also, could you try with a more recent NCCL release? 2.13.4 is 2.5 years old at this point...
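
Regarding the shared-memory point above, a minimal sketch of how the container could be (re)started with more shared memory, assuming Docker is the container runtime; the image name and the 16g size are placeholders, not values from this issue:

# Option 1: give the container a larger /dev/shm (16g is an example value)
docker run --gpus all --shm-size=16g -it <your-image> bash

# Option 2: share the host IPC namespace so /dev/shm is not limited by the container
docker run --gpus all --ipc=host -it <your-image> bash

# Inside the container, verify the size of /dev/shm
df -h /dev/shm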

1556900941lizerui (Author) commented Jan 14, 2025

I currently have a WSL2 Ubuntu environment created on Windows Server 2022, with two Ubuntu containers imported into it:

  1. The first container (i.e., the container above) has 6G of shared memory, NCCL version 2.13.4, and CUDA version 11.7.

  2. The second container (using the official image provided by Qianwen) has 64G of shared memory, NCCL version 2.17.1, and CUDA version 12.1. The same communication failure occurs when more than 2 GPUs are used. The same situation also occurs with 2 GPUs, and when the test cannot complete, the following error appears:

dcf3743f216b: Test CUDA failure common.cu:295 'an illegal memory access was encountered'

.. dcf3743f216b pid 1025: Test failure common.cu:405

.. dcf3743f216b pid 1025: Test failure common.cu:592

.. dcf3743f216b pid 1025: Test failure in all_reduce.cu at line 90

.. dcf3743f216b pid 1025: Test failure common.cu:623

.. dcf3743f216b pid 1025: Test failure common.cu:1078

.. dcf3743f216b pid 1025: Test failure common.cu:891

I'm not sure if this is related to my WSL2 environment.
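
One hedged diagnostic, assuming the standard NCCL environment variables are honored by this build, is to disable NCCL's shared-memory transport and re-run, to see whether the shm path is where the failure occurs:

# Diagnostic sketch: NCCL_SHM_DISABLE and NCCL_DEBUG are documented NCCL env vars;
# if this run succeeds, the failure is likely in the shared-memory transport
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4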

This is my graphics card driver version:

root@dcf3743f216b:/data/nccl-tests-master# nvidia-smi
Tue Jan 14 01:55:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.27                 Driver Version: 560.70         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:16:00.0 Off |                  Off |
| 30%   27C    P8             16W /  450W |    1532MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3C:00.0 Off |                  Off |
| 30%   24C    P8              5W /  450W |    3730MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:49:00.0 Off |                  Off |
| 30%   26C    P8             15W /  450W |   22878MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:54:00.0 Off |                  Off |
| 30%   25C    P8              6W /  450W |   22918MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        On  |   00000000:96:00.0 Off |                  Off |
| 30%   25C    P8             26W /  450W |      15MiB /  24564MiB |      0%      Default |

kiskra-nvidia (Member) commented:

I'm looking at the memory usage in your nvidia-smi output. GPUs 2 and 3 have hardly any free memory left, in spite of not being otherwise busy at all. Any idea what might be going on there? It would explain why you don't see a problem when running with fewer than 3 GPUs, since in that case only GPUs 0 and 1 are being used, and they have sufficient free memory available...

You should be able to confirm this by running with just 2 GPUs (which would normally succeed, correct?) but with CUDA_VISIBLE_DEVICES=2,3.
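
A concrete form of that check, reusing the flags from the command earlier in this issue, might look like:

# Run the 2-GPU test but pin it to GPUs 2 and 3, which show almost no free memory
CUDA_VISIBLE_DEVICES=2,3 ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2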

1556900941lizerui (Author) commented:

I closed the program and retested, reducing the message size to 1 MB. The same problem occurred, but everything was normal when I set the number of GPUs to 2.
root@4ca1fd04d760:/data/nccl-tests-master# ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4

nThread 1 nGpus 4 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
#  Rank 0 Group 0 Pid 309 on 4ca1fd04d760 device 0 [0x16] NVIDIA GeForce RTX 4090
#  Rank 1 Group 0 Pid 309 on 4ca1fd04d760 device 1 [0x3c] NVIDIA GeForce RTX 4090
#  Rank 2 Group 0 Pid 309 on 4ca1fd04d760 device 2 [0x49] NVIDIA GeForce RTX 4090
#  Rank 3 Group 0 Pid 309 on 4ca1fd04d760 device 3 [0x96] NVIDIA GeForce RTX 4090

4ca1fd04d760: Test NCCL failure common.cu:1005 'unhandled cuda error / '
.. 4ca1fd04d760 pid 309: Test failure common.cu:891

Current graphics card information:
| NVIDIA-SMI 560.27                 Driver Version: 560.70         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:16:00.0 Off |                  Off |
| 30%   26C    P8             16W /  450W |    1557MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3C:00.0 Off |                  Off |
| 30%   24C    P8              5W /  450W |    3732MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:49:00.0 Off |                  Off |
| 30%   26C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:54:00.0 Off |                  Off |
| 30%   24C    P8              5W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        On  |   00000000:96:00.0 Off |                  Off |
| 30%   25C    P8             24W /  450W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        On  |   00000000:BC:00.0 Off |                  Off |
| 30%   26C    P8              7W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        On  |   00000000:C9:00.0 Off |                  Off |
| 30%   27C    P8              9W /  450W |      77MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        On  |   00000000:D1:00.0 Off |                  Off |
| 30%   29C    P8             13W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |

1556900941lizerui (Author) commented:

I am sorry, I pasted the image here. I also found that when the GPU count is set to 3, it works initially, but after I open another container for testing and then return to the current container to test again, the same error occurs.

[image attachment]
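
As an aside, one way to see which processes hold memory on the affected GPUs before and after opening the second container could be the query below; note that process listings can be incomplete under WSL2:

# List compute processes and the GPU memory each one holds
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv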
