
use multi gpu test failure #280

Open
1556900941lizerui opened this issue Jan 13, 2025 · 5 comments
1556900941lizerui commented Jan 13, 2025

When I use the command ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4, the following error occurs:
c2211f46f318: Test NCCL failure common.cu:1005 'unhandled CUDA error'
.. c2211f46f318 pid 3288: Test failure common.cu:891

In my testing, the run hangs or errors out when 4 or more GPUs are used; with fewer GPUs (for example the 3-GPU run below) communication completes normally.

Using devices
#  Rank 0 Group 0 Pid 3590 on c2211f46f318 device 0 [0x16] NVIDIA GeForce RTX 4090
#  Rank 1 Group 0 Pid 3590 on c2211f46f318 device 1 [0x3c] NVIDIA GeForce RTX 4090
#  Rank 2 Group 0 Pid 3590 on c2211f46f318 device 2 [0x49] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    97.98    0.00    0.00      0    190.2    0.00    0.00      0
          16             4     float     sum      -1    135.4    0.00    0.00      0    115.8    0.00    0.00      0
          32             8     float     sum      -1    106.8    0.00    0.00      0    122.4    0.00    0.00      0
          64            16     float     sum      -1    105.3    0.00    0.00      0    101.4    0.00    0.00      0
         128            32     float     sum      -1    99.74    0.00    0.00      0    101.6    0.00    0.00      0
         256            64     float     sum      -1    98.20    0.00    0.00      0    154.2    0.00    0.00      0
         512           128     float     sum      -1    223.5    0.00    0.00      0    122.8    0.00    0.01      0
        1024           256     float     sum      -1    203.9    0.01    0.01      0    188.6    0.01    0.01      0
        2048           512     float     sum      -1    113.2    0.02    0.02      0    122.8    0.02    0.02      0
        4096          1024     float     sum      -1    111.8    0.04    0.05      0    101.9    0.04    0.05      0
        8192          2048     float     sum      -1    97.23    0.08    0.11      0    225.3    0.04    0.05      0
       16384          4096     float     sum      -1    103.7    0.16    0.21      0    159.6    0.10    0.14      0
       32768          8192     float     sum      -1    217.0    0.15    0.20      0    118.5    0.28    0.37      0
       65536         16384     float     sum      -1    114.4    0.57    0.76      0    229.2    0.29    0.38      0
      131072         32768     float     sum      -1    276.7    0.47    0.63      0    128.5    1.02    1.36      0
      262144         65536     float     sum      -1    139.0    1.89    2.51      0    142.9    1.83    2.45      0
      524288        131072     float     sum      -1    170.2    3.08    4.11      0    158.6    3.31    4.41      0
     1048576        262144     float     sum      -1    247.1    4.24    5.66      0    254.1    4.13    5.50      0
# Values out of range: 0 OK
# Avg bus bandwidth: 0.806538

The complete error log is as follows:

Using devices
#  Rank 0 Group 0 Pid 3719 on c2211f46f318 device 0 [0x16] NVIDIA GeForce RTX 4090
#  Rank 1 Group 0 Pid 3719 on c2211f46f318 device 1 [0x3c] NVIDIA GeForce RTX 4090
#  Rank 2 Group 0 Pid 3719 on c2211f46f318 device 2 [0x49] NVIDIA GeForce RTX 4090
#  Rank 3 Group 0 Pid 3719 on c2211f46f318 device 3 [0x54] NVIDIA GeForce RTX 4090
#  Rank 4 Group 0 Pid 3719 on c2211f46f318 device 4 [0x96] NVIDIA GeForce RTX 4090
#  Rank 5 Group 0 Pid 3719 on c2211f46f318 device 5 [0xbc] NVIDIA GeForce RTX 4090
#  Rank 6 Group 0 Pid 3719 on c2211f46f318 device 6 [0xc9] NVIDIA GeForce RTX 4090
#  Rank 7 Group 0 Pid 3719 on c2211f46f318 device 7 [0xd1] NVIDIA GeForce RTX 4090
c2211f46f318:3719:3719 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.3<0>
c2211f46f318:3719:3719 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so), using internal implementation
c2211f46f318:3719:3719 [7] NCCL INFO cudaDriverVersion 12060
NCCL version 2.13.4+cuda11.7
c2211f46f318:3719:3740 [0] NCCL INFO Failed to open libibverbs.so[.1]
c2211f46f318:3719:3740 [0] NCCL INFO NET/Socket: Using [0]eth0:172.17.0.3<0>
c2211f46f318:3719:3740 [0] NCCL INFO Using network Socket
c2211f46f318:3719:3741 [1] NCCL INFO Using network Socket
c2211f46f318:3719:3743 [3] NCCL INFO Using network Socket
c2211f46f318:3719:3742 [2] NCCL INFO Using network Socket
c2211f46f318:3719:3745 [5] NCCL INFO Using network Socket
c2211f46f318:3719:3744 [4] NCCL INFO Using network Socket
c2211f46f318:3719:3746 [6] NCCL INFO Using network Socket
c2211f46f318:3719:3747 [7] NCCL INFO Using network Socket
c2211f46f318:3719:3740 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
c2211f46f318:3719:3746 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
c2211f46f318:3719:3742 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c2211f46f318:3719:3747 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
c2211f46f318:3719:3740 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
c2211f46f318:3719:3745 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
c2211f46f318:3719:3743 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
c2211f46f318:3719:3744 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
c2211f46f318:3719:3740 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
c2211f46f318:3719:3741 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0

c2211f46f318:3719:3742 [2] misc/shmutils.cc:62 NCCL WARN Cuda failure 'out of memory'
c2211f46f318:3719:3742 [2] NCCL INFO transport/shm.cc:106 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO transport.cc:33 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO transport.cc:89 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO init.cc:773 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO init.cc:1045 -> 1
c2211f46f318:3719:3742 [2] NCCL INFO group.cc:43 -> 1 [Async thread]
kiskra-nvidia (Member) commented:

It looks like you might be running out of host shared memory. Please see if anything under https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#shared-memory applies to your case.

Also, could you try with a more recent NCCL release? 2.13.4 is 2.5 years old at this point...
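
Regarding the shared-memory point above, a minimal sketch of how the container could be (re)started with more shared memory, assuming Docker is the container runtime; the image name and the 16g size are placeholders, not values from this issue:

# Option 1: give the container a larger /dev/shm (16g is an example value)
docker run --gpus all --shm-size=16g -it <your-image> bash

# Option 2: share the host IPC namespace so /dev/shm is not limited by the container
docker run --gpus all --ipc=host -it <your-image> bash

# Inside the container, verify the size of /dev/shm
df -h /dev/shm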

1556900941lizerui (Author) commented Jan 14, 2025

I currently have a WSL2 Ubuntu environment created on Windows Server 2022, with two Ubuntu containers imported into it:

  1. The first container (i.e., the container above) has 6G of shared memory, NCCL version 2.13.4, and CUDA version 11.7.

  2. The second container (using the official image provided by Qianwen) has 64G of shared memory, NCCL version 2.17.1, and CUDA version 12.1. The same communication failure occurs when more than 2 GPUs are used. The same situation also occurs with 2 GPUs, and when the test cannot complete, the following error appears:

dcf3743f216b: Test CUDA failure common.cu:295 'an illegal memory access was encountered'

.. dcf3743f216b pid 1025: Test failure common.cu:405

.. dcf3743f216b pid 1025: Test failure common.cu:592

.. dcf3743f216b pid 1025: Test failure in all_reduce.cu at line 90

.. dcf3743f216b pid 1025: Test failure common.cu:623

.. dcf3743f216b pid 1025: Test failure common.cu:1078

.. dcf3743f216b pid 1025: Test failure common.cu:891

I'm not sure if this is related to my WSL2 environment.
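
One hedged diagnostic, assuming the standard NCCL environment variables are honored by this build, is to disable NCCL's shared-memory transport and re-run, to see whether the shm path is where the failure occurs:

# Diagnostic sketch: NCCL_SHM_DISABLE and NCCL_DEBUG are documented NCCL env vars;
# if this run succeeds, the failure is likely in the shared-memory transport
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4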

This is my graphics card driver version:

root@dcf3743f216b:/data/nccl-tests-master# nvidia-smi
Tue Jan 14 01:55:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.27                 Driver Version: 560.70         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:16:00.0 Off |                  Off |
| 30%   27C    P8             16W /  450W |    1532MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3C:00.0 Off |                  Off |
| 30%   24C    P8              5W /  450W |    3730MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:49:00.0 Off |                  Off |
| 30%   26C    P8             15W /  450W |   22878MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:54:00.0 Off |                  Off |
| 30%   25C    P8              6W /  450W |   22918MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        On  |   00000000:96:00.0 Off |                  Off |
| 30%   25C    P8             26W /  450W |      15MiB /  24564MiB |      0%      Default |

kiskra-nvidia (Member) commented:

I'm looking at the memory usage in your nvidia-smi output. GPUs 2 and 3 have hardly any free memory left, in spite of not being otherwise busy at all. Any idea what might be going on there? It would explain why you don't see a problem when running with fewer than 3 GPUs, since in that case only GPUs 0 and 1 are being used, and they have sufficient free memory available...

You should be able to confirm this by running with just 2 GPUs (which would normally succeed, correct?) but with CUDA_VISIBLE_DEVICES=2,3.
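
A concrete form of that check, reusing the flags from the command earlier in this issue, might look like:

# Run the 2-GPU test but pin it to GPUs 2 and 3, which show almost no free memory
CUDA_VISIBLE_DEVICES=2,3 ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2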

1556900941lizerui (Author) commented:

I closed the program and retested, reducing the message size to 1 MB. The same problem occurred, but everything was normal when I set the number of GPUs to 2.
root@4ca1fd04d760:/data/nccl-tests-master# ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 4

nThread 1 nGpus 4 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
#  Rank 0 Group 0 Pid 309 on 4ca1fd04d760 device 0 [0x16] NVIDIA GeForce RTX 4090
#  Rank 1 Group 0 Pid 309 on 4ca1fd04d760 device 1 [0x3c] NVIDIA GeForce RTX 4090
#  Rank 2 Group 0 Pid 309 on 4ca1fd04d760 device 2 [0x49] NVIDIA GeForce RTX 4090
#  Rank 3 Group 0 Pid 309 on 4ca1fd04d760 device 3 [0x96] NVIDIA GeForce RTX 4090

4ca1fd04d760: Test NCCL failure common.cu:1005 'unhandled cuda error / '
.. 4ca1fd04d760 pid 309: Test failure common.cu:891

Current graphics card information:
| NVIDIA-SMI 560.27                 Driver Version: 560.70         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:16:00.0 Off |                  Off |
| 30%   26C    P8             16W /  450W |    1557MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3C:00.0 Off |                  Off |
| 30%   24C    P8              5W /  450W |    3732MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:49:00.0 Off |                  Off |
| 30%   26C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:54:00.0 Off |                  Off |
| 30%   24C    P8              5W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        On  |   00000000:96:00.0 Off |                  Off |
| 30%   25C    P8             24W /  450W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        On  |   00000000:BC:00.0 Off |                  Off |
| 30%   26C    P8              7W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        On  |   00000000:C9:00.0 Off |                  Off |
| 30%   27C    P8              9W /  450W |      77MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        On  |   00000000:D1:00.0 Off |                  Off |
| 30%   29C    P8             13W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |

1556900941lizerui (Author) commented:

I am sorry, I pasted the image here. I also found that when the GPU count is set to 3, it works initially, but after I open another container for testing and then return to the current container to test again, the same error occurs.

[image attachment]
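
As an aside, one way to see which processes hold memory on the affected GPUs before and after opening the second container could be the query below; note that process listings can be incomplete under WSL2:

# List compute processes and the GPU memory each one holds
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv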
