
560.35.03 p2p #22

Open · mylesgoose wants to merge 13 commits into main
Conversation

@mylesgoose (Author)

Adds support for the 560.35.03 NVIDIA driver.

'/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P' 
[/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 7

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU4) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU5) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU6) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU4) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU5) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU6) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU4) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU5) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU6) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU4) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU5) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU6) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU5) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU4) -> NVIDIA GeForce RTX 4090 (GPU6) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU4) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU5) -> NVIDIA GeForce RTX 4090 (GPU6) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU4) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU6) -> NVIDIA GeForce RTX 4090 (GPU5) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.43GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
myles@ubuntu11:~/nccl-tests/build$ nvidia-smi
Sat Oct 19 20:11:14 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0 Off |                  Off |
|  0%   38C    P8             33W /  400W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:02:00.0  On |                  Off |
| 30%   27C    P8             30W /  400W |     138MiB /  24564MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:2B:00.0 Off |                  Off |
|  0%   39C    P8             37W /  400W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 30%   27C    P8             23W /  400W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        On  |   00000000:42:00.0 Off |                  Off |
|  0%   34C    P8             28W /  400W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        On  |   00000000:61:00.0 Off |                  Off |
|  0%   41C    P8              9W /  400W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        On  |   00000000:62:00.0 Off |                  Off |
|  0%   38C    P8             34W /  400W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                             83MiB |
|    1   N/A  N/A      5169      G   /usr/bin/gnome-shell                           10MiB |
|    1   N/A  N/A      6068      G   ...erProcess --variations-seed-version          7MiB |
|    1   N/A  N/A      8460      G   ...274d7fcbbd43f748aa61176769d7268cf91          9MiB |
|    2   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                              4MiB |
|    4   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                              4MiB |
|    5   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                              4MiB |
|    6   N/A  N/A      4752      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
myles@ubuntu11:~/nccl-tests/build$ NCCL_P2P_LEVEL=SYS ./alltoall_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   9357 on   ubuntu11 device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid   9357 on   ubuntu11 device  1 [0x02] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             1     float    none      -1     9.43    0.00    0.00      0     8.80    0.00    0.00    N/A
          16             2     float    none      -1     9.40    0.00    0.00      0     9.05    0.00    0.00    N/A
          32             4     float    none      -1     9.24    0.00    0.00      0     9.43    0.00    0.00    N/A
          64             8     float    none      -1    12.44    0.01    0.00      0     9.01    0.01    0.00    N/A
         128            16     float    none      -1     9.00    0.01    0.01      0     8.99    0.01    0.01    N/A
         256            32     float    none      -1     9.00    0.03    0.01      0     8.84    0.03    0.01    N/A
         512            64     float    none      -1     9.18    0.06    0.03      0     9.25    0.06    0.03    N/A
        1024           128     float    none      -1     9.14    0.11    0.06      0     8.87    0.12    0.06    N/A
        2048           256     float    none      -1     9.33    0.22    0.11      0     8.86    0.23    0.12    N/A
        4096           512     float    none      -1     9.24    0.44    0.22      0     9.47    0.43    0.22    N/A
        8192          1024     float    none      -1     9.64    0.85    0.42      0    13.74    0.60    0.30    N/A
       16384          2048     float    none      -1    10.04    1.63    0.82      0     9.96    1.65    0.82    N/A
       32768          4096     float    none      -1    10.96    2.99    1.50      0    10.64    3.08    1.54    N/A
       65536          8192     float    none      -1    12.37    5.30    2.65      0    12.30    5.33    2.66    N/A
      131072         16384     float    none      -1    16.00    8.19    4.10      0    16.19    8.09    4.05    N/A
      262144         32768     float    none      -1    20.08   13.06    6.53      0    20.04   13.08    6.54    N/A
      524288         65536     float    none      -1    27.68   18.94    9.47      0    27.49   19.07    9.54    N/A
     1048576        131072     float    none      -1    44.72   23.45   11.72      0    44.92   23.34   11.67    N/A
     2097152        262144     float    none      -1    71.54   29.31   14.66      0    71.27   29.43   14.71    N/A
     4194304        524288     float    none      -1    112.4   37.31   18.66      0    109.2   38.40   19.20    N/A
     8388608       1048576     float    none      -1    188.1   44.58   22.29      0    210.4   39.88   19.94    N/A
    16777216       2097152     float    none      -1    367.6   45.64   22.82      0    416.1   40.32   20.16    N/A
    33554432       4194304     float    none      -1    702.0   47.80   23.90      0    684.6   49.01   24.51    N/A
    67108864       8388608     float    none      -1   1422.6   47.17   23.59      0   1338.0   50.15   25.08    N/A
   134217728      16777216     float    none      -1   2759.8   48.63   24.32      0   2661.1   50.44   25.22    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.48508 
#

@henriklied commented Oct 30, 2024

@mylesgoose I'm able to install the driver fine, but I can't seem to replicate your simpleP2P results. I'm running 4x RTX 4090s on Ubuntu 24.04 LTS Server. I used cuda_12.6.2_560.35.03_linux.run as the CUDA installer, cloned the repo, applied your PR, and then copied over the install.sh from @tinygrad's main repo to install. No errors, but two warnings:

The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0
You are using: cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0
SIGN /lib/modules/6.8.0-47-generic/kernel/drivers/video/nvidia-drm.ko
SIGN /lib/modules/6.8.0-47-generic/kernel/drivers/video/nvidia.ko
SIGN /lib/modules/6.8.0-47-generic/kernel/drivers/video/nvidia-uvm.ko
DEPMOD /lib/modules/6.8.0-47-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.

Output from nvidia-smi topo:

	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	NODE	NODE	0-47	0		N/A
GPU1	PHB	 X 	NODE	NODE	0-47	0		N/A
GPU2	NODE	NODE	 X 	NODE	0-47	0		N/A
GPU3	NODE	NODE	NODE	 X 	0-47	0		N/A

simpleP2P:

[cuda-samples/bin/x86_64/linux/release/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : No
Two or more GPUs with Peer-to-Peer access capability are required for cuda-samples/bin/x86_64/linux/release/simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

Any thoughts?

@henriklied

I was able to get a step further, but simpleP2P fails:

[../cuda-samples/bin/x86_64/linux/release/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
CUDA error at simpleP2P.cu:129 code=205(cudaErrorMapBufferObjectFailed) "cudaDeviceEnablePeerAccess(gpuid[1], 0)"

@mylesgoose (Author)

@henriklied What do you mean by replicating my results? Are you able to get p2p occurring at all? Is the speed slow? Can you provide more information?

I found a simpler way to install it on Ubuntu 24.04 or 24.10. You install the correct open NVIDIA driver via apt. You then compile the source code, ensuring you clone the correct branch for the driver you have, let's say the 560 one. You then check that the stock 560 driver is running fine; it works well even with the Wayland display manager, whereas the p2p one will only really work well with X11 (glitches with GNOME, Nautilus, the system monitor, etc.). You then compile the driver using C++14 etc., matching that kernel. Then search your system for nvidia.ko files: you should find one set under /usr/lib/modules/$(uname -r)/ or thereabouts, the apt-installed ones, and you will also find your p2p ones in your source folder. Then use the terminal to copy the original module files from the nvidia-560 folder to a backup place, copy in your modified ones matching those file names, and reboot, or unload the modules and then reload them and your display manager.

I set up a script that does this whenever I need p2p, or when I just want the standard driver back: run the script and it replaces the modules and reloads in about 30 seconds. This way apt thinks it still has the original driver, so it does not try to update it all the time, and you know for certain you replaced the driver with your modified ones (check the timestamps). Before you reboot, if you're running a display manager, make sure it's on X11/Xorg (e.g. XFCE, or KDE on SDDM); there is an issue, on my system at least, with Wayland. I think Wayland does not like the VRAM being mapped globally.

Then you can test simply with nvidia-smi's p2p status. Have a look at the issues section on geohot's page; there are some discussions that show the simple nvidia-smi commands to check p2p. If p2p is working, then perhaps you mean your speeds? That relates to NCCL exports and setting the p2p level to SYS or PHB or whatever it is. 🤔 You likely have two sets of the same modules trying to load.
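For reference, a minimal sketch of the kind of swap-and-reload script I mean; the paths (kernel version, source tree, module list) are assumptions to adapt to your own system:

```bash
#!/bin/bash
# Hypothetical sketch: swap the apt/DKMS-installed NVIDIA modules for the
# patched p2p builds and reload the driver stack without a full reboot.
set -e
KVER=$(uname -r)
DST=/usr/lib/modules/$KVER/updates/dkms          # apt/DKMS-installed modules
SRC=$HOME/open-gpu-kernel-modules/kernel-open    # freshly built p2p modules

# Drop to a text console so the display manager releases the GPUs, then
# unload the stack (|| true: some modules may not be loaded).
sudo systemctl isolate multi-user.target
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia || true

for m in nvidia nvidia-drm nvidia-modeset nvidia-peermem nvidia-uvm; do
  sudo cp "$SRC/$m.ko" "$DST/$m.ko"
done
sudo depmod "$KVER"

# Reload the modules and bring the (X11) session back.
sudo modprobe nvidia nvidia_modeset nvidia_drm nvidia_uvm
sudo systemctl isolate graphical.target
```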

@henriklied

Thanks for the quick reply, @mylesgoose!

I think the issue is that the machine has not enabled large BAR support in the BIOS, so I will attempt to enable this when I get physical access to the machine in a couple of days. My error seems to correspond with this issue. So I will try that first.
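Before getting into the BIOS, the BAR1 aperture can be sanity-checked from software. A sketch (the bus ID is just an example, taken from the nvidia-smi output earlier in the thread):

```bash
# With large/resizable BAR enabled, BAR1 should cover roughly the whole
# 24 GiB of VRAM on a 4090; without it you typically see only 256 MiB.
nvidia-smi -q -d MEMORY | grep -A3 "BAR1"

# Raw PCI view; Region 1 is the BAR1 framebuffer aperture on these cards.
sudo lspci -s 01:00.0 -vv | grep -i "region 1"
```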

@mylesgoose (Author)

> Thanks for the quick reply, @mylesgoose!
>
> I think the issue is that the machine has not enabled large BAR support in the BIOS, so I will attempt to enable this when I get physical access to the machine in a couple of days. My error seems to correspond with this issue. So I will try that first.

Looks to me like you're pretty close. Must be that large BAR thing, as it now says p2p is enabled. I don't know why, but it did not show your messages above in full beforehand. 😕

@henriklied

I was able to enable large BAR, but now I'm getting a different error. Any thoughts, @mylesgoose?

Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.48GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!

@mylesgoose (Author) commented Nov 1, 2024

@henriklied I think there was an update to this repo that fixed that issue; you must have used the older one. Which repo or files did you use to install? 1ca8b01
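If it helps, a quick way to check whether your checkout already contains that fix (assuming you cloned the 560.35.03-p2p branch of my fork):

```bash
cd open-gpu-kernel-modules
git fetch origin
# Is commit 1ca8b01 an ancestor of what you built?
git merge-base --is-ancestor 1ca8b01 HEAD && echo "fix present" || echo "fix missing"
# If missing, update and rebuild:
git pull origin 560.35.03-p2p && make clean && make modules -j"$(nproc)"
```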

@mylesgoose (Author)

@henriklied how did you get on?

@ewhacc commented Nov 14, 2024

@mylesgoose Thanks. I succeeded with the 560 branch and it seems to be working well. But here's a strange thing:

1->2 is OK, but 2->1 gets half the bandwidth with p2p enabled. It's fine with p2p disabled.

Device: 0, NVIDIA RTX 6000 Ada Generation, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 4090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1        2
     0  1405.44    21.18    21.60
     1    21.90  1645.17    21.77
     2    21.77    21.80  1735.15
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0        1        2
     0  1318.57    26.19    26.24
     1    26.21  1582.68    26.23
     2    11.49    10.16  1682.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1        2
     0  2146.29    34.87    35.57
     1    36.29  1704.16    36.21
     2    36.30    36.25  1785.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0        1        2
     0  2192.60    52.37    52.45
     1    52.45  1751.68    52.45
     2    22.99    20.32  1705.79

@mylesgoose (Author)

@ewhacc that's a weird one. 😳 Are all your cards identical? What happens if you run the test again directly after the first run, or with the card in a different slot, or if you switch cards? Exporting the NCCL debug level to INFO might help figure it out.
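For example, something like the invocation I use later in this thread, with debug logging turned on:

```bash
# Verbose NCCL logging shows which transport each GPU pair actually uses
# (look for "via P2P/direct pointer" vs. a shared-memory fallback).
NCCL_P2P_LEVEL=SYS NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```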

@ewhacc commented Nov 15, 2024

@mylesgoose Yeah, it's weird. One card is not identical, but the problem is not on that card. Anyway, I'm going to test again with only the 4090s. Also, changing slots has an effect, as you said.

@ewhacc commented Nov 15, 2024

@mylesgoose It works great after removing the non-4090. Strangely, that card affected the p2p of the other two 4090s.

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1
     0  1452.48    36.04
     1    36.40  1775.57
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0        1
     0  1452.98    52.43
     1    52.45  1699.52

Mixing cards seems to mess up p2p. I have experienced it even between an RTX 6000 Ada and an A6000.

@mylesgoose (Author)

@ewhacc glad it's working for ya, bud.

@ewhacc commented Nov 15, 2024

@henriklied Did you fix the problem? I'm now getting the same problem.
@mylesgoose Weird. p2pBandwidthLatencyTest was OK, but simpleP2P fails, and actual p2p training failed too. It was fine on 550.54; now both 560 and 550 fail.

@mylesgoose (Author) commented Nov 15, 2024

Can you provide more information, like which program you were using and which driver is loaded? Has apt updated your driver? What does nvidia-smi topo -p2p rw return? @ewhacc, how did you install the driver? I think that .deb file has an issue; you have to compile from source. Let me know.
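Two quick checks that usually settle the "which driver is actually loaded" question (standard paths; adjust if yours differ):

```bash
# Version string of the module the kernel is running right now:
cat /proc/driver/nvidia/version

# Version of the nvidia.ko on disk that would be loaded at next boot:
modinfo -F version nvidia
```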

@mylesgoose (Author) commented Nov 15, 2024

I will provide in full the steps I take to install. I have set up a 4-GPU system to test.

Run the Ubuntu 22.04 installer and select safe-mode graphics. I have Secure Boot disabled in the BIOS, IOMMU off, and large BAR support enabled. I select the full installation, install third-party drivers etc., and install alongside the other OS on my drive; type a username and so on. I am plugged into my motherboard's VGA port with a simple display, not using the 4 cards present in the PCIe slots (one single-slot RTX 4090 and two Suprim water-cooled cards, plugged directly into the motherboard). I feel nervous about not paying Microsoft, so while it's installing I purchase a key for Microsoft Windows 11 Pro. I reboot the computer and make sure the BIOS selects the right OS. I am asked if I want to upgrade to 24.04; I decline.

I run the following commands:

sudo apt update
sudo apt upgrade

I do a Google search for the NVIDIA CUDA Toolkit 12.6.2 and use the network installer instructions to add the NVIDIA repo to apt. I don't install the toolkit yet. I update apt, search apt for nvidia-open, and see that it's available. I type:

sudo apt install nvidia-open-560 --install-suggests

During the install it tells you the locations where it compiles the kernel modules to. In my case that is /usr/lib/modules/6.8.0-48-generic/updates/dkms/, and I can confirm there are five files there: nvidia.ko, nvidia-drm.ko, nvidia-modeset.ko, nvidia-peermem.ko, and nvidia-uvm.ko. I perform a system-wide search for nvidia.ko and find a second file with the same name at /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/. Both files are 13.7 MB. This step also installs gcc-11, build-essential, etc. It also means apt will see your driver as installed and won't try to update or replace it all the time, and it removes any other versions.
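As a sketch, that system-wide search can be a one-liner:

```bash
# List every NVIDIA kernel module apt/DKMS left on the system; expect one
# set under /usr/lib/modules/<kernel> and one under /var/lib/dkms.
sudo find /usr/lib/modules /var/lib/dkms -name 'nvidia*.ko*' -exec ls -lh {} +
```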
sudo apt autoremove
sudo apt install git
git clone -b 560.35.03-p2p https://github.com/mylesgoose/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j$(nproc)
At this point the compilation fails with errors, so I type:
sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 60
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 60
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
make clean
Then run make modules again, as above. This time it compiles without errors.

I noted the module locations above when we installed the open driver with apt from the NVIDIA repo. I backed up the original modules (the .ko files) and renamed them with a .ko.bak suffix, for example:
# Backup the original modules in DKMS installation directory
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-drm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-drm.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-modeset.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-modeset.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-peermem.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-peermem.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-uvm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-uvm.ko.bak

# Backup the original modules in DKMS source directory
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-drm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-drm.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-modeset.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-modeset.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-peermem.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-peermem.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-uvm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-uvm.ko.bak

# Copy the new modules from the open-gpu folder to DKMS installation directory
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-modeset.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-peermem.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-uvm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/

# Copy the new modules from the open-gpu folder to DKMS source directory
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-modeset.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-peermem.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-uvm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
I copied the compiled modules we just made into the two locations reported by the apt installer, and confirmed that the modules copied from our p2p folder exist in both locations, with nvidia.ko at 26 MB.
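To know for certain the swap took, rather than trusting sizes and timestamps alone, checksums can be compared; a sketch with my paths:

```bash
# The installed module should be byte-identical to the one we just built.
md5sum /home/myles/open-gpu-kernel-modules/kernel-open/nvidia.ko \
       /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia.ko
```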
I reboot and type this in the terminal:
nvidia-smi topo -p2p rw

 	GPU0	GPU1	GPU2	GPU3	
 GPU0	X	OK	OK	OK	
 GPU1	OK	X	OK	OK	
 GPU2	OK	OK	X	OK	
 GPU3	OK	OK	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
and then
sudo apt-get -y install cuda-toolkit-12-6
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
sudo apt install libgl1-mesa-dev libglu1-mesa-dev
source ~/.bashrc
sudo apt install libnccl2 libnccl-dev
sudo apt-get install cmake
sudo apt-get install freeglut3-dev
sudo apt-get install libfreeimage-dev
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev
wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-1.3.296-jammy.list https://packages.lunarg.com/vulkan/1.3.296/lunarg-vulkan-1.3.296-jammy.list
sudo apt update
sudo apt install vulkan-sdk
make -j$(nproc)


@mylesgoose (Author) commented Nov 15, 2024

'/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P' 
[/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 5.88GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!
myles@myles-MC62-G40-00:~/cuda-samples$ dmesg | grep -e DMAR -e IOMMU
dmesg: read kernel buffer failed: Operation not permitted
myles@myles-MC62-G40-00:~/cuda-samples$ sudo dmesg | grep -e DMAR -e IOMMU
[sudo] password for myles: 
[    0.829972] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[    0.850474] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    0.864277] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[    0.876239] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.897945] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    0.897960] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    0.897974] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    0.897988] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[ 2238.543680] AMD-Vi: IOMMU Event log restarting
[ 2238.551089] AMD-Vi: IOMMU Event log restarting
[ 2238.558182] AMD-Vi: IOMMU Event log restarting
[ 2238.566681] AMD-Vi: IOMMU Event log restarting
[ 2238.573512] AMD-Vi: IOMMU Event log restarting
[ 2238.581588] AMD-Vi: IOMMU Event log restarting
[ 2238.590563] AMD-Vi: IOMMU Event log restarting
[ 2238.596884] AMD-Vi: IOMMU Event log restarting
[ 2238.604090] AMD-Vi: IOMMU Event log restarting
[ 2238.611923] AMD-Vi: IOMMU Event log restarting
myles@myles-MC62-G40-00:~/cuda-samples$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-48-generic root=UUID=67e829f9-2373-4f35-9e25-73171b053f04 ro quiet splash vt.handoff=7
myles@myles-MC62-G40-00:~/cuda-samples$ ls /sys/kernel/iommu_groups
0   11  14  17  2   22  25  28  30  33  36  39  41  44  47  5   52  55  58  60  63  66  69  71  8
1   12  15  18  20  23  26  29  31  34  37  4   42  45  48  50  53  56  59  61  64  67  7   72  9
10  13  16  19  21  24  27  3   32  35  38  40  43  46  49  51  54  57  6   62  65  68  70  73

git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build

And as you can see, the above test fails because IOMMU was enabled. So I rebooted, disabled IOMMU in the BIOS, and ran the test again:
'/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P'
[/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...

Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 12.25GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
myles@myles-MC62-G40-00:~$
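For reference, a quick way to confirm the IOMMU is actually off after a change like this; the GRUB alternative is an assumption to verify for your own platform:

```bash
# With the IOMMU disabled there should be no groups here:
ls /sys/kernel/iommu_groups

# Alternative to the BIOS toggle on AMD boards: add "amd_iommu=off" to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
#   sudo update-grub && sudo reboot
cat /proc/cmdline   # verify the parameter took effect
```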

@mylesgoose (Author)

@ewhacc @henriklied I reproduced your errors above and then fixed them by following the procedure above.

@mylesgoose (Author)

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

NCCL_P2P_LEVEL=SYS NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   6742 on myles-MC62-G40-00 device  0 [0x02] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid   6742 on myles-MC62-G40-00 device  1 [0x41] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid   6742 on myles-MC62-G40-00 device  2 [0x42] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid   6742 on myles-MC62-G40-00 device  3 [0x61] NVIDIA GeForce RTX 4090
myles-MC62-G40-00:6742:6742 [0] NCCL INFO Bootstrap : Using enp100s0:192.168.1.80<0>
myles-MC62-G40-00:6742:6742 [0] NCCL INFO cudaDriverVersion 12060
myles-MC62-G40-00:6742:6742 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NET/IB : No device found.
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NET/Socket : Using [0]enp100s0:192.168.1.80<0>
myles-MC62-G40-00:6742:6766 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6768 [2] NCCL INFO ncclCommInitAll comm 0x5b9898450a10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 42000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6767 [1] NCCL INFO ncclCommInitAll comm 0x5b989840f4b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6766 [0] NCCL INFO ncclCommInitAll comm 0x5b98983cdff0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 2000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6769 [3] NCCL INFO ncclCommInitAll comm 0x5b9898491f70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Bootstrap timings total 0.001024 (create 0.000054, send 0.000170, recv 0.000572, ring 0.000102, delay 0.000000)
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Bootstrap timings total 0.001077 (create 0.000059, send 0.000183, recv 0.000488, ring 0.000135, delay 0.000000)
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Bootstrap timings total 0.001053 (create 0.000038, send 0.000133, recv 0.000436, ring 0.000116, delay 0.000000)
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Bootstrap timings total 0.001120 (create 0.000064, send 0.000199, recv 0.000582, ring 0.000085, delay 0.000001)
myles-MC62-G40-00:6742:6769 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
myles-MC62-G40-00:6742:6769 [3] NCCL INFO NVLS multicast support is not available on dev 3
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-MC62-G40-00:6742:6767 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-MC62-G40-00:6742:6768 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-MC62-G40-00:6742:6766 [0] NCCL INFO comm 0x5b98983cdff0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
myles-MC62-G40-00:6742:6769 [3] NCCL INFO comm 0x5b9898491f70 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 00/02 : 0 1 2 3
myles-MC62-G40-00:6742:6767 [1] NCCL INFO comm 0x5b989840f4b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 01/02 : 0 1 2 3
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
myles-MC62-G40-00:6742:6766 [0] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6768 [2] NCCL INFO comm 0x5b9898450a10 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
myles-MC62-G40-00:6742:6768 [2] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6769 [3] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
myles-MC62-G40-00:6742:6767 [1] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6770 [0] NCCL INFO [Proxy Service] Device 0 CPU core 32
myles-MC62-G40-00:6742:6777 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 111
myles-MC62-G40-00:6742:6775 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 89
myles-MC62-G40-00:6742:6774 [1] NCCL INFO [Proxy Service] Device 1 CPU core 55
myles-MC62-G40-00:6742:6771 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 107
myles-MC62-G40-00:6742:6773 [2] NCCL INFO [Proxy Service] Device 2 CPU core 16
myles-MC62-G40-00:6742:6772 [3] NCCL INFO [Proxy Service] Device 3 CPU core 65
myles-MC62-G40-00:6742:6776 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 79
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6778 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 116
myles-MC62-G40-00:6742:6780 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 16
myles-MC62-G40-00:6742:6781 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 65
myles-MC62-G40-00:6742:6779 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 35
myles-MC62-G40-00:6742:6767 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6767 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6766 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6769 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6769 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6768 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-MC62-G40-00:6742:6769 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-MC62-G40-00:6742:6769 [3] NCCL INFO ncclCommInitAll comm 0x5b9898491f70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 4 total 0.40 (kernels 0.32, alloc 0.03, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
myles-MC62-G40-00:6742:6767 [1] NCCL INFO ncclCommInitAll comm 0x5b989840f4b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6766 [0] NCCL INFO ncclCommInitAll comm 0x5b98983cdff0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 2000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 4 total 0.40 (kernels 0.31, alloc 0.04, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 4 total 0.40 (kernels 0.31, alloc 0.03, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.02, rest 0.00)
myles-MC62-G40-00:6742:6768 [2] NCCL INFO ncclCommInitAll comm 0x5b9898450a10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 42000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 4 total 0.40 (kernels 0.31, alloc 0.03, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    16.71    0.00    0.00      0    15.98    0.00    0.00      0
          16             4     float     sum      -1    16.60    0.00    0.00      0    16.31    0.00    0.00      0
          32             8     float     sum      -1    16.13    0.00    0.00      0    16.44    0.00    0.00      0
          64            16     float     sum      -1    16.21    0.00    0.01      0    16.28    0.00    0.01      0
         128            32     float     sum      -1    16.52    0.01    0.01      0    17.47    0.01    0.01      0
         256            64     float     sum      -1    17.69    0.01    0.02      0    17.78    0.01    0.02      0
         512           128     float     sum      -1    17.77    0.03    0.04      0    17.69    0.03    0.04      0
        1024           256     float     sum      -1    17.42    0.06    0.09      0    17.47    0.06    0.09      0
        2048           512     float     sum      -1    27.93    0.07    0.11      0    17.66    0.12    0.17      0
        4096          1024     float     sum      -1    18.02    0.23    0.34      0    17.87    0.23    0.34      0
        8192          2048     float     sum      -1    18.00    0.46    0.68      0    18.00    0.46    0.68      0
       16384          4096     float     sum      -1    17.70    0.93    1.39      0    17.73    0.92    1.39      0
       32768          8192     float     sum      -1    18.78    1.74    2.62      0    18.95    1.73    2.59      0
       65536         16384     float     sum      -1    25.80    2.54    3.81      0    25.91    2.53    3.79      0
      131072         32768     float     sum      -1    42.44    3.09    4.63      0    42.26    3.10    4.65      0
      262144         65536     float     sum      -1    66.25    3.96    5.94      0    66.16    3.96    5.94      0
      524288        131072     float     sum      -1    94.11    5.57    8.36      0    94.16    5.57    8.35      0
     1048576        262144     float     sum      -1    152.7    6.87   10.30      0    157.6    6.65    9.98      0
     2097152        524288     float     sum      -1    271.8    7.72   11.58      0    275.3    7.62   11.43      0
     4194304       1048576     float     sum      -1    510.6    8.21   12.32      0    510.9    8.21   12.31      0
     8388608       2097152     float     sum      -1   1000.9    8.38   12.57      0    995.7    8.42   12.64      0
    16777216       4194304     float     sum      -1   1993.2    8.42   12.63      0   1979.1    8.48   12.72      0
    33554432       8388608     float     sum      -1   3966.7    8.46   12.69      0   3942.4    8.51   12.77      0
    67108864      16777216     float     sum      -1   7888.2    8.51   12.76      0   7865.1    8.53   12.80      0
   134217728      33554432     float     sum      -1    15705    8.55   12.82      0    15684    8.56   12.84      0
myles-MC62-G40-00:6742:6742 [0] NCCL INFO comm 0x5b98983cdff0 rank 0 nranks 4 cudaDev 0 busId 2000 - Destroy COMPLETE
myles-MC62-G40-00:6742:6742 [3] NCCL INFO comm 0x5b9898491f70 rank 3 nranks 4 cudaDev 3 busId 61000 - Destroy COMPLETE
myles-MC62-G40-00:6742:6742 [2] NCCL INFO comm 0x5b9898450a10 rank 2 nranks 4 cudaDev 2 busId 42000 - Destroy COMPLETE
myles-MC62-G40-00:6742:6742 [1] NCCL INFO comm 0x5b989840f4b0 rank 1 nranks 4 cudaDev 1 busId 41000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 5.02577 
#

@mylesgoose
Author

./build/all_reduce_perf -g 2
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   8411 on myles-MC62-G40-00 device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid   8411 on myles-MC62-G40-00 device  1 [0x42] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum      -1   1390.4   24.13   24.13      0   1371.8   24.46   24.46      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.2967 
#

myles@myles-MC62-G40-00:~/nccl-tests$ 

@mylesgoose
Author

./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 42, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 912.14  11.56 
     1  11.41 903.18 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 914.28  26.33 
     1  26.34 933.39 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 919.18  11.42 
     1  11.44 915.08 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 919.34  51.92 
     1  52.01 913.00 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.43  10.43 
     1  10.50   1.38 

   CPU     0      1 
     0   2.37   6.78 
     1   6.72   2.30 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.44   0.96 
     1   0.98   1.39 

   CPU     0      1 
     0   2.31   1.99 
     1   2.05   2.27 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
myles@myles-MC62-G40-00:~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ 

@ewhacc

ewhacc commented Nov 16, 2024

@mylesgoose Omg, thanks a ton! My setup & methods are about the same, except I'm trying Ubuntu 24.04 now.
I didn't know iommu should be disabled. Let me try tomorrow, and I will keep you informed.
BTW, I am also building with a Supreme water block :)
Thanks again!

@ewhacc

ewhacc commented Nov 16, 2024

@mylesgoose Yeah, iommu is on. That's why I succeeded with 550 before but have now failed with both 550 & 560.
I had certainly disabled iommu before; this time I forgot about it & checked only the BAR setting.
Sorry for making you spend so long checking. I appreciate your time and effort.

$ sudo dmesg | grep -e DMAR -e IOMMU
[ 2.149951] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.159899] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.165749] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.172119] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported

@mylesgoose
Author

mylesgoose commented Nov 16, 2024

> @mylesgoose Yeah, iommu is on. That's why I succeeded with 550 before but have now failed with both 550 & 560.
> I had certainly disabled iommu before; this time I forgot about it & checked only the BAR setting.
> Sorry for making you spend so long checking. I appreciate your time and effort.
>
> $ sudo dmesg | grep -e DMAR -e IOMMU
> [ 2.149951] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
> [ 2.159899] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
> [ 2.165749] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
> [ 2.172119] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported

@ewhacc IOMMU off, then. How are you installing it? Do you download the source and compile? Also, it works with Ubuntu 24.04 and 24.10, but those versions have switched from X11 and gdm3 to Wayland, and Wayland seems to do some security checks on the GPU's memory: it does not render windows correctly, if at all. If you're using Ubuntu Server you won't notice, but on a desktop you need to install sddm as the display manager and use an X11 desktop environment (I think it's called proton3 or something). Anything from GNOME does not work correctly on Wayland. Don't know why.
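
If anyone needs the exact steps, switching the display manager on Ubuntu is roughly this (package names as in the Ubuntu repos; pick whichever X11 desktop you prefer):

sudo apt install sddm
sudo dpkg-reconfigure sddm   # select sddm as the default display manager, then reboot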

@mylesgoose
Author

mylesgoose commented Nov 16, 2024

The output from your sudo dmesg | grep -e DMAR -e IOMMU command confirms that IOMMU is enabled in your BIOS.
Here's a breakdown:

  • AMD-Vi: This indicates that your system is using AMD's virtualization technology, which includes the IOMMU.
  • IOMMU performance counters supported: This shows that the IOMMU is not only enabled but also has performance counters available, meaning it's fully functional. You need to DISABLE IOMMU: it must be off in the BIOS. You can also turn it off from the GRUB config with a kernel parameter, amd_iommu=off on AMD (or intel_iommu=off on Intel), as sketched below.
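
For reference, a minimal sketch of the GRUB route on an AMD box (the sed one-liner is just illustrative; editing /etc/default/grub by hand works the same):

# append amd_iommu=off to the default kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/GRUB_CMDLINE_LINUX_DEFAULT="\1 amd_iommu=off"/' /etc/default/grub
sudo update-grub
sudo reboot
# after the reboot, re-run the check; the AMD-Vi lines should be gone
sudo dmesg | grep -e DMAR -e IOMMU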

@ewhacc

ewhacc commented Nov 16, 2024

@mylesgoose I have disabled iommu in grub. I will disable iommu in the BIOS too tomorrow.

Everything works perfectly! p2pBandwidthLatencyTest, simpleP2P, nvbandwidth, nccl-tests.
Also tested my qLoRA training code. Thanks so much.

Yes, I downloaded the source & compiled:
sudo apt-get install nvidia-open-560
then removed the stock modules and installed the new modules built from the source.
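
Roughly, that flow looks like this (the clone URL is left as a placeholder for the patched fork from this PR; the make targets are the ones documented for the open kernel modules):

git clone <patched open-gpu-kernel-modules fork from this PR>
cd open-gpu-kernel-modules
make modules -j$(nproc)
sudo make modules_install -j$(nproc)
sudo depmod
sudo reboot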

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1446.76  36.11
     1  36.12 1791.86
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1468.34  50.88
     1  50.84 1824.55

./all_reduce_perf -g 2
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1624 on      asrok device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid   1624 on      asrok device  1 [0x41] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    33554432       8388608     float     sum      -1   1399.3   23.98   23.98      0   1396.4   24.03   24.03      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.0037
#

@ewhacc

ewhacc commented Nov 16, 2024

I have already done it. Thank you for the explanation.
I just forgot about iommu, though I know how to disable it ^^

@mylesgoose
Author

mylesgoose commented Nov 16, 2024

@ewhacc mate, so glad you got it working. Well done. It's also a good idea to keep the original modules from apt: you can make a script that swaps the p2p modules back for the apt ones and reboots, so things that use secure RAM, like Steam games or Wayland, can function as normal. That way you have one script to copy the files from the backup to their original location, and one script to install the p2p ones when you need them for machine learning, as sketched below. Can you edit your last message to put shell fences around the terminal output, so it looks tidy for someone else to read?
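
Something like this hypothetical pair of snippets would do it (the module directory is looked up with modinfo because it varies by distro and driver version; on newer Ubuntus the files may be compressed, hence the .ko* glob):

# one-time backup, taken while the stock apt modules are installed
mkdir -p ~/nvidia-apt-backup
cp -a "$(dirname "$(modinfo -n nvidia)")"/nvidia*.ko* ~/nvidia-apt-backup/

# restore-apt.sh: put the stock modules back for Wayland / Steam, then reboot
sudo cp -a ~/nvidia-apt-backup/nvidia*.ko* "$(dirname "$(modinfo -n nvidia)")"/
sudo depmod -a
sudo reboot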

@ewhacc

ewhacc commented Nov 16, 2024

@mylesgoose I didn't keep the original modules. This is only for LLM training, and I don't have X either. :)
Oh, and I confirm it works well with 24.04.
I will try to edit the last message with the nccl output; I just didn't want markdown.

I noticed a slight bandwidth drop after disabling iommu in grub. I will check again after disabling iommu in the BIOS too.

52.43 -> 50.88 (GB/s)
