For optimized performance, you may need to set additional environment variables depending on the version of libfabric you use.
Setting | Explanation |
---|---|
`NCCL_DEBUG=info` | Set this to get debug information from NCCL; it lets you confirm that NCCL is using EFA and which versions it is using (see the quick check after this table). It prints a lot of debug output, so we advise leaving it off unless you suspect NCCL issues. See `NCCL_DEBUG` for more info. |
`FI_EFA_USE_HUGE_PAGE=0` | Set to 0 when you see `os.fork()` fail with `OSError: Cannot allocate memory`, which typically happens with a multi-process PyTorch data loader. Disabling huge pages causes a minor performance hit, but it is needed to prevent fork failures when the operating system runs out of huge pages. |
`FI_EFA_FORK_SAFE=1` | Not needed for kernel>=5.15. It is still fine to set it, it simply has no effect there. See ref. |
`FI_EFA_USE_DEVICE_RDMA=1` | Do not set for libfabric>=1.18.0 and aws-ofi-nccl>=1.7.0. It is not harmful to set it on p4/p5 with the newer software, you just don't have to. |
`FI_EFA_SET_CUDA_SYNC_MEMOPS=0` | Set this on efa-installer<1.29.1 with nccl>=2.19.0 to prevent the NCCL error `register_rail_mr_buffer:617 NCCL WARN NET/OFI Unable to register memory (type = 2) for device 4. RC: -22, Error: Invalid argument`. |
`FI_EFA_ENABLE_SHM_TRANSFER=1` | Not needed. This is effectively a no-op; the default already enables SHMEM. |
`FI_PROVIDER=efa` | Use for aws-ofi-nccl<=1.5.0 and p4/p5 instances. |
`NCCL_PROTO=simple` | Use for aws-ofi-nccl<=1.5.0 and p4/p5 instances. |
`NCCL_SOCKET_NTHREADS` | Not applicable for EFA. |
`NCCL_SOCKET_IFNAME` | Set this to `en` to cover both p5.48xlarge and p4d(e).24xlarge. For other instances check `ifconfig` to see the active network interface. |
`NCCL_NSOCKS_PERTHREAD` | Not applicable for EFA. |
`NCCL_MIN_CHANNELS=xxx` | We recommend leaving it out so the default is used. For example, on p4d/p4de the number of channels should be 8, which is the minimum for a 4-NIC platform. The reduction message is split by the number of GPUs in the job and then by the number of channels, so having more channels than necessary produces smaller messages, which starves EFA for data. |
`NCCL_BUFFSIZE=xxx` | We recommend leaving it out so the default is used. |
`RDMAV_FORK_SAFE=1` | Do not use. This is an rdma-core environment variable; prefer `FI_EFA_FORK_SAFE` (if it still makes sense for your Linux kernel version). The two look the same but behave very differently, especially on newer kernels, where `RDMAV_FORK_SAFE=1` can break things. |
`NCCL_SHM_USE_CUDA_MEMCPY=1` | Set this when you run NCCL on g6/g5. It gives 2x the performance compared to the default memcpy. |
`RDMAV_*` | Do not use. |
NCCL version | Recommend one of the stable releases. Use cuda>=12.0, nccl>=2.18.0 (recommend at least 2.18.5), and aws-ofi-nccl>=1.7.2 (recommend at least 1.7.3). |
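
As a quick check that NCCL is actually picking up EFA, you can run one of the NCCL performance tests with `NCCL_DEBUG=info` and inspect the log. This is a minimal sketch, assuming nccl-tests has been cloned and built under `./nccl-tests` and that OpenMPI's `mpirun` is available; adjust the path and process counts to your setup.

```bash
# a minimal sketch: single p4d/p4de node with 8 GPUs, one rank per GPU
# (./nccl-tests/build/all_reduce_perf is an assumed path - build nccl-tests first)
mpirun -np 8 -x NCCL_DEBUG=info \
    ./nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
# in the NCCL INFO output, look for lines mentioning the aws-ofi-nccl plugin and the
# Libfabric/EFA provider; if the log only mentions the socket transport, EFA is not in use
```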
The table below shows the number of NVLinks for p4de.24xlarge and p5.48xlarge instances:
Instance | GPU | # NVLinks | Generation |
---|---|---|---|
p4de.24xlarge | A100 80GB | 12 | 3rd |
p5.48xlarge | H100 | 18 | 4th |
`nvidia-smi nvlink -s` is the command to get the status of all NVLinks on each GPU. Below we see this data for GPU 0 of a p4de.24xlarge instance:
```
ubuntu@ip-172-31-35-99:~$ nvidia-smi nvlink -s
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-370ec676-e407-3115-836a-8ebcb3c4f62a)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
```
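
Rather than eyeballing each GPU's output, a small script can count the active links per GPU. A minimal sketch, assuming a p4de.24xlarge (12 NVLinks expected per GPU; it would be 18 on p5.48xlarge):

```bash
# count the NVLinks that report a bandwidth (i.e. are up) for every GPU
nvidia-smi nvlink -s | awk '
  /^GPU/  { if (gpu != "") print gpu, up, "links up"; gpu = $1 " " $2; up = 0 }
  /GB\/s/ { up++ }
  END     { if (gpu != "") print gpu, up, "links up" }'
```

On a healthy p4de.24xlarge this should report 12 links up for each of the 8 GPUs.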
The dcgm command to validate the NVLinks is `sudo dcgmi diag -r 2 -p pcie.gpu_nvlinks_expected_up=<# NVLinks>`. For a p4de.24xlarge instance, this diagnostic looks like:
```
ubuntu@ip-172-31-35-99:~$ dcgmi diag -r 2 -p pcie.gpu_nvlinks_expected_up=12
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.3                                          |
| Driver Version Detected   | 535.104.12                                     |
| GPU Device IDs Detected   | 20b2,20b2,20b2,20b2,20b2,20b2,20b2,20b2        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
```
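
On a multi-node cluster you will usually want to run the same diagnostic on every node. A hedged sketch, assuming a Slurm allocation and that dcgmi is installed on the compute nodes:

```bash
# run the level-2 diagnostic once per allocated node; --label prefixes each
# output line with the task id so you can tell the nodes apart
srun --ntasks-per-node=1 --label \
    dcgmi diag -r 2 -p pcie.gpu_nvlinks_expected_up=12
```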
Putting the table above together: with a recent software stack (libfabric>=1.18.0, aws-ofi-nccl>=1.7.0) you typically only need:

```bash
export FI_EFA_USE_HUGE_PAGE=0
```

With libfabric<1.18.0 or aws-ofi-nccl<1.7.0, additionally enable device RDMA:

```bash
export FI_EFA_USE_HUGE_PAGE=0
export FI_EFA_USE_DEVICE_RDMA=1
```

With aws-ofi-nccl<=1.5.0 on p4/p5 instances, also force the provider and the protocol:

```bash
export FI_EFA_USE_HUGE_PAGE=0
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export NCCL_PROTO=simple
```
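
These variables only need to be present in the environment of every rank. For example, with torchrun (a hedged sketch: `train.py`, the node counts, and the rendezvous endpoint are placeholders):

```bash
# export the settings, then launch; torchrun passes the environment through
# to the worker processes it spawns (train.py and MASTER_ADDR are placeholders)
export FI_EFA_USE_HUGE_PAGE=0
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py
```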