
Poor performance with NVLink #57

Open
froody opened this issue Sep 24, 2020 · 2 comments

Comments

@froody

froody commented Sep 24, 2020

I was running some benchmarks with torch-ucc using xccl for collectives, and I noticed very bad performance compared to NCCL. See numbers here: https://gist.github.com/froody/a86a5b2c5d9f46aedba7e930f4b4e08d

It's possible this is due to a misconfiguration: I built xccl with CUDA and UCX support, but without SHARP or VMC support. My question is: is xccl expected to properly utilize NVLink when available (in this case, on a DGX-1 doing an all-reduce across all 8 GPUs)?

I also noticed while running the benchmarks that CPU utilization was very high for all workers, which seemed to be due to high-frequency polling.

Also, as you can see in the output, ucc fails while trying to reduce a 2 GB tensor, whereas nccl fails while trying to reduce an 8 GB tensor. This could be indicative of a memory leak somewhere.

Repro steps:
Run benchmark here: https://gist.github.com/froody/01ed6ce8d6ab72bd868431d793591379
Use BACKEND=ucc or BACKEND=nccl to select backend
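As a rough illustration of the backend-selection step above, a benchmark script can read the BACKEND environment variable and pass it to PyTorch's process-group initialization. This is a minimal sketch, not the gist's actual code; the default value and validation are assumptions.

```python
import os

# Select the collective backend from the BACKEND environment variable,
# mirroring the repro steps (BACKEND=ucc or BACKEND=nccl).
backend = os.environ.get("BACKEND", "nccl")
if backend not in ("ucc", "nccl"):
    raise ValueError(f"unsupported backend: {backend}")

# In the benchmark this string would then be passed to
# torch.distributed.init_process_group(backend=backend, ...).
```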

hardware: DGX-1, Driver Version: 418.116.00
cuda: 10.1
pytorch: 1.6.0
ucx: 1.9.0
torch-ucc: a277d7da24ae6e8a40bda658d0f0d4e06fcadb8b
xccl: 2e97986

@srinivas212
Contributor

Is affinitizing the MPI rank to a GPU expected to help?

@froody
Author

froody commented Sep 28, 2020

Do you mean torch.cuda.set_device()? If so, then yes. I also changed torch_ucc to use cudaGetDevice in ProcessGroupUCC::progress_loop instead of hard-coding device 0.
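The rank-to-GPU affinity fix described above can be sketched as follows. This is an assumed illustration, not the actual torch-ucc patch: the device_for_rank helper and the reliance on a launcher-provided LOCAL_RANK variable are hypothetical.

```python
import os

def device_for_rank() -> int:
    # Hypothetical helper: derive the CUDA device index from the
    # LOCAL_RANK environment variable that launchers such as torchrun
    # or mpirun wrappers typically set (assumption; falls back to 0).
    return int(os.environ.get("LOCAL_RANK", "0"))

# With torch available, each worker would call
# torch.cuda.set_device(device_for_rank()) before creating the process
# group, so that the UCC progress loop's cudaGetDevice() call returns
# the rank's own GPU instead of hard-coded device 0.
```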
