
Poor performance with NVLink #57

Open
froody opened this issue Sep 24, 2020 · 2 comments

Comments

@froody

froody commented Sep 24, 2020

I was running some benchmarks with torch-ucc using xccl for collectives, and I noticed very bad performance compared to NCCL. See numbers here: https://gist.github.com/froody/a86a5b2c5d9f46aedba7e930f4b4e08d

It's possible this is due to a misconfiguration: I built xccl with CUDA and UCX support, but without SHARP or VMC support. My question is: is xccl expected to properly utilize NVLink when available (in this case, on a DGX-1 doing an all-reduce across all 8 GPUs)?

I also noticed while running the benchmarks that CPU utilization was very high for all workers, which seemed to be due to high-frequency polling.

Also, as you can see in the output, ucc fails while trying to reduce a 2 GB tensor, whereas nccl fails while trying to reduce an 8 GB tensor. This could be indicative of a memory leak somewhere.

Repro steps:
Run benchmark here: https://gist.github.com/froody/01ed6ce8d6ab72bd868431d793591379
Use BACKEND=ucc or BACKEND=nccl to select backend
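As a rough illustration of the backend-selection step above, a benchmark script can read the BACKEND environment variable and pass it to PyTorch's process-group initialization. This is a minimal sketch, not the gist's actual code; the default value and validation are assumptions.

```python
import os

# Select the collective backend from the BACKEND environment variable,
# mirroring the repro steps (BACKEND=ucc or BACKEND=nccl).
backend = os.environ.get("BACKEND", "nccl")
if backend not in ("ucc", "nccl"):
    raise ValueError(f"unsupported backend: {backend}")

# In the benchmark this string would then be passed to
# torch.distributed.init_process_group(backend=backend, ...).
```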

hardware: DGX-1, Driver Version: 418.116.00
cuda: 10.1
pytorch: 1.6.0
ucx: 1.9.0
torch-ucc: a277d7da24ae6e8a40bda658d0f0d4e06fcadb8b
xccl: 2e97986

@srinivas212
Contributor

Is affinitizing the MPI rank to a GPU expected to help?

@froody
Author

froody commented Sep 28, 2020

Do you mean torch.cuda.set_device()? If so, then yes. I also changed torch_ucc to use cudaGetDevice in ProcessGroupUCC::progress_loop instead of hard-coding device 0.
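The rank-to-GPU affinity fix described above can be sketched as follows. This is an assumed illustration, not the actual torch-ucc patch: the device_for_rank helper and the reliance on a launcher-provided LOCAL_RANK variable are hypothetical.

```python
import os

def device_for_rank() -> int:
    # Hypothetical helper: derive the CUDA device index from the
    # LOCAL_RANK environment variable that launchers such as torchrun
    # or mpirun wrappers typically set (assumption; falls back to 0).
    return int(os.environ.get("LOCAL_RANK", "0"))

# With torch available, each worker would call
# torch.cuda.set_device(device_for_rank()) before creating the process
# group, so that the UCC progress loop's cudaGetDevice() call returns
# the rank's own GPU instead of hard-coded device 0.
```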
