Heterogeneous Operations on CUDA and ROCm Nodes Using UCX/UCC #9985
-
Hello UCX Team,

I'm working on a high-performance computing project involving nodes with different GPU setups: some nodes with NVIDIA GPUs running CUDA and others with AMD GPUs running ROCm. I am exploring ways to perform efficient MPI operations across these heterogeneous nodes.

Is it possible to use UCX and UCC to facilitate communication and collective operations between nodes with CUDA and ROCm environments? Specifically, can UCX and UCC act as middleware to bridge the communication between RCCL (for ROCm) and NCCL (for CUDA)? If so, are there any specific configurations or build steps required to enable this interoperability?

Thank you for your guidance and support.
-
UCX supports both CUDA and ROCm and, in theory, should support such a mixed environment. However, that scenario has never been tested or optimized.
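For anyone who wants to try this, a single UCX/UCC build can be configured with both GPU back-ends enabled at configure time. A minimal sketch, assuming the default install locations `/usr/local/cuda` and `/opt/rocm` (adjust the prefixes and paths to your setup):

```shell
# Build UCX with both CUDA and ROCm memory support enabled.
cd ucx
./autogen.sh
./configure --prefix=$HOME/ucx-install \
            --with-cuda=/usr/local/cuda \
            --with-rocm=/opt/rocm
make -j && make install

# Build UCC against that UCX install, again with both GPU back-ends.
cd ../ucc
./autogen.sh
./configure --prefix=$HOME/ucc-install \
            --with-ucx=$HOME/ucx-install \
            --with-cuda=/usr/local/cuda \
            --with-rocm=/opt/rocm
make -j && make install
```

On a given node, only the back-end whose runtime is actually present will be usable; the build just needs both enabled so the same software stack can run on either node type.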
Hi @cui36,
Thanks for the question! I’ve tested and successfully run UCC collectives with UCX over TCP on AWS g4dn and g4ad instances. I’ve documented everything here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations, so feel free to check it out.
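For reference, a launch along these lines uses Open MPI's UCC collectives component with UCX forced onto TCP plus the GPU copy transports. A hedged sketch (host names, process count, and the test binary are placeholders, not from my actual run):

```shell
# Restrict UCX to TCP plus the GPU staging transports; on each node,
# UCX only activates the transports whose runtime is present there.
export UCX_TLS=tcp,cuda_copy,rocm_copy

# Use the UCX PML and enable the UCC collectives component in Open MPI.
mpirun -np 2 --host cuda-node,rocm-node \
       --mca pml ucx \
       --mca coll_ucc_enable 1 \
       --mca coll_ucc_priority 100 \
       ./my_allreduce_test
```

The repository linked above documents the exact instance types and settings used; the flags here are the standard Open MPI MCA parameters for selecting UCX and UCC.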
Regarding RDMA-capable AWS clusters with AMD GPUs: unfortunately, EFA doesn’t currently support instances with AMD GPUs (AWS documentation). I haven’t come across or heard of any alternative ways to set up RDMA-capable networking for these instances.
At the moment, AWS doesn’t offer newer AMD GPUs beyond the Radeon Pro V520 (AWS AMD EC2 Instances), which is a bit disappointing. On the other hand, Azure does provide …