
Heterogeneous Operations on CUDA and ROCm Nodes Using UCX/UCC #9985

Answered by RafalSiwek
RafalSiwek asked this question in Q&A

Hi @cui36,

Thanks for the question! I’ve tested and successfully run UCC collectives with UCX over TCP on AWS g4dn and g4ad instances. I’ve documented everything here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations, so feel free to check it out.
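
For anyone who wants a starting point before opening the repository, here is a minimal sketch of how each node type could be restricted to TCP plus its own GPU copy transport. `UCX_TLS` and `UCX_NET_DEVICES` are standard UCX environment variables, and `cuda_copy` / `rocm_copy` are UCX transport names. The helper `ucx_tcp_env`, the vendor mapping, and the exact transport list are illustrative assumptions, not necessarily the settings used in the linked repository.

```python
# Hypothetical helper: builds UCX environment overrides for a TCP-only setup.
# The transport selection below is an assumption made for illustration.
GPU_TRANSPORT = {
    "nvidia": "cuda_copy",  # UCX transport for CUDA device memory copies
    "amd": "rocm_copy",     # UCX transport for ROCm device memory copies
}


def ucx_tcp_env(vendor, net_device=None):
    """Return env vars restricting UCX to TCP, loopback, shared memory and one GPU copy transport."""
    key = vendor.lower()
    if key not in GPU_TRANSPORT:
        raise ValueError(f"unknown GPU vendor: {vendor!r}")
    env = {"UCX_TLS": f"tcp,self,sm,{GPU_TRANSPORT[key]}"}
    if net_device:
        # Optionally pin UCX to one network interface (the name is site-specific).
        env["UCX_NET_DEVICES"] = net_device
    return env


if __name__ == "__main__":
    # Print the settings each node type would export before launching a job.
    for vendor in ("nvidia", "amd"):
        print(vendor, ucx_tcp_env(vendor))
```

On each node, the returned dictionary could be merged into `os.environ` (or exported in the launch script) before the collective job starts.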

Regarding RDMA-capable AWS clusters with AMD GPUs: unfortunately, EFA doesn’t currently support instances with AMD GPUs (AWS documentation), and I’m not aware of any alternative way to set up RDMA-capable networking for these instances.

At the moment, AWS doesn’t offer newer AMD GPUs beyond the Radeon Pro V520 (AWS AMD EC2 Instances), which is a bit disappointing. On the other hand, Azure does provide …

Answer selected by RafalSiwek

tvegas1 Jan 20, 2025
Collaborator

Category
Q&A
Labels
None yet
4 participants