Heterogeneous Operations on CUDA and ROCm Nodes Using UCX/UCC #9985
-
Hello UCX Team,

I'm working on a high-performance computing project involving nodes with different GPU setups: some nodes with NVIDIA GPUs running CUDA and others with AMD GPUs running ROCm. I am exploring ways to perform efficient MPI operations across these heterogeneous nodes.

Is it possible to use UCX and UCC to facilitate communication and collective operations between nodes with CUDA and ROCm environments? Specifically, can UCX and UCC act as middleware to bridge the communication between RCCL (for ROCm) and NCCL (for CUDA)? If so, are there any specific configurations or build steps required to enable this interoperability?

Thank you for your guidance and support.
-
UCX supports both CUDA and ROCm and, in theory, should support such a mixed environment. However, that scenario has never been tested or optimized.
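For anyone who wants to try this, a single UCX/UCC build can be configured with both GPU back-ends enabled at configure time. A minimal sketch, assuming the default install locations `/usr/local/cuda` and `/opt/rocm` (adjust the prefixes and paths to your setup):

```shell
# Build UCX with both CUDA and ROCm memory support enabled.
cd ucx
./autogen.sh
./configure --prefix=$HOME/ucx-install \
            --with-cuda=/usr/local/cuda \
            --with-rocm=/opt/rocm
make -j && make install

# Build UCC against that UCX install, again with both GPU back-ends.
cd ../ucc
./autogen.sh
./configure --prefix=$HOME/ucc-install \
            --with-ucx=$HOME/ucx-install \
            --with-cuda=/usr/local/cuda \
            --with-rocm=/opt/rocm
make -j && make install
```

On a given node, only the back-end whose runtime is actually present will be usable; the build just needs both enabled so the same software stack can run on either node type.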
Hi @cui36,
Thanks for the question! I’ve tested and successfully run UCC collectives with UCX over TCP on AWS g4dn and g4ad instances. I’ve documented everything here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations, so feel free to check it out.
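For reference, a launch along these lines uses Open MPI's UCC collectives component with UCX forced onto TCP plus the GPU copy transports. A hedged sketch (host names, process count, and the test binary are placeholders, not from my actual run):

```shell
# Restrict UCX to TCP plus the GPU staging transports; on each node,
# UCX only activates the transports whose runtime is present there.
export UCX_TLS=tcp,cuda_copy,rocm_copy

# Use the UCX PML and enable the UCC collectives component in Open MPI.
mpirun -np 2 --host cuda-node,rocm-node \
       --mca pml ucx \
       --mca coll_ucc_enable 1 \
       --mca coll_ucc_priority 100 \
       ./my_allreduce_test
```

The repository linked above documents the exact instance types and settings used; the flags here are the standard Open MPI MCA parameters for selecting UCX and UCC.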
Regarding RDMA-capable AWS clusters with AMD GPUs: unfortunately, EFA doesn’t currently support instances with AMD GPUs (AWS documentation). I haven’t come across or heard of any alternative ways to set up RDMA-capable networking for these instances.
At the moment, AWS doesn’t offer newer AMD GPUs beyond the Radeon Pro V520 (AWS AMD EC2 Instances), which is a bit disappointing. On the other hand, Azure does provide …