Problem Description
In the current implementation, permuteNetIds tries to match the topology by permuting over every possible ordering of the NICs.
This is feasible when the number of NICs is small (e.g. 8), or when the number of NICs is large but MERGE_NICS is turned on (halving the number of NICs seen by RCCL).
It is not feasible when neither condition holds (e.g. 16 NICs with MERGE_NICS set to 0). In this case, permuteNetIds in rome_models.cc attempts 16 factorial (16! ≈ 2.1 × 10^13) permutations, which causes RCCL to hang: the number of permutations is simply too large to complete in a reasonable timeframe.
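To put numbers on the difference between 8 and 16 NICs, here is a small standalone sketch. It is not taken from rome_models.cc; the function names and the permutations-per-second rate are assumptions chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of orderings of n distinct NICs (n!).
// The result only fits in 64 bits for n <= 20; larger n would overflow.
constexpr std::uint64_t permutationCount(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Hypothetical estimate: seconds needed to visit every ordering when the
// search checks `permutationsPerSecond` candidates per second (assumed rate).
double exhaustiveSearchSeconds(unsigned nics, double permutationsPerSecond) {
    assert(nics <= 20 && permutationsPerSecond > 0.0);
    return static_cast<double>(permutationCount(nics)) / permutationsPerSecond;
}
```

With an assumed rate of 10 million permutations per second, `exhaustiveSearchSeconds(8, 1e7)` is about 4 milliseconds (8! = 40,320), while `exhaustiveSearchSeconds(16, 1e7)` is roughly 2.1 million seconds, or about 24 days (16! = 20,922,789,888,000).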
Operating System
Ubuntu Jammy
CPU
AMD EPYC 9534 64-Core Processor
GPU
MI300X
ROCm Version
6.3.0
ROCm Component
rccl
Steps to Reproduce
This is easily reproducible with any call stack that leads to permuteNetIds, provided the system is set up to permute over a large number of NICs.
I was able to reproduce this issue on a 2-node setup with 16 interfaces per node, running rccl-tests with MERGE_NICS off.
I have written a fix for this issue and will open a PR for it.
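This report does not describe the fix itself. Purely as an illustration of one way an exhaustive ordering search can be bounded, and not the author's patch or RCCL's code, the sketch below stops after a fixed number of candidates. The helper name `findOrderingWithBudget`, the `matches` callback, and the `maxTries` budget are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical helper, not RCCL code: tries NIC orderings in lexicographic
// order and gives up after `maxTries` candidates instead of walking all n!.
std::optional<std::vector<int>> findOrderingWithBudget(
    std::size_t nicCount,
    const std::function<bool(const std::vector<int>&)>& matches,
    std::uint64_t maxTries) {
    std::vector<int> order(nicCount);
    std::iota(order.begin(), order.end(), 0);  // start from the identity ordering

    std::uint64_t tried = 0;
    do {
        if (tried++ >= maxTries) {
            return std::nullopt;  // budget exhausted before a match was found
        }
        if (matches(order)) {
            return order;  // first ordering accepted by the caller's predicate
        }
    } while (std::next_permutation(order.begin(), order.end()));

    return std::nullopt;  // every ordering was tried and none matched
}
```

A caller would pass a predicate that checks a candidate ordering against the reference topology, and would treat an empty result as "no match within budget" rather than blocking indefinitely. A budget trades completeness for a guaranteed return, so a valid ordering that sorts late lexicographically can be missed.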
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response