[Issue]: permuteNetIds hangs when number of NICs is large and MERGE_NIC option is off #1565

Open
codinggosu opened this issue Feb 24, 2025 · 1 comment

@codinggosu

Problem Description

In the current implementation, permuteNetIds tries to match the topology by permuting over every possible ordering of the NICs.
This is feasible when the number of NICs is small (e.g. 8), or when the number of NICs is large but MERGE_NICS is turned on (halving the number of NICs seen by rccl).
It is not feasible when neither condition holds (e.g. the number of NICs is 16 and MERGE_NICS is 0). In that case, permuteNetIds in rome_models.cc tries to iterate over 16 factorial (roughly 2.1 × 10^13) permutations, so rccl hangs and never finishes: the number of permutations is simply too large to complete in a reasonable timeframe.
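To put the growth in numbers, here is a minimal sketch (not code from rccl) that computes how many orderings an exhaustive search has to visit. The function names `factorial` and `exhaustiveSearchSeconds` and the `permsPerSecond` throughput parameter are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of orderings of n distinct NICs, i.e. n!.
// The result fits in 64 bits only for n <= 20, hence the precondition.
std::uint64_t factorial(unsigned n) {
    assert(n <= 20 && "n! overflows uint64_t for n > 20");
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Rough wall-clock estimate for visiting every ordering, given an
// assumed (hypothetical) throughput in permutations checked per second.
double exhaustiveSearchSeconds(unsigned nics, double permsPerSecond) {
    return static_cast<double>(factorial(nics)) / permsPerSecond;
}
```

`factorial(8)` is 40320, while `factorial(16)` is 20922789888000, which is 518918400 times more work for only twice as many NICs. Whatever the real per-permutation cost is, going from 8 to 16 NICs multiplies the runtime by that factor.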

Operating System

Ubuntu Jammy

CPU

AMD EPYC 9534 64-Core Processor

GPU

MI300X

ROCm Version

6.3.0

ROCm Component

rccl

Steps to Reproduce

This is easily reproducible with any call stack that leads to permuteNetIds, provided the system is set up to permute over a large number of NICs.
I was able to reproduce this issue on a 2-node setup with 16 interfaces per node, running rccl-test with MERGE_NICS off.

I have written a fix for this issue and will open a PR for it.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@codinggosu
Author

#1566 addresses this
