bugfix: improve CXI support for ALCF Aurora configuration #3855

Open
wants to merge 2 commits into
base: main

ericjbohm
Contributor

Add support for 8 NICs as found on Aurora.

@ericjbohm ericjbohm added this to the 8.0.1 milestone Nov 12, 2024
@ericjbohm ericjbohm self-assigned this Nov 12, 2024
Contributor

@lvkale lvkale left a comment

This looks like a straightforward change, but since I have not looked at this code before:
does it handle only those cases (1, 4, and 8)? I see that the comment near code line 842 (by the assert) addresses this, but I am curious. I guess we will find out when the assert fails. What about cloud environments (or similar features coming to supercomputers) where a single node is allocated to multiple jobs, with each job presumably getting a subset of the NICs?

@ericjbohm
Contributor Author

> This looks like a straightforward change, but since I have not looked at this code before: does it handle only those cases (1, 4, and 8)? I see that the comment near code line 842 (by the assert) addresses this, but I am curious. I guess we will find out when the assert fails. What about cloud environments (or similar features coming to supercomputers) where a single node is allocated to multiple jobs, with each job presumably getting a subset of the NICs?

The four-NIC case is special because the optimal ordering on machines like Frontier is linked to both the GPU and the NIC, so a simple in-order mapping [0,1,2,3] would be suboptimal. The 8-NIC case could probably be handled by a general approach with a carve-out for 4, but there is only one such known configuration.
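The "general approach with a carve-out for 4" described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `select_nic`, the table name `kFourNicOrder`, and the round-robin assignment by local rank are all assumptions. Only the ordering {1,3,0,2} is taken from the diff under review.

```c
#include <assert.h>

/* Ordering for the 4-NIC case, copied from the diff in this PR. */
static const short kFourNicOrder[4] = {1, 3, 0, 2};

/* Hypothetical helper: map a node-local rank to a NIC index.
 * Assumes ranks are spread round-robin over the NICs on the node. */
int select_nic(int local_rank, int numcxi)
{
    assert(local_rank >= 0 && numcxi > 0);
    int slot = local_rank % numcxi;
    if (numcxi == 4) {
        /* Carve-out: use the tuned ordering instead of the in-order one. */
        return kFourNicOrder[slot];
    }
    /* General case (1, 8, or any other count): plain in-order mapping. */
    return slot;
}
```

Under these assumptions, a node with 8 NICs needs no table at all, while the 4-NIC node keeps its tuned ordering.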

@@ -814,7 +817,18 @@ void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
 /// short hsnOrder[numcxi]={2,1,3,0};
 if(numcxi==4)
 {
-short hsnOrder[4]= {1,3,0,2};
+short hsnOrder[8]= {1,1,3,3,0,0,2,2};
Contributor

Since myRank%quad <= numcxi (here 4), only the first 4 elements of hsnOrder would be used, so the change on line 820 may be a typo. Otherwise this looks good to me.
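The concern above can be checked with a small sketch that counts how many distinct values a lookup table yields when it is indexed by rank modulo some bound. The function `distinct_nics`, the constant `MAX_NIC`, and the table name `kEightEntryOrder` are hypothetical, and treating the index as `rank % modulus` is an assumption about how `myRank%quad` is used; only the table contents come from the diff.

```c
#include <assert.h>

enum { MAX_NIC = 16 }; /* hypothetical upper bound on NIC ids for this sketch */

/* The 8-entry table from the changed line in the diff. */
static const short kEightEntryOrder[8] = {1, 1, 3, 3, 0, 0, 2, 2};

/* Count the distinct values read from order[] when ranks 0..nranks-1
 * index it with rank % modulus. */
int distinct_nics(const short *order, int table_len, int modulus, int nranks)
{
    assert(order && modulus > 0 && modulus <= table_len);
    int seen[MAX_NIC] = {0};
    int count = 0;
    for (int rank = 0; rank < nranks; ++rank) {
        short nic = order[rank % modulus];
        assert(nic >= 0 && nic < MAX_NIC);
        if (!seen[nic]) {
            seen[nic] = 1;
            ++count;
        }
    }
    return count;
}
```

If the index is bounded by 4, `distinct_nics(kEightEntryOrder, 8, 4, 64)` reaches only the entries {1,1,3,3}, so two distinct values are ever selected; with a bound of 8 all four values in the table appear.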


3 participants