bugfix: improve CXI support for ALCF Aurora configuration #3855
base: main
This looks like a straightforward change, but since I have not looked at this code before:
Does it handle only those three cases (1, 4, and 8 NICs)? I see that the code comment at line 842 (near the assert) addresses this, but I am curious. I guess we will find out when the assert fails. What about cloud environments (or similar features coming to supercomputers) where a single node is shared among multiple jobs, with each job presumably getting a subset of the NICs?
The four-NIC case is special because the optimal ordering on machines like Frontier is linked to both the GPU and the NIC, and a simple in-order mapping [0,1,2,3] will be suboptimal. So, really, the 8-NIC case is one that could probably be handled by a general approach with a carve-out for 4. But there is only one such known configuration.
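To make the "general approach with a carve-out for 4" concrete, here is a minimal sketch. It is not the code in this PR: `pick_nic` and its parameters `local_rank` and `numcxi` are hypothetical names, and the in-order fallback for every count other than 4 is an assumption. Only the `{1,3,0,2}` ordering is taken from the diff.

```c
#include <assert.h>

/* Hypothetical helper, not the code in this PR.
 * Returns the NIC index a node-local rank would use. */
static int pick_nic(int local_rank, int numcxi)
{
    /* Four-NIC ordering copied from the diff under review. */
    static const short four_nic_order[4] = {1, 3, 0, 2};

    assert(numcxi > 0 && local_rank >= 0);

    if (numcxi == 4)
        return four_nic_order[local_rank % 4]; /* carve-out for 4 NICs */

    /* Assumed general case: plain in-order, round-robin mapping. */
    return local_rank % numcxi;
}
```

With this shape, 1 and 8 NICs (and any subset of NICs handed to a job on a shared node) fall through to the same branch, and only the four-NIC layout needs a table.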
@@ -814,7 +817,18 @@ void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
/// short hsnOrder[numcxi]={2,1,3,0};
if(numcxi==4)
{
short hsnOrder[4]= {1,3,0,2};
short hsnOrder[8]= {1,1,3,3,0,0,2,2};
Since myRank%quad <= numcxi (here 4), only the first 4 elements of hsnOrder would be used, so the change at line 820 may be a typo? Otherwise looks good to me.
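A small sketch of the indexing concern, assuming the table is indexed by `myRank % quad`. The indexing expression itself is not visible in the hunk, so that is an assumption, and `max_index_used` is a hypothetical helper written only for illustration.

```c
#include <assert.h>

/* Hypothetical illustration: the largest table index that
 * rank % quad can produce over num_ranks consecutive ranks. */
static int max_index_used(int num_ranks, int quad)
{
    int max_idx = -1;

    assert(quad > 0);

    for (int rank = 0; rank < num_ranks; rank++) {
        int idx = rank % quad; /* assumed indexing expression */
        if (idx > max_idx)
            max_idx = idx;
    }
    return max_idx;
}
```

With `quad` equal to 4 the result never exceeds 3, however many ranks there are, so entries 4 through 7 of an 8-element `hsnOrder` would never be read under this assumption.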
Add support for 8 NICs as found on Aurora.