bugfix: improve CXI support for ALCF Aurora configuration #3855

Open
wants to merge 2 commits into
base: main
18 changes: 16 additions & 2 deletions src/arch/ofi/machine.C
@@ -696,6 +696,7 @@ void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
* should not be considered predictive of proximity. That
* relationship has to be detected by other means.


* 2. HWLOC doesn't have a hwloc_get_closest_nic because... NIC
* doesn't even rate an object type in their ontology, let
* alone get first class treatment. Given that PCI devices
@@ -714,7 +715,7 @@ void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
* do *not* have such convenient labeling as something special
* needs to happen to get their linuxfs utilities to inject
* that derived information into your topology object. As an
- * interim solution we allow the user to map their cxi[0..3]
+ * interim solution we allow the user to map their cxi[0..7]
* selection using command line arguments.

* 2b. Likewise the 1:1 relationship we assume here between
@@ -741,6 +742,8 @@ void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
* CPU nodes. The user could easily be confused, so we can't
* rely on them telling us. This has to be determined at
* run time.

* 6. Aurora can apparently go up to cxi7.
*/

char *cximap=NULL;
@@ -814,7 +817,18 @@ void LrtsInit(int *argc, char ***argv, int *numNodes, int *myNodeID)
/// short hsnOrder[numcxi]={2,1,3,0};
if(numcxi==4)
{
- short hsnOrder[4]= {1,3,0,2};
+ short hsnOrder[8]= {1,1,3,3,0,0,2,2};
Contributor

Since myRank%quad <= numcxi (here 4), only the first 4 elements of hsnOrder would be used, so the change on line 820 may be a typo? Otherwise looks good to me.
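
To make the point above concrete, here is a minimal sketch that lists which table entries are read when the index runs over 0..numcxi-1. It assumes `myRank%quad` stays in that range (`quad` is not defined in this excerpt), and `entriesRead` is a hypothetical helper, not something in machine.C.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: collect the entries that hsnOrder[idx] would return
// for idx = 0 .. numcxi-1, mirroring myNet = hsnOrder[myRank % quad].
template <std::size_t N>
std::vector<short> entriesRead(const std::array<short, N>& hsnOrder, int numcxi) {
    // The sketch only makes sense when the index range fits inside the table.
    assert(numcxi >= 0 && static_cast<std::size_t>(numcxi) <= N);
    std::vector<short> used;
    for (int idx = 0; idx < numcxi; ++idx) {
        used.push_back(hsnOrder[static_cast<std::size_t>(idx)]);
    }
    return used;
}
```

With the original table `{1,3,0,2}` and numcxi = 4 this yields 1, 3, 0, 2. With the 8-entry table `{1,1,3,3,0,0,2,2}` and the same numcxi it yields 1, 1, 3, 3, so cxi0 and cxi2 would never be selected in the numcxi==4 branch. That is consistent with the suspicion that the change on line 820 was unintended.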

if(myRank%quad>numcxi)
{
CmiPrintf("Error: myrank %d quad %d myrank/quad %d\n",myRank,quad, myRank/quad);
CmiAbort("cxi mapping failure");
}
myNet=hsnOrder[myRank%quad];
}
else if(numcxi==8)
{
// no idea if this is a good ordering
short hsnOrder[8]= {0,1,2,3,4,5,6,7};
if(myRank%quad>numcxi)
{
CmiPrintf("Error: myrank %d quad %d myrank/quad %d\n",myRank,quad, myRank/quad);
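
Both branches above repeat the same range check followed by a table lookup. The check as written uses `>`, which lets an index equal to numcxi through, and for a table of numcxi entries that index is one past the end. Whether that can occur depends on how `quad` is computed, which this excerpt does not show. A minimal sketch of a shared lookup that rejects that case, where `selectNet` and its `-1` failure value are hypothetical and not part of machine.C:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: return the table entry for myRank % quad,
// or -1 when that index falls outside the table.
inline int selectNet(const short* hsnOrder, std::size_t tableLen,
                     int myRank, int quad) {
    assert(quad > 0 && myRank >= 0);  // preconditions assumed by this sketch
    const std::size_t idx = static_cast<std::size_t>(myRank % quad);
    if (idx >= tableLen) {
        return -1;  // >= also rejects idx == tableLen, one past the last entry
    }
    return hsnOrder[idx];
}
```

A caller could treat `-1` the way the diff treats a failed check, reporting the rank and quad and then aborting with "cxi mapping failure".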