Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial hwloc support #1108

Merged

Conversation

philipmarshall21
Copy link
Collaborator

@philipmarshall21 philipmarshall21 commented Jan 22, 2024

This PR is a subset of the changes PR #1107 and is intended to isolate the integration of hwloc as an optional dependency to SOS from the addition of multi-NIC functionality

The changes to the oac_* files reflect the current status of the main branch (at the time of this PR, that is commit c1cfc910d92af43f8c27807a9a84c9c13f4fbc65) as the upstream repo does not have any tags.

The changes to the opal_* files align with the v5.0.1 tag of the upstream repo.

@davidozog
Copy link
Member

Are the upstream OPAL changes taken from a release or a particular commit of OMPI? Either way, perhaps we should leave a note on this PR in case it ends up being helpful later.

@philipmarshall21
Copy link
Collaborator Author

Are the upstream OPAL changes taken from a release or a particular commit of OMPI? Either way, perhaps we should leave a note on this PR in case it ends up being helpful later.

The changes to the oac_* files reflect the current status of the main branch (currently, that is commit c1cfc910d92af43f8c27807a9a84c9c13f4fbc65) as the upstream repo does not have any tags.

The changes to the opal_* files align with the v5.0.1 tag of the upstream repo.

Copy link
Collaborator

@wrrobin wrrobin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can explore other use-cases for hwloc later.

@davidozog
Copy link
Member

Side note: The UCX row with pmi-mpi looks slow in testing, but I think we can pretty safely ignore that...

@philipmarshall21
Copy link
Collaborator Author

Two quick questions @wrrobin @davidozog:

  1. With the addition of hwloc support, should we expect to see the rpath for hwloc when doing something like oshcc -show (rpath for libfabric and libsma are being set correctly)? If so, this is currently not the case, and I'll need to look into why this is not happening.
  2. In the changes to init.c that add hwloc API calls, should the application exit if the calls fail? Right now an ERROR message is printed but the application continues running.

@wrrobin
Copy link
Collaborator

wrrobin commented Feb 1, 2024

Two quick questions @wrrobin @davidozog:

  1. With the addition of hwloc support, should we expect to see the rpath for hwloc when doing something like oshcc -show (rpath for libfabric and libsma are being set correctly)? If so, this is currently not the case, and I'll need to look into why this is not happening.
  2. In the changes to init.c that add hwloc API calls, should the application exit if the calls fail? Right now an ERROR message is printed but the application continues running.

For 1. I would guess we should see the rpath, considering your changes in the PR. Are you using any other flags that might skip the configury code?
For 2. my thought is we should not terminate. If one of the hwloc APIs fail or return error, it should print a warning that hwloc is not enabled for some reason. But, we should let the application proceed. Do other runtimes behave the same?

@philipmarshall21
Copy link
Collaborator Author

Two quick questions @wrrobin @davidozog:

  1. With the addition of hwloc support, should we expect to see the rpath for hwloc when doing something like oshcc -show (rpath for libfabric and libsma are being set correctly)? If so, this is currently not the case, and I'll need to look into why this is not happening.
  2. In the changes to init.c that add hwloc API calls, should the application exit if the calls fail? Right now an ERROR message is printed but the application continues running.

For 1. I would guess we should see the rpath, considering your changes in the PR. Are you using any other flags that might skip the configury code? For 2. my thought is we should not terminate. If one of the hwloc APIs fail or return error, it should print a warning that hwloc is not enabled for some reason. But, we should let the application proceed. Do other runtimes behave the same?

Ok, I've made the suggested changes (ensure rpath for hwloc is set in configure.ac, failed hwloc API calls now send warning message rather than error).

Copy link
Member

@davidozog davidozog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipmarshall21 - I ran into a couple minor things but pending those, I think this PR is ready to merge. All the RPATH stuff is working on my end, thanks for patching that.

#endif // HWLOC_ENFORCE_SINGLE_SOCKET || HWLOC_ENFORCE_SINGLE_NUMA_NODE
hwloc_issue_topology:
RAISE_WARN_MSG("Please verify your hwloc installation\n");
hwloc_exit:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a minor thing, but prob worth addressing if possible - we see this warning when building with hwloc but without enforcing single socket or numa node:

../../src/init.c:452:5: warning: label ‘hwloc_exit’ defined but not used [-Wunused-label]
  452 |     hwloc_exit:

One option might be to remove hwloc_issue_topology (it's not adding much value anyway), then we can remove hwloc_exit (and revive it if it's ever needed).

[m4_set_remove([oac_var_scope_active_set], oac_var_scope_var)])dnl
oac_var_scope_pop oac_var_scope_stack
m4_popdef([oac_var_scope_stack])dnl
])dnl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor, but looks like we lost a newline relative to upstream on this file.

@philipmarshall21 philipmarshall21 merged commit 0e4cff6 into Sandia-OpenSHMEM:main Feb 7, 2024
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants