Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hmem/cuda: avoid stub loading at runtime #10365

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aws-nslick
Copy link
Contributor

When the CUDA toolkit is installed, a set of "stub" libraries are installed under /usr/local/cuda*/lib64/stubs/. These libraries include a SONAME field with a `.1' suffix, but the filenames of these stubs are bare. eg:

$ readelf -d /usr/local/cuda-12.5/lib64/stubs/libnvidia-ml.so | grep soname
0x000000000000000e (SONAME) Library soname: [libnvidia-ml.so.1]

The CUDA toolkit does not include any library file with the name libnvidia-ml.so.1 (or libcuda.so.1, etc.), as these are provided by the driver package. This disconnect between the stub filename in the toolkit and the SONAME within it is done intentionally to allow linking with the stub at build time, while ensuring it's never loaded at runtime.

In normal dynamic linking cases (ie: without dlopen), the SONAME field of libnvidia-ml.so.1 is used in the DT_NEEDED tag, where that filename can only come from a driver package and this ensures that the stub library will never match.

Match the same behavior and provide .1 suffixes to dlopen where appropriate for NVIDIA libraries.

@aws-nslick
Copy link
Contributor Author

bot:aws:retest

Copy link
Contributor

@shijin-aws shijin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename the PR/commit title to be hmem/cuda: avoid stub loading at runtime

@aws-nslick
Copy link
Contributor Author

Can you rename the PR/commit title to be hmem/cuda: avoid stub loading at runtime

Done, thanks.

One other thing I'd ask reviewers to think about is to consider searching inside CUDA_HOME/lib64/stubs in autoconf, instead of today where it searches only in CUDA_HOME/lib64 and CUDA_HOME/lib, which would make it possible to build libfabric without any build-time dependency on the driver; the cuda toolkit alone would be sufficient for all cases.

@shijin-aws shijin-aws changed the title fix(cuda): avoid stub loading at runtime hmem/cuda: avoid stub loading at runtime Sep 9, 2024
@shijin-aws
Copy link
Contributor

bot:aws:retest

@aws-nslick
Copy link
Contributor Author

Please wait on an ack from @bwbarrett before merging as there was some disagreement on a similar change here and I want to make sure we're aligned.

@bwbarrett
Copy link
Contributor

Note that I objected to this commit in the OFI plugin because it doesn't match what Nvidia has done in the past with NCCL. If we want to make a change, it should be to adopt cudaGetDriverEntryPoint().

@aws-nslick
Copy link
Contributor Author

@bwbarrett such changes were made on the ofi plugin side; do you have a problem with this approach for libfabric?

When the CUDA toolkit is installed, a set of "stub" libraries are
installed under /usr/local/cuda*/lib64/stubs/. These libraries include a
SONAME field with a `.1' suffix, but the filenames of these stubs are
bare. eg:

 > $ readelf -d /usr/local/cuda-12.5/lib64/stubs/libnvidia-ml.so | grep soname
 > 0x000000000000000e (SONAME)  Library soname: [libnvidia-ml.so.1]

The CUDA toolkit does not include any library file with the name
`libnvidia-ml.so.1` (or `libcuda.so.1`, etc.), as these are provided by
the driver package. This disconnect between the stub filename in the
toolkit and the SONAME within it is done intentionally to allow linking
with the stub at build time, while ensuring it's never loaded at
runtime.

In normal dynamic linking cases (ie: without dlopen), the SONAME field
of `libnvidia-ml.so.1` is used in the DT_NEEDED tag, where that filename
can only come from a driver package and this ensures that the stub
library will never match.

Match the same behavior and provide `.1` suffixes to dlopen where
appropriate for NVIDIA libraries.

Signed-off-by: Nicholas Sielicki <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants