hmem/cuda: avoid stub loading at runtime #10365

aws-nslick · 2024-09-08T17:12:43Z

When the CUDA toolkit is installed, a set of "stub" libraries is installed under /usr/local/cuda*/lib64/stubs/. These libraries include a SONAME field with a `.1` suffix, but the filenames of these stubs are bare, e.g.:

$ readelf -d /usr/local/cuda-12.5/lib64/stubs/libnvidia-ml.so | grep soname
0x000000000000000e (SONAME) Library soname: [libnvidia-ml.so.1]

The CUDA toolkit does not include any library file with the name libnvidia-ml.so.1 (or libcuda.so.1, etc.), as these are provided by the driver package. This disconnect between the stub's filename in the toolkit and the SONAME within it is intentional: it allows linking against the stub at build time while ensuring the stub is never loaded at runtime.

In normal dynamic linking (i.e., without dlopen), the linker copies the stub's SONAME, libnvidia-ml.so.1, into the DT_NEEDED tag of the resulting binary. A file with that name can only come from a driver package, so the stub library will never match at runtime.

Match this behavior by passing `.1`-suffixed names to dlopen where appropriate for NVIDIA libraries.
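For illustration, a minimal sketch of the idea in C. This is not the code in this PR: `versioned_soname` and `open_driver_lib` are hypothetical helper names, and the ABI suffix is passed as a parameter only to keep the example generic. The point is that dlopen is given the versioned name and never falls back to the bare one.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not libfabric's actual code): build "<base>.<abi>",
 * e.g. "libnvidia-ml.so" + 1 -> "libnvidia-ml.so.1".
 * Returns 0 on success, -1 if the name does not fit in the buffer. */
static int versioned_soname(char *out, size_t len, const char *base, int abi)
{
	int n = snprintf(out, len, "%s.%d", base, abi);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Open the versioned name only. There is deliberately no fallback to the
 * bare name: a bare "libfoo.so" can resolve to a toolkit stub if the stubs
 * directory is on the library search path, while "libfoo.so.1" is only
 * shipped by the driver package. Returns NULL on failure. */
static void *open_driver_lib(const char *base, int abi)
{
	char name[256];

	if (versioned_soname(name, sizeof(name), base, abi))
		return NULL;

	return dlopen(name, RTLD_NOW);
}
```

A caller would use `open_driver_lib("libnvidia-ml.so", 1)` and treat a NULL return as "driver not installed" (details are available from `dlerror()`), rather than retrying with the unsuffixed name. On older glibc versions this needs to be linked with `-ldl`.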

aws-nslick · 2024-09-09T18:11:39Z

bot:aws:retest

shijin-aws

Can you rename the PR/commit title to `hmem/cuda: avoid stub loading at runtime`?

aws-nslick · 2024-09-09T18:26:07Z

> Can you rename the PR/commit title to be hmem/cuda: avoid stub loading at runtime

Done, thanks.

One other thing I'd ask reviewers to consider: autoconf could also search CUDA_HOME/lib64/stubs, rather than only CUDA_HOME/lib64 and CUDA_HOME/lib as it does today. That would make it possible to build libfabric without any build-time dependency on the driver; the CUDA toolkit alone would be sufficient in all cases.

shijin-aws · 2024-09-10T17:40:11Z

bot:aws:retest

aws-nslick · 2024-09-10T22:15:34Z

Please wait on an ack from @bwbarrett before merging as there was some disagreement on a similar change here and I want to make sure we're aligned.

bwbarrett · 2024-09-11T08:59:50Z

Note that I objected to this commit in the OFI plugin because it doesn't match what Nvidia has done in the past with NCCL. If we want to make a change, it should be to adopt cudaGetDriverEntryPoint().

aws-nslick · 2024-10-01T20:58:12Z

@bwbarrett such changes were made on the ofi plugin side; do you have a problem with this approach for libfabric?

Signed-off-by: Nicholas Sielicki <[email protected]>

shijin-aws requested a review from j-xiong September 9, 2024 18:17

shijin-aws approved these changes Sep 9, 2024

aws-nslick force-pushed the avoid-stub-loading branch from 9180553 to 1ca4e0a Compare September 9, 2024 18:21

shijin-aws changed the title ~~fix(cuda): avoid stub loading at runtime~~ hmem/cuda: avoid stub loading at runtime Sep 9, 2024

j-xiong approved these changes Sep 10, 2024

aws-nslick mentioned this pull request Sep 10, 2024

fix(cuda): avoid loading stub aws/aws-ofi-nccl#581

Closed

aws-nslick force-pushed the avoid-stub-loading branch from 1ca4e0a to 907688f Compare October 27, 2024 23:43
