[Tracking] Handling multiple versions of CUDA libraries in the same closure #341650

Open
ConnorBaker opened this issue Sep 13, 2024 · 5 comments
Labels
0.kind: bug 5. scope: tracking Long-lived issue tracking long-term fixes or multiple sub-problems 6.topic: cuda 6.topic: documentation Meta-discussion about documentation and its workflow

Comments

@ConnorBaker
Contributor

ConnorBaker commented Sep 13, 2024

Describe the bug

When multiple versions of CUDA libraries are in the same closure, the order in which they are discovered and loaded by different packages is unclear.

This issue is meant to track progress toward documentation, tools, and tests that communicate and ensure that libraries are loaded in a predictable and deterministic fashion.

As an example (courtesy of @SomeoneSerge), we should have tests which ensure both of the following work:

# test1
import torch
torch.randn(10, 10, device="cuda").sum().item()
import cv2
# do something with cv2 and cuda

# test2
import cv2
# do something with cv2 and cuda
import torch
torch.randn(10, 10, device="cuda").sum().item()
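
A minimal sketch of how the two orderings above could be driven from one harness, assuming each ordering should run in its own interpreter so that libraries loaded by one ordering cannot leak into the other. The name `run_in_every_order` and the snippet dictionary are hypothetical and not part of nixpkgs; the cv2 body is left as the same placeholder used above.

```python
import itertools
import subprocess
import sys


def run_in_every_order(snippets, python=sys.executable, timeout=600):
    """Run the snippets in every possible order, one fresh interpreter per order.

    `snippets` maps a name to a piece of Python source. The result maps each
    ordering (a tuple of names) to the interpreter's exit code. On POSIX a
    negative exit code means the process was killed by a signal, which is how
    a crash such as a segfault in a native library would show up.
    """
    results = {}
    for order in itertools.permutations(snippets):
        program = "\n".join(snippets[name] for name in order)
        proc = subprocess.run(
            [python, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
        results[order] = proc.returncode
    return results


# Hypothetical snippet table mirroring test1/test2; the cv2 part is a placeholder.
SNIPPETS = {
    "torch": 'import torch\ntorch.randn(10, 10, device="cuda").sum().item()',
    "cv2": "import cv2\n# do something with cv2 and cuda",
}
```

Calling `run_in_every_order(SNIPPETS)` on a machine with a GPU and both packages installed would return one exit code per ordering; a test could then assert that every value is `0`.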

Related Links

PRs
Issues

Notify maintainers

@NixOS/cuda-maintainers


Add a 👍 reaction to issues you find important.

@ConnorBaker ConnorBaker added 0.kind: bug 6.topic: documentation Meta-discussion about documentation and its workflow 6.topic: cuda 5. scope: tracking Long-lived issue tracking long-term fixes or multiple sub-problems labels Sep 13, 2024
@samuela
Member

samuela commented Sep 13, 2024

This seems like a Hard Problem, and I'm not aware of any package managers that actually solve it. Is there a need to fix this?

@Princemachiavelli
Contributor

When I need to manually debug these kinds of issues, I use LD_DEBUG=libs, but its output isn't easy to parse. You might find the LD_AUDIT interface easier and more informative. Unfortunately, the only user-space tool that implements it, latrace, appears to be unmaintained and non-functional. Luckily, the LD_AUDIT interface is only a few functions.

Is the thought that CUDA libraries are loading in a non-predictable fashion due to the use of dlopen or some source of non-determinism affecting the dynamic linker /lib/ld-linux.so.2?
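
For the LD_DEBUG=libs route, a rough parsing sketch might look like the following. It assumes the line shapes glibc's dynamic linker usually prints (`find library=...`, `trying file=...`, `calling init: ...`, each prefixed by a PID); that format is diagnostic output rather than a stable interface, so the regular expression is an assumption, and `parse_ld_debug` is a hypothetical helper name.

```python
import re

# Assumed line shapes (glibc diagnostic output, not a stable interface):
#   <pid>:  find library=<soname> [0]; searching
#   <pid>:    trying file=<path>
#   <pid>:  calling init: <path>
# Paths containing whitespace would be truncated by this pattern.
_EVENT = re.compile(
    r"^\s*(?P<pid>\d+):\s+"
    r"(?:find library=(?P<find>\S+)"
    r"|trying file=(?P<trying>\S+)"
    r"|calling init: (?P<init>\S+))"
)


def parse_ld_debug(text):
    """Return (pid, kind, value) tuples in the order the events appear.

    `kind` is one of "find", "trying", or "init". Lines that do not match
    any of the assumed shapes are skipped.
    """
    events = []
    for line in text.splitlines():
        match = _EVENT.match(line)
        if match is None:
            continue
        for kind in ("find", "trying", "init"):
            value = match.group(kind)
            if value is not None:
                events.append((int(match.group("pid")), kind, value))
    return events
```

Filtering the result for `kind == "init"` gives the order in which library initializers ran, which is one way to compare two runs of the same program.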

@samuela
Member

samuela commented Sep 14, 2024

> Is the thought that CUDA libraries are loading in a non-predictable fashion due to the use of dlopen or some source of non-determinism affecting the dynamic linker /lib/ld-linux.so.2?

That's my understanding. Other uses should be explicit in RUNPATH/RPATH.
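
Whether a library arrived through dlopen or through RUNPATH/RPATH, the end result can be inspected after the fact. The sketch below is not something proposed in this thread; it assumes Linux and reads the `/proc/<pid>/maps` format to report library stems that are mapped from more than one file. All function names are hypothetical.

```python
import os
import re
from collections import defaultdict

# Split a file name such as "libfoo.so.12.4" into stem "libfoo" and version "12.4".
_SONAME = re.compile(r"^(?P<stem>.+?)\.so(?:\.(?P<version>[\d.]+))?$")


def mapped_libraries(maps_text):
    """Return the set of shared-object paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and _SONAME.match(os.path.basename(fields[5])):
            paths.add(fields[5])
    return paths


def duplicated_libraries(paths):
    """Group paths by library stem; keep stems that are mapped from several files."""
    by_stem = defaultdict(set)
    for path in paths:
        stem = _SONAME.match(os.path.basename(path)).group("stem")
        by_stem[stem].add(path)
    return {stem: sorted(ps) for stem, ps in by_stem.items() if len(ps) > 1}


def duplicates_in_this_process():
    """Linux only: check the libraries mapped into the current process."""
    with open("/proc/self/maps") as f:
        return duplicated_libraries(mapped_libraries(f.read()))
```

Calling `duplicates_in_this_process()` after both imports have run would list any library that ended up loaded from two different store paths. It says nothing about load order, only about the final state.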

@SomeoneSerge
Contributor

> You might find using the LD_AUDIT interface easier and more informative

Oh! We were thinking of parsing LD_DEBUG, but you're making a good point! FWIW, there are more examples of using LD_AUDIT, e.g. in guix pack, flox, and now cachix/devenv#773 (comment)

@SomeoneSerge
Contributor

However, note that for this particular test we don't have to interact with ld.so at all; we just verify that the program does not crash when using two potentially conflicting modules.
