[Tracking] Handling multiple versions of CUDA libraries in the same closure #341650

ConnorBaker · 2024-09-13T19:26:23Z

Describe the bug

When multiple versions of CUDA libraries are in the same closure, the order in which they are discovered and loaded by different packages is unclear.

This issue is meant to track progress toward documentation, tools, and tests to communicate and ensure that libraries are loaded in a predictable and deterministic fashion.

As an example (courtesy of @SomeoneSerge), we should have tests which ensure both of the following work:

# test1
import torch
torch.randn(10, 10, device="cuda").sum().item()
import cv2
# do something with cv2 and cuda

# test2
import cv2
# do something with cv2 and cuda
import torch
torch.randn(10, 10, device="cuda").sum().item()
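
One way the two orders above could be exercised, sketched under assumptions: each order runs in a fresh interpreter so that libraries loaded by one order cannot influence the other, and any non-zero exit status counts as a failure. The names `run_isolated`, `failing_orders`, and `ORDERS` are hypothetical, and the cv2 step is left as the same placeholder used in the tests above.

```python
import subprocess
import sys
import textwrap


def run_isolated(snippet: str, timeout: float = 300.0) -> int:
    """Run `snippet` in a fresh Python process and return its exit status.

    A negative value means the child was killed by a signal (e.g. SIGSEGV).
    """
    proc = subprocess.run(
        [sys.executable, "-c", textwrap.dedent(snippet)],
        timeout=timeout,
    )
    return proc.returncode


# The two steps from the issue; the cv2 step is intentionally still a placeholder.
TORCH_STEP = 'import torch\ntorch.randn(10, 10, device="cuda").sum().item()\n'
CV2_STEP = "import cv2\n# do something with cv2 and cuda\n"

ORDERS = {
    "test1": TORCH_STEP + CV2_STEP,
    "test2": CV2_STEP + TORCH_STEP,
}


def failing_orders(orders: dict[str, str]) -> list[str]:
    """Return the names of the orders whose child process did not exit with 0."""
    return [name for name, code in orders.items() if run_isolated(code) != 0]
```

On a machine with a GPU and both packages installed, `failing_orders(ORDERS)` would be expected to return an empty list if neither import order crashes.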

Notify maintainers

@NixOS/cuda-maintainers

Add a 👍 reaction to issues you find important.

samuela · 2024-09-13T22:09:47Z

This seems like a Hard Problem, and I'm not aware of any package managers that actually solve it. Is there a need to fix this?

Princemachiavelli · 2024-09-14T04:43:55Z

When I need to manually debug these kinds of issues, I use LD_DEBUG=libs, but its output isn't easy to parse. You might find using the LD_AUDIT interface easier and more informative. Unfortunately, the only user space tool that implements it, latrace, appears to be unmaintained and not functional. Luckily, the LD_AUDIT interface is only a few functions.
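
As a rough illustration of what parsing LD_DEBUG=libs output might involve, the sketch below records, for each requested library, the last path the loader tried. It assumes glibc's usual `find library=...` and `trying file=...` line shapes, which are diagnostic output rather than a stable interface; the function name `resolved_libraries` is hypothetical.

```python
import re

# Line shapes assumed from glibc's LD_DEBUG=libs output; not a stable format.
FIND_RE = re.compile(r"find library=(\S+)")
TRY_RE = re.compile(r"trying file=(\S+)")


def resolved_libraries(ld_debug_output: str) -> dict[str, str]:
    """Map each requested library name to the last file the loader tried for it.

    The loader stops searching once a candidate opens successfully, so the last
    tried file is usually the one that was loaded. If the library was never
    found, the last tried file is simply the final failed attempt.
    """
    resolved: dict[str, str] = {}
    current = None
    for line in ld_debug_output.splitlines():
        found = FIND_RE.search(line)
        if found:
            current = found.group(1)
            continue
        tried = TRY_RE.search(line)
        if tried and current is not None:
            resolved[current] = tried.group(1)
    return resolved
```

The input would be the loader's diagnostic stream, which goes to stderr unless LD_DEBUG_OUTPUT redirects it to a file.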

Is the thought that CUDA libraries are loading in an unpredictable fashion due to the use of dlopen, or due to some source of non-determinism affecting the dynamic linker /lib/ld-linux.so.2?

samuela · 2024-09-14T21:17:30Z

> Is the thought that CUDA libraries are loading in an unpredictable fashion due to the use of dlopen, or due to some source of non-determinism affecting the dynamic linker /lib/ld-linux.so.2?

That's my understanding. Other uses should be explicit in RUNPATH/RPATH.
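
Because libraries pulled in through dlopen are not listed in RUNPATH/RPATH, one hedged way to see what a process actually mapped is to read its memory map after the imports have run. The sketch below assumes Linux and the usual `/proc/<pid>/maps` column layout; the function names and the `needle` parameter are hypothetical.

```python
from pathlib import Path


def mapped_libraries(maps_text: str, needle: str = "") -> list[str]:
    """Return distinct file paths from a /proc/<pid>/maps dump whose file name contains `needle`."""
    seen: list[str] = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname column
        path = fields[5].strip()
        # Skip pseudo-entries such as [heap] and [stack].
        if path.startswith("/") and needle in Path(path).name and path not in seen:
            seen.append(path)
    return seen


def mapped_libraries_of_self(needle: str = "") -> list[str]:
    """Inspect the current process (Linux only)."""
    return mapped_libraries(Path("/proc/self/maps").read_text(), needle)
```

Calling `mapped_libraries_of_self("libcu")` after both imports would list the matching files mapped into the process; two different store paths for the same library name would suggest that two versions were loaded side by side.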

SomeoneSerge · 2024-09-15T18:58:00Z

> You might find using the LD_AUDIT interface easier and more informative

Oh! We were thinking of parsing LD_DEBUG, but you're making a good point! FWIW there are more examples of using LD_AUDIT, e.g. in guix pack, flox, and now cachix/devenv#773 (comment).

SomeoneSerge · 2024-09-15T18:59:04Z

However, note that for this particular test we don't even have to interact with ld.so in any particular way; we just verify that the program does not crash when using two potentially conflicting modules.

ConnorBaker added the labels 0.kind: bug, 6.topic: documentation, 6.topic: cuda, and 5. scope: tracking on Sep 13, 2024

ConnorBaker mentioned this issue on Sep 13, 2024 in opencv: misc cleanups; fix CUDA build #339619 (open, 13 tasks)
