Save (Win11) & Load (Ubuntu) Leading to "NVIDIA-SMI couldn't find libnvidia-ml.so library in your system" #674

Open
SanBingYouYong opened this issue Sep 4, 2024 · 2 comments

@SanBingYouYong

Machines:

  1. Windows 11 Laptop, Docker Desktop with nvidia-container-toolkit and WSL2 back-end.
  2. Remote server, running Ubuntu 20.04.6 LTS, docker with nvidia-container-toolkit installed.

Both systems run containers with the --gpus all flag without problems, and nvidia-smi prints the expected output on each.

Problem:

  • Ran docker save -o saved_image.tar image_name:latest on the laptop and uploaded the tar to the remote server.
  • Loaded the image with docker load -i saved_image.tar on the remote server.
  • Ran the image with docker run --gpus all -it image_name:latest.
  • Inside the container, nvidia-smi fails with this error:
    • NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.
  • However, nvcc -V works fine.

Investigation so far:

  • Found posts from years ago describing similar cases where nvcc works but nvidia-smi does not. I tried the following mitigations from those posts, with no luck:
    • adding /usr/lib/x86... to PATH
    • running ldconfig and then nvidia-smi
      • ldconfig prints the following:
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libdxcore.so is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-ml.so is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libcuda.so.1 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libcuda.so is empty, not checked.
/sbin/ldconfig.real: /usr/lib/x86_64-linux-gnu/libcuda.so.1 is not a symbolic link

/sbin/ldconfig.real: /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 is not a symbolic link

/sbin/ldconfig.real: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 is not a symbolic link

On the laptop where this image was saved, ldconfig prints only the following:

/sbin/ldconfig.real: /usr/lib/x86_64-linux-gnu/libcuda.so.1 is not a symbolic link

I blindly tried copying the "empty" files listed above from the original image into the new image, but it turned out they are empty files in the original image too.
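To check which of these library files are actually zero-byte placeholders inside an image, something like the following could be run in a container started without --gpus all. This is a minimal sketch: the function name find_empty_driver_libs is hypothetical, and the file name patterns are simply taken from the ldconfig output above.

```shell
# List zero-byte regular files in a library directory whose names match
# the libraries that ldconfig reported as empty in this issue.
# find_empty_driver_libs and its default directory are illustrative
# assumptions, not part of any NVIDIA or Docker tooling.
find_empty_driver_libs() {
    dir="${1:-/usr/lib/x86_64-linux-gnu}"
    # -type f skips symlinks; -size 0 matches only files with no content.
    find "$dir" -maxdepth 1 -type f -size 0 \
        \( -name 'libnvidia-ml.so*' \
           -o -name 'libcuda.so*' \
           -o -name 'libcudadebugger.so*' \
           -o -name 'libdxcore.so*' \) \
        | LC_ALL=C sort
}
```

Called with no argument, it scans /usr/lib/x86_64-linux-gnu. No output means no zero-byte files with those names were found in that directory.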

I also tried loading the tar on the laptop itself, and that works fine. So I suspect this might be a problem with images exported from the WSL2 backend?

@SanBingYouYong
Author

Update: the reverse direction does not work either.

  • Pushed the image from the Ubuntu server and pulled it on the Win11 laptop. Trying to run it gives the following error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/fb664bfd78f00385d6fe4eb0f0a8f7ae9a1738b46f6e759004b44073c8ecb90c/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
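The path in the mount error can be mapped back to a path inside the image, which makes it easier to compare against the files ldconfig reported as empty. A small sketch, assuming the error text has the same shape as the message above; conflict_path is a hypothetical helper name:

```shell
# Read error text on stdin and print the in-container path that
# nvidia-container-cli could not create because it already exists.
# Assumes the "file creation failed: <overlay dir>/merged<path>: file exists"
# shape seen in this issue; conflict_path is a hypothetical helper.
conflict_path() {
    sed -n 's|.*file creation failed: .*/merged\(/[^:]*\): file exists.*|\1|p'
}

# Possible usage (not run here):
#   docker run --gpus all -it image_name:latest 2>&1 | conflict_path
```

For the message above this yields /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1, one of the files that ldconfig flagged as empty on the Ubuntu side.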

This does look like an incompatibility between the usual "linux" backend and the WSL2 backend of the Docker engine.

However, the official NVIDIA Docker images run fine. I am not sure whether adding more layers on top of them would break that, though.

@SanBingYouYong
Author

Update: I also tried enabling the "use containerd" option in Docker Desktop. The problem persists.
