[v1.17.0, mode = "cdi"] OCI runtime create failed: 'failed to create link'/'failed to check if link exists: unexpected link target' #772

Open
benz0li opened this issue Nov 4, 2024 · 11 comments

benz0li commented Nov 4, 2024

Full error message:

failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="yyyy-mm-ddThh:mm:ss+hh:mm" level=error msg="failed to create link [libGLX_nvidia.so.535.183.01 /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0]: failed to check if link exists: unexpected link target: libGLX_mesa.so.0": unknown"

Further information:

ls -al /usr/lib/x86_64-linux-gnu
...
lrwxrwxrwx  1 root root        27 month  dd hh:mm libGLX_indirect.so.0 -> libGLX_nvidia.so.535.183.01
lrwxrwxrwx  1 root root        27 month  dd hh:mm libGLX_nvidia.so.0 -> libGLX_nvidia.so.535.183.01
-rwxr-xr-x  1 root root   1195552 month  dd hh:mm libGLX_nvidia.so.535.183.01
...

benz0li commented Nov 4, 2024

@elezar It works fine with v1.16.2, so I rolled back to that version.
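
For anyone else rolling back: on a Debian/Ubuntu host the downgrade looks roughly like this (a sketch; it assumes the 1.16.2-1 package revision is still available in the configured repository):

sudo apt-get install --allow-downgrades \
  nvidia-container-toolkit=1.16.2-1 \
  nvidia-container-toolkit-base=1.16.2-1 \
  libnvidia-container-tools=1.16.2-1 \
  libnvidia-container1=1.16.2-1
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml   # regenerate the CDI spec with the downgraded version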

elezar self-assigned this Nov 4, 2024

elezar commented Nov 4, 2024

@benz0li could you provide information on the image that you're running?

I would assume that a symlink libGLX_indirect.so.0 -> libGLX_mesa.so.0 already exists in the image in this case.

SurenNihalani commented

This happened to us as well. I think we should roll this back to prevent more people from hitting it.

benz0li commented Nov 5, 2024

@benz0li could you provide information on the image that you're running?

Image: glcr.b-data.ch/jupyterlab/cuda/r/verse:4.4.2

Source: https://github.com/b-data/jupyterlab-r-docker-stack/blob/4786ca20927a2d34a8d4dcdb2f2b4fdfcc55660e/verse/latest.Dockerfile

I would assume that a symlink libGLX_indirect.so.0 -> libGLX_mesa.so.0 already exists in the image in this case.

Yes. Due to the installation of libgl1-mesa-dev and thus libglx-mesa0.

Source: https://github.com/b-data/jupyterlab-r-docker-stack/blob/4786ca20927a2d34a8d4dcdb2f2b4fdfcc55660e/verse/latest.Dockerfile#L43
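
The pre-existing link can be confirmed directly from the image (a sketch using the tag above; the image's start-up banner is omitted here):

docker run --rm glcr.b-data.ch/jupyterlab/cuda/r/verse:4.4.2 \
  readlink /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
# expected: libGLX_mesa.so.0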

benz0li commented Nov 5, 2024

The difference between the old (v1.16.2) and new (v1.17.0) CDI specifications:

diff --git a/etc/cdi/nvidia-1.16.2.yaml.bak b/etc/cdi/nvidia.yaml
index 8eb6b1e..2b6936b 100644
--- a/etc/cdi/nvidia-1.16.2.yaml.bak
+++ b/etc/cdi/nvidia.yaml
@@ -13,8 +13,6 @@ containerEdits:
     - nvidia-cdi-hook
     - create-symlinks
     - --link
-    - libnvidia-allocator.so.535.183.01::/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
-    - --link
     - ../libnvidia-allocator.so.1::/usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so
     - --link
     - libnvidia-vulkan-producer.so.535.183.01::/usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so
@@ -22,11 +20,24 @@ containerEdits:
     - libglxserver_nvidia.so.535.183.01::/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
     hookName: createContainer
     path: /usr/bin/nvidia-cdi-hook
+  - args:
+    - nvidia-cdi-hook
+    - create-symlinks
+    - --link
+    - libGLX_nvidia.so.535.183.01::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
+    - --link
+    - libnvidia-opticalflow.so.1::/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so
+    - --link
+    - libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so
+    hookName: createContainer
+    path: /usr/bin/nvidia-cdi-hook
   - args:
     - nvidia-cdi-hook
     - update-ldcache
     - --folder
     - /usr/lib/x86_64-linux-gnu
+    - --folder
+    - /usr/lib/x86_64-linux-gnu/vdpau
     hookName: createContainer
     path: /usr/bin/nvidia-cdi-hook
   mounts:
@@ -332,6 +343,13 @@ containerEdits:
     - nosuid
     - nodev
     - bind
+  - containerPath: /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.183.01
+    hostPath: /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.183.01
+    options:
+    - ro
+    - nosuid
+    - nodev
+    - bind
   - containerPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
     hostPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
     options:

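For completeness: assuming both specs were generated with nvidia-ctk, the comparison above can be reproduced as follows (the .bak file is a manually kept copy of the v1.16.2 spec):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
diff -u /etc/cdi/nvidia-1.16.2.yaml.bak /etc/cdi/nvidia.yaml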

benz0li commented Nov 5, 2024

@elezar Using v1.17.0, it works with docker¹ and mode = "auto":

docker run --rm -ti glcr.b-data.ch/jupyterlab/cuda/r/verse bash
==========
== CUDA ==
==========

CUDA Version 12.6.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

=============
== JUPYTER ==
=============

Entered start.sh with args: bash
Running hooks in: /usr/local/bin/start-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/start-notebook.d/10-populate.sh
Done running hooks in: /usr/local/bin/start-notebook.d
Running hooks in: /usr/local/bin/before-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/before-notebook.d/10-env.sh
TZ is set to Etc/UTC (/etc/localtime and /etc/timezone remain unchanged)
LANG is set to en_US.UTF-8
Sourcing shell script: /usr/local/bin/before-notebook.d/11-home.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/12-r.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/13-update-cran.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/30-code-server.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/50-rstudio.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/71-tensorboard.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/90-limits.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/95-misc.sh
Done running hooks in: /usr/local/bin/before-notebook.d
Executing the command: bash

It does not work with docker and mode = "cdi":

docker run --rm -ti glcr.b-data.ch/jupyterlab/cuda/r/verse bash
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="2024-11-05T08:00:31Z" level=error msg="failed to create link [libGLX_nvidia.so.535.183.01 /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0]: failed to check if link exists: unexpected link target: libGLX_mesa.so.0": unknown.

It does not work with podman, either:

podman run --rm --device nvidia.com/gpu=all -ti glcr.b-data.ch/jupyterlab/cuda/r/verse bash
Error: OCI runtime error: crun: {"msg":"error executing hook `/usr/bin/nvidia-cdi-hook` (exit code: 1)","level":"error","time":"2024-11-05T08:03:31.269259Z"}

Footnotes

  1. "default-runtime": "nvidia"

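For context, the mode referenced above is configured in /etc/nvidia-container-runtime/config.toml on a default install (illustrative excerpt):

[nvidia-container-runtime]
mode = "cdi"    # "auto" in the working case above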

benz0li commented Nov 5, 2024

@elezar Isn't libGLX_indirect.so.0 -> libGLX_mesa.so.0 in the image required [by default] so that it also works when there is no NVIDIA/CUDA device available?

I.e. it falls back to using Mesa instead. Is my understanding correct?

benz0li commented Nov 5, 2024

Workaround for v1.17.0: delete the following lines

    - --link
    - libGLX_nvidia.so.535.183.01::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0

from /etc/cdi/nvidia.yaml.
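
A scriptable version of the same edit, as a sketch (assumes GNU sed plus the indentation and driver version 535.183.01 shown above; keep a backup):

sudo cp /etc/cdi/nvidia.yaml /etc/cdi/nvidia.yaml.bak
sudo sed -i -z 's|    - --link\n    - libGLX_nvidia\.so\.535\.183\.01::/usr/lib/x86_64-linux-gnu/libGLX_indirect\.so\.0\n||' /etc/cdi/nvidia.yaml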

benz0li commented Nov 5, 2024

If the server doesn't send back a vendor name, or sends back a vendor that the client can't load, then it'll fall back to using libGLX_indirect.so.0, which should be a symlink to another vendor library.

NVIDIA/libglvnd#177 (comment)
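
To illustrate the two valid states of that link (values taken from earlier in this thread):

readlink /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
# in the image (Mesa-provided):  libGLX_mesa.so.0
# on the host (NVIDIA driver):   libGLX_nvidia.so.535.183.01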

elezar commented Nov 5, 2024

In the v1.17.0 update we changed the behaviour of our create-symlinks hook to be more robust, but our testing missed the case where such a link already exists and should be treated as a valid state. We are working on a fix and will publish a patch release as soon as it is ready.
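
For illustration only (the real hook is implemented in Go), the failing check behaves roughly like this hypothetical shell sketch:

target="libGLX_nvidia.so.535.183.01"
link="/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0"
current="$(readlink "$link" 2>/dev/null || true)"
if [ -z "$current" ]; then
  ln -s "$target" "$link"                         # no link yet: create it
elif [ "$current" != "$target" ]; then
  echo "unexpected link target: $current" >&2     # v1.17.0 fails here,
  exit 1                                          # even though an existing Mesa link is a valid state
fi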

benz0li commented Nov 5, 2024

@elezar No worries. Very solid work on the NVIDIA Container Toolkit.

Almost never had a problem with it.

Thank you!
