[v1.17.0, mode = "cdi"] OCI runtime create failed: 'failed to create link'/'failed to check if link exists: unexpected link target' #772

Open
benz0li opened this issue Nov 4, 2024 · 11 comments

benz0li commented Nov 4, 2024

Full error message:

failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="yyyy-mm-ddThh:mm:ss+hh:mm" level=error msg="failed to create link [libGLX_nvidia.so.535.183.01 /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0]: failed to check if link exists: unexpected link target: libGLX_mesa.so.0": unknown"

Further information:

ls -al /usr/lib/x86_64-linux-gnu
...
lrwxrwxrwx  1 root root        27 month  dd hh:mm libGLX_indirect.so.0 -> libGLX_nvidia.so.535.183.01
lrwxrwxrwx  1 root root        27 month  dd hh:mm libGLX_nvidia.so.0 -> libGLX_nvidia.so.535.183.01
-rwxr-xr-x  1 root root   1195552 month  dd hh:mm libGLX_nvidia.so.535.183.01
...

benz0li commented Nov 4, 2024

@elezar It works fine with v1.16.2, so I rolled back to that version.
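
For anyone else rolling back: on a Debian/Ubuntu host the downgrade looks roughly like this (a sketch; it assumes the 1.16.2-1 package revision is still available in the configured repository):

sudo apt-get install --allow-downgrades \
  nvidia-container-toolkit=1.16.2-1 \
  nvidia-container-toolkit-base=1.16.2-1 \
  libnvidia-container-tools=1.16.2-1 \
  libnvidia-container1=1.16.2-1
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml   # regenerate the CDI spec with the downgraded version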

elezar self-assigned this Nov 4, 2024

elezar commented Nov 4, 2024

@benz0li could you provide information on the image that you're running?

I would assume that a symlink libGLX_indirect.so.0 -> libGLX_mesa.so.0 already exists in the image in this case.

SurenNihalani commented

This happened to us as well. I think we should roll this back to prevent more people from hitting it.

benz0li commented Nov 5, 2024

@benz0li could you provide information on the image that you're running?

Image: glcr.b-data.ch/jupyterlab/cuda/r/verse:4.4.2

Source: https://github.com/b-data/jupyterlab-r-docker-stack/blob/4786ca20927a2d34a8d4dcdb2f2b4fdfcc55660e/verse/latest.Dockerfile

I would assume that a symlink libGLX_indirect.so.0 -> libGLX_mesa.so.0 already exists in the image in this case.

Yes. Due to the installation of libgl1-mesa-dev and thus libglx-mesa0.

Source: https://github.com/b-data/jupyterlab-r-docker-stack/blob/4786ca20927a2d34a8d4dcdb2f2b4fdfcc55660e/verse/latest.Dockerfile#L43
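
The pre-existing link can be confirmed directly from the image (a sketch using the tag above; the image's start-up banner is omitted here):

docker run --rm glcr.b-data.ch/jupyterlab/cuda/r/verse:4.4.2 \
  readlink /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
# expected: libGLX_mesa.so.0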

benz0li commented Nov 5, 2024

The difference between the old (v1.16.2) and new (v1.17.0) CDI specifications:

diff --git a/etc/cdi/nvidia-1.16.2.yaml.bak b/etc/cdi/nvidia.yaml
index 8eb6b1e..2b6936b 100644
--- a/etc/cdi/nvidia-1.16.2.yaml.bak
+++ b/etc/cdi/nvidia.yaml
@@ -13,8 +13,6 @@ containerEdits:
     - nvidia-cdi-hook
     - create-symlinks
     - --link
-    - libnvidia-allocator.so.535.183.01::/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
-    - --link
     - ../libnvidia-allocator.so.1::/usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so
     - --link
     - libnvidia-vulkan-producer.so.535.183.01::/usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so
@@ -22,11 +20,24 @@ containerEdits:
     - libglxserver_nvidia.so.535.183.01::/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
     hookName: createContainer
     path: /usr/bin/nvidia-cdi-hook
+  - args:
+    - nvidia-cdi-hook
+    - create-symlinks
+    - --link
+    - libGLX_nvidia.so.535.183.01::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
+    - --link
+    - libnvidia-opticalflow.so.1::/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so
+    - --link
+    - libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so
+    hookName: createContainer
+    path: /usr/bin/nvidia-cdi-hook
   - args:
     - nvidia-cdi-hook
     - update-ldcache
     - --folder
     - /usr/lib/x86_64-linux-gnu
+    - --folder
+    - /usr/lib/x86_64-linux-gnu/vdpau
     hookName: createContainer
     path: /usr/bin/nvidia-cdi-hook
   mounts:
@@ -332,6 +343,13 @@ containerEdits:
     - nosuid
     - nodev
     - bind
+  - containerPath: /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.183.01
+    hostPath: /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.183.01
+    options:
+    - ro
+    - nosuid
+    - nodev
+    - bind
   - containerPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
     hostPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
     options:

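For completeness: assuming both specs were generated with nvidia-ctk, the comparison above can be reproduced as follows (the .bak file is a manually kept copy of the v1.16.2 spec):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
diff -u /etc/cdi/nvidia-1.16.2.yaml.bak /etc/cdi/nvidia.yaml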

benz0li commented Nov 5, 2024

@elezar Using v1.17.0, it works with docker¹ and mode = "auto":

docker run --rm -ti glcr.b-data.ch/jupyterlab/cuda/r/verse bash
==========
== CUDA ==
==========

CUDA Version 12.6.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

=============
== JUPYTER ==
=============

Entered start.sh with args: bash
Running hooks in: /usr/local/bin/start-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/start-notebook.d/10-populate.sh
Done running hooks in: /usr/local/bin/start-notebook.d
Running hooks in: /usr/local/bin/before-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/before-notebook.d/10-env.sh
TZ is set to Etc/UTC (/etc/localtime and /etc/timezone remain unchanged)
LANG is set to en_US.UTF-8
Sourcing shell script: /usr/local/bin/before-notebook.d/11-home.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/12-r.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/13-update-cran.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/30-code-server.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/50-rstudio.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/71-tensorboard.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/90-limits.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/95-misc.sh
Done running hooks in: /usr/local/bin/before-notebook.d
Executing the command: bash

It does not work with docker and mode = "cdi":

docker run --rm -ti glcr.b-data.ch/jupyterlab/cuda/r/verse bash
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="2024-11-05T08:00:31Z" level=error msg="failed to create link [libGLX_nvidia.so.535.183.01 /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0]: failed to check if link exists: unexpected link target: libGLX_mesa.so.0": unknown.

It does not work with podman, either:

podman run --rm --device nvidia.com/gpu=all -ti glcr.b-data.ch/jupyterlab/cuda/r/verse bash
Error: OCI runtime error: crun: {"msg":"error executing hook `/usr/bin/nvidia-cdi-hook` (exit code: 1)","level":"error","time":"2024-11-05T08:03:31.269259Z"}

Footnotes

  1. "default-runtime": "nvidia"

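For context, the mode referenced above is configured in /etc/nvidia-container-runtime/config.toml on a default install (illustrative excerpt):

[nvidia-container-runtime]
mode = "cdi"    # "auto" in the working case above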

benz0li commented Nov 5, 2024

@elezar Isn't libGLX_indirect.so.0 -> libGLX_mesa.so.0 in the image required [by default] so that it also works when there is no NVIDIA/CUDA device available?

I.e. it falls back to using Mesa instead. Is my understanding correct?

benz0li commented Nov 5, 2024

Workaround for v1.17.0: delete the following lines

    - --link
    - libGLX_nvidia.so.535.183.01::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0

from /etc/cdi/nvidia.yaml.
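
A scriptable version of the same edit, as a sketch (assumes GNU sed plus the indentation and driver version 535.183.01 shown above; keep a backup):

sudo cp /etc/cdi/nvidia.yaml /etc/cdi/nvidia.yaml.bak
sudo sed -i -z 's|    - --link\n    - libGLX_nvidia\.so\.535\.183\.01::/usr/lib/x86_64-linux-gnu/libGLX_indirect\.so\.0\n||' /etc/cdi/nvidia.yaml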

benz0li commented Nov 5, 2024

If the server doesn't send back a vendor name, or sends back a vendor that the client can't load, then it'll fall back to using libGLX_indirect.so.0, which should be a symlink to another vendor library.

NVIDIA/libglvnd#177 (comment)
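
To illustrate the two valid states of that link (values taken from earlier in this thread):

readlink /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
# in the image (Mesa-provided):  libGLX_mesa.so.0
# on the host (NVIDIA driver):   libGLX_nvidia.so.535.183.01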

elezar commented Nov 5, 2024

In the v1.17.0 update we changed the behaviour of our create-symlinks hook to be more robust, but our testing missed the case where such a link already exists and should be treated as a valid state. We are working on a fix and will publish a patch release as soon as it is ready.
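
For illustration only (the real hook is implemented in Go), the failing check behaves roughly like this hypothetical shell sketch:

target="libGLX_nvidia.so.535.183.01"
link="/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0"
current="$(readlink "$link" 2>/dev/null || true)"
if [ -z "$current" ]; then
  ln -s "$target" "$link"                         # no link yet: create it
elif [ "$current" != "$target" ]; then
  echo "unexpected link target: $current" >&2     # v1.17.0 fails here,
  exit 1                                          # even though an existing Mesa link is a valid state
fi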

benz0li commented Nov 5, 2024

@elezar No worries. Very solid work on the NVIDIA Container Toolkit.

Almost never had a problem with it.

Thank you!
