
Getting nvidia-device-plugin container CrashLoopBackOff | version v0.14.0 | container runtime : containerd #406

Open
DineshwarSingh opened this issue May 24, 2023 · 33 comments

@DineshwarSingh

DineshwarSingh commented May 24, 2023

I am getting a CrashLoopBackOff error on the nvidia-device-plugin container. I am using k8s-device-plugin version v0.14.0 with containerd as the container runtime. The same setup works fine with dockerd as the runtime.

Pod ErrorLog:

I0524 08:28:03.907585       1 main.go:256] Retreiving plugins.
W0524 08:28:03.908010       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0524 08:28:03.908084       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0524 08:28:03.908113       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0524 08:28:03.908121       1 factory.go:115] Incompatible platform detected
E0524 08:28:03.908130       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0524 08:28:03.908136       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0524 08:28:03.908142       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0524 08:28:03.908149       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0524 08:28:03.915664       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

nvidia-smi output:

sh-4.2$ nvidia-smi
Wed May 24 08:57:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
@elezar
Member

elezar commented May 24, 2023

Hi @DineshwarSingh, could you comment on how the device plugin is configured / installed? Note that the device plugin also requires that the NVIDIA Container Toolkit be installed on the system and configured as a runtime in Containerd. Have you installed the toolkit and configured Containerd to use it as a runtime?
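
For reference, a rough sketch of the usual steps on a node where the toolkit is already installed (this assumes a standard containerd setup that reads /etc/containerd/config.toml):

# Register the nvidia runtime in the containerd config, then restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd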

@DineshwarSingh
Author

DineshwarSingh commented May 24, 2023

Hi @elezar,
Thanks for your response! We are using Amazon Linux 2 and the NVIDIA Container Toolkit is installed. Please see the details below:
sh-4.2$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.13.1

Regards,
Dinesh

@elezar
Member

elezar commented May 24, 2023

@DineshwarSingh how is the Device Plugin deployed?

What are the contents of your Containerd config.toml file?

@DineshwarSingh
Author

@elezar the device plugin is deployed using Helm, chart version v0.14.0.
The /etc/containerd/config.toml content is as below:
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

@elezar
Member

elezar commented May 24, 2023

Thanks for the information. I assume that containerd has been restarted since the config was changed?

Could you also provide the contents of your /etc/nvidia-container-runtime/config.toml file?

@DineshwarSingh
Author

DineshwarSingh commented May 24, 2023

Hi @elezar,
please find below the contents of the /etc/nvidia-container-runtime/config.toml file:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

#Specify the runtimes to consider. This list is processed in order and the PATH
#searched for matching executables unless the entry is an absolute path.
runtimes = [
"docker-runc",
"runc",
]

mode = "auto"

[nvidia-container-runtime.modes.csv]

mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

=========================================================================
Thanks,
Dinesh

@DineshwarSingh
Author

Hi @elezar,
Do you have any suggestions to fix this issue? Any help is highly appreciated!

Thanks,
Dinesh

@zachfi

zachfi commented Jun 2, 2023

I am also seeing this in my environment. My config and output look the same as above. Happy to provide more details.

@zachfi

zachfi commented Jun 2, 2023

The following packages are installed.

[root@k4 ~]# pacman -Q | grep nvidia
libnvidia-container 1.13.1-1
libnvidia-container-tools 1.13.1-1
nvidia 530.41.03-15
nvidia-container-runtime 3.13.1-1
nvidia-container-toolkit 1.13.1-1
nvidia-utils 530.41.03-1
opencl-nvidia 530.41.03-1

Containerd has been verified with the following.

[root@k4 ~]# ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.1.1-runtime-ubuntu20.04 n nvidia-smi
Fri Jun  2 17:01:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 970          Off| 00000000:01:00.0 Off |                  N/A |
| 11%   43C    P0               35W / 170W|      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The plugin was installed with the following manifest.

# https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
      annotations: {}
    spec:
      priorityClassName: system-node-critical
      securityContext: {}
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          imagePullPolicy: IfNotPresent
          name: nvidia-device-plugin-ctr
          env:
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ['nbody', '-gpu', '-benchmark']
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

I used the following information to get started: https://docs.k3s.io/advanced#nvidia-container-runtime-support

Let me know what other details are helpful.

@elezar
Member

elezar commented Jun 5, 2023

Note that using the ctr command ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.1.1-runtime-ubuntu20.04 n nvidia-smi is not sufficient in this case, since it uses a different mechanism for injecting the devices (adding the nvidia-container-runtime-hook directly) than the device plugin does.

Note that in an earlier comment you mention /etc/containerd/config.toml referencing the NVIDIA Container runtime, but the k3s documentation mentions:

Confirm that the nvidia container runtime has been found by k3s: grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

Could you confirm that the nvidia runtime is defined there too? If it is (but is not the default runtime), specifying the nvidia runtime class name for the device plugin pod too may address this.
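
A rough sketch of how that would look in the DaemonSet you posted (only the relevant part of the pod spec is shown; the nvidia RuntimeClass from your manifest must also be applied):

  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
    spec:
      # run the plugin's containers with the nvidia runtime defined in containerd
      runtimeClassName: nvidia
      priorityClassName: system-node-critical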

@zachfi

zachfi commented Jun 8, 2023

Good eyes. It looks like it is also detected here.

# grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

Here is the full config, but how do I know whether it is the default runtime or not?

version = 2
[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://registry.default.svc.cluster.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.default.svc.cluster.local:5000".tls]
  ca_file = "/etc/ssl/ca.pem"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

@zachfi

zachfi commented Jun 8, 2023

Note that the GPU benchmark pod also fails to schedule with runtimeClassName set, due to "Insufficient nvidia.com/gpu".

@elezar
Member

elezar commented Jun 8, 2023

@zachfi since there is no plugins."io.containerd.grpc.v1.cri".containerd.default_runtime_name = "nvidia" entry, the nvidia runtime is not the default runtime. As such, you would also need to launch the device plugin specifying runtimeClassName: nvidia. This ensures that the containers of the device plugin are started using the nvidia-container-runtime, which injects the required devices. (Please also confirm that NVIDIA_VISIBLE_DEVICES=all is set in this container too.)

The issue the benchmark container is seeing arises because the device plugin is not reporting any nvidia.com/gpu resources to the kubelet, and as such the pod cannot be scheduled.

@zachfi

zachfi commented Jun 8, 2023

NVIDIA_VISIBLE_DEVICES was set in the benchmark pod, but not in the daemonset pods. I've added it and it appears to fail in the same way.

❯ k -n kube-system logs nvidia-device-plugin-qxs5n
I0608 13:59:52.537232       1 main.go:154] Starting FS watcher.
I0608 13:59:52.537319       1 main.go:161] Starting OS watcher.
I0608 13:59:52.537643       1 main.go:176] Starting Plugins.
I0608 13:59:52.537659       1 main.go:234] Loading configuration.
I0608 13:59:52.537778       1 main.go:242] Updating config with default resource matching patterns.
I0608 13:59:52.537990       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0608 13:59:52.538003       1 main.go:256] Retreiving plugins.
W0608 13:59:52.538273       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0608 13:59:52.538319       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0608 13:59:52.538344       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0608 13:59:52.538352       1 factory.go:115] Incompatible platform detected
E0608 13:59:52.538361       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0608 13:59:52.538368       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0608 13:59:52.538373       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0608 13:59:52.538379       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0608 13:59:52.538500       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

@elezar
Member

elezar commented Jun 8, 2023

Are you ALSO specifying the nvidia runtime class for the device plugin containers?

@zachfi

zachfi commented Jun 9, 2023

Amazing! This was the missing piece. Once I added this, the plugin deployed and registered the GPU, and then the benchmark was able to run. Thank you for the assist, @elezar. The daemonset example I had pulled didn't have this setting, so perhaps it is also missing from the Helm chart, and adding it would resolve the issue for @DineshwarSingh as well.

@SergeSpinoza

Hi @zachfi! Which parameter did you end up adding, and where? I have exactly the same problem (I installed via Helm).

@elezar
Member

elezar commented Jun 9, 2023

@SergeSpinoza the device plugin needs to specify runtimeClassName: nvidia when being deployed in cases where the nvidia runtime is not the default.

The Helm daemonset template does define:

      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}

So deploying with --set runtimeClassName=nvidia should have the desired effect.
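
For example, assuming the chart repository was added as nvdp and the release/namespace names from the deployment instructions are used, something like:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set runtimeClassName=nvidia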

@SergeSpinoza

@elezar Thanks. After I add runtimeClassName: nvidia I get another error: Error creating: pods "nvdp-nvidia-device-plugin-" is forbidden: pod rejected: RuntimeClass "nvidia" not found

I have no RuntimeClasses:

# kubectl get runtimeclasses.node.k8s.io -A 
No resources found

I don't really understand what I did wrong. I followed these instructions: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

  1. I installed the CUDA driver on the node with the GPU.
  2. I installed nvidia-container-toolkit.
  3. I added the following to /etc/containerd/config.toml:
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            systemdCgroup = true
            binaryName = "/usr/bin/nvidia-container-runtime"

and restarted containerd

  4. When executing the command:
ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
    cuda-11.6.2-base-ubuntu20.04 nvidia-smi

I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A2           Off  | 00000000:AF:00.0 Off |                    0 |
|  0%   38C    P0    21W /  60W |   2308MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

What did I miss?

@zachfi

zachfi commented Jun 10, 2023

One thing I noticed today is that if you run nvidia-smi from the pod it won't print the processes, but if you run it from the host it will. I'm not sure of the reason.

What I notice in that output is that it does NOT say "No processes found", and I think you have a process running on the GPU currently.

@zachfi

zachfi commented Jun 10, 2023

Also, one of the versions of the daemonset I tried was generated from the Helm template, so presumably the runtimeClassName was not set by default, but I don't use Helm. I would think it would make a good default value if it's required to operate properly, but perhaps there are reasons not to enable it by default. The manifest I posted above has the runtimeClassName defined.

@simsicon

@SergeSpinoza I have the same experience. I think it would be better if there were a quick checklist so we can see what we are missing. @elezar

@simsicon

My environment and setup:

  1. k3s master node without a GPU, one agent node with two NVIDIA RTX 4090s, running containerd.
  2. $ nvidia-smi on the agent node:
Mon Jun 12 21:52:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:4B:00.0 Off |                  Off |
| 32%   32C    P0               51W / 450W|      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         Off| 00000000:B1:00.0 Off |                  Off |
| 30%   31C    P0               46W / 450W|      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  3. containerd config (GPU-related part): /var/lib/rancher/k3s/agent/etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

  4. How k8s-device-plugin is deployed: basic features manifest, following this link: https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes

  5. Problem:
I0612 14:04:07.967143       1 main.go:256] Retreiving plugins.
W0612 14:04:07.968294       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0612 14:04:07.968400       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0612 14:04:07.968457       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0612 14:04:07.968505       1 factory.go:115] Incompatible platform detected
E0612 14:04:07.968520       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0612 14:04:07.968530       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0612 14:04:07.968540       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0612 14:04:07.968550       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0612 14:04:07.968567       1 main.go:287] No devices found. Waiting indefinitely.
  6. ctr can boot nvidia containers.

I have to admit that I don't fully understand how k8s-device-plugin works, and I am very thankful for your hard work. I really need to figure out the root cause in my case; any ideas are appreciated.

@SergeSpinoza

Manually creating the RuntimeClass in the Kubernetes cluster helped me.

Manifest:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 

Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/

@simsicon

Manually creating the RuntimeClass in the Kubernetes cluster helped me.

Manifest:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 

Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/

I have this RuntimeClass, but it has no effect for me. Could you please share your k8s-device-plugin daemonset manifest?
I found that the daemonset manifest shared by @zachfi works for me, but I don't know why.

Thanks

@SergeSpinoza

@simsicon

My daemonset:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvdp-nvidia-device-plugin
  namespace: nvidia-device-plugin
  labels:
    app.kubernetes.io/instance: nvdp
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/version: 0.14.0
    helm.sh/chart: nvidia-device-plugin-0.14.0
  annotations:
    deprecated.daemonset.template.generation: '1'
    meta.helm.sh/release-name: nvdp
    meta.helm.sh/release-namespace: nvidia-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: nvdp
      app.kubernetes.io/name: nvidia-device-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: nvdp
        app.kubernetes.io/name: nvidia-device-plugin
    spec:
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
            type: ''
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          env:
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/gpu-node
                    operator: In
                    values:
                      - 'true'
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/gpu-node
          operator: Equal
          value: 'true'
          effect: NoSchedule
      priorityClassName: system-node-critical
      runtimeClassName: nvidia
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0

@matusnovak

matusnovak commented Sep 23, 2023

Any update on this issue?

I am having the same problem.

I have followed the documentation here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html to update the containerd config. I have run sudo nvidia-ctk runtime configure --runtime=containerd. Here are the contents:

$ cat /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

I can confirm that when k3s starts it recognizes the nvidia runtime:

$ grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

I have deployed the nvidia device plugin daemonset via Helm based on these instructions: https://github.com/NVIDIA/k8s-device-plugin/#deployment-via-helm. It has created a daemonset; the describe output is below:

$ kubectl --namespace nvidia-device-plugin describe daemonset
Name:           nvdp-nvidia-device-plugin
Selector:       app.kubernetes.io/instance=nvdp,app.kubernetes.io/name=nvidia-device-plugin
Node-Selector:  <none>
Labels:         app.kubernetes.io/instance=nvdp
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=nvidia-device-plugin
                app.kubernetes.io/version=0.14.1
                helm.sh/chart=nvidia-device-plugin-0.14.1
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: nvdp
                meta.helm.sh/release-namespace: nvidia-device-plugin
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/instance=nvdp
           app.kubernetes.io/name=nvidia-device-plugin
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.14.1
    Port:       <none>
    Host Port:  <none>
    Environment:
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:               HostPath (bare host directory volume)
    Path:               /var/lib/kubelet/device-plugins
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  59m   daemonset-controller  Created pod: nvdp-nvidia-device-plugin-v9qsp

But the pod is crashing:

$ kubectl --namespace nvidia-device-plugin logs nvdp-nvidia-device-plugin-v9qsp
I0923 14:40:25.663427       1 main.go:154] Starting FS watcher.
I0923 14:40:25.663467       1 main.go:161] Starting OS watcher.
I0923 14:40:25.663590       1 main.go:176] Starting Plugins.
I0923 14:40:25.663599       1 main.go:234] Loading configuration.
I0923 14:40:25.663661       1 main.go:242] Updating config with default resource matching patterns.
I0923 14:40:25.663779       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0923 14:40:25.663788       1 main.go:256] Retreiving plugins.
W0923 14:40:25.663933       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0923 14:40:25.663963       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0923 14:40:25.663979       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0923 14:40:25.663984       1 factory.go:115] Incompatible platform detected
E0923 14:40:25.663990       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0923 14:40:25.663995       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0923 14:40:25.664000       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0923 14:40:25.664005       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0923 14:40:25.682592       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

Running nvidia-smi on the host shows:

$ nvidia-smi
Sat Sep 23 14:42:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T400 4GB                Off | 00000000:01:00.0 Off |                  N/A |
| 38%   36C    P8              N/A /  31W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I am able to run nvidia-smi from a container via ctr, example below:

$ sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 cuda-12-base nvidia-smi
Sat Sep 23 14:44:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T400 4GB                Off | 00000000:01:00.0 Off |                  N/A |
| 38%   37C    P8              N/A /  31W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

What am I doing wrong?

Edit:
I have also tried driver versions 515 and 470, and I have the same issue.

@madhureddy143

(Quoting @simsicon's environment and setup from the comment above.)

@simsicon
Most of your setup is correct.

A few suggestions:

  1. If config.toml is modified directly, it will be overwritten when the k3s service restarts; to avoid this, use a config.toml.tmpl file.
    Sample URL: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10 (add default_runtime_name = "nvidia").
  2. As the device plugin is a daemonset, the logs you might be seeing are from the master node, where no GPU is present; check the pod on the agent node (see the sketch after this list).
  3. Check the containerd configuration on the agent node to confirm it points to nvidia-container-runtime.
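
For point 2, a quick way to pick out the pod running on the GPU agent node (the label assumes the static manifest's name=nvidia-device-plugin-ds selector; adjust to your deployment):

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
kubectl -n kube-system logs <name-of-the-pod-on-the-gpu-agent-node>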

@matusnovak

matusnovak commented Oct 4, 2023

I believe I have figured it out. At least in my case.

If config.toml is modified directly, it will be overwritten when the k3s service restarts; to avoid this, use a config.toml.tmpl file.
Sample URL: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10 (add default_runtime_name = "nvidia").

Most of the tutorials out there were suggesting a k3d template instead of the k3s template. I thought that was wrong, and I assumed that the k3s service would "detect" the nvidia container runtime. It does, but it does not make it the default one.

This template seems to work: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10

But a simpler solution, in case you don't want to force every pod to use the nvidia runtime, is to add runtimeClassName: nvidia to https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml, and after that everything starts to work just fine.

@xinmans

xinmans commented Oct 15, 2023

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # add runtimeClassName: nvidia to fix "No devices found. Waiting indefinitely." #406
      runtimeClassName: nvidia

@madeeldevops

madeeldevops commented Nov 13, 2023

I have used the k8s-device-plugin daemonset v0.14.0, and it is working fine with the version below.
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
Also use runc.v2, and don't forget to verify that you have nvidia-container-runtime on your system (usually at /usr/bin).
Both of these should be referenced in your /var/lib/rancher/k3s/agent/etc/containerd/config.toml file. Finally, restart k3s with systemctl and check the log of your daemonset pod. It should be registered now.
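
A minimal sketch of those checks (paths are the usual defaults; adjust if your install differs):

ls -l /usr/bin/nvidia-container-runtime
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
sudo systemctl restart k3s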
The daemonset pod log should then look like this:

k logs nvidia-device-plugin-daemonset-c8snh -n kube-system
I1113 09:15:44.110759       1 main.go:154] Starting FS watcher.
I1113 09:15:44.110825       1 main.go:161] Starting OS watcher.
I1113 09:15:44.111097       1 main.go:176] Starting Plugins.
I1113 09:15:44.111107       1 main.go:234] Loading configuration.
I1113 09:15:44.111224       1 main.go:242] Updating config with default resource matching patterns.
I1113 09:15:44.111357       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
   "sharing": {
    "timeSlicing": {}
  }
}
I1113 09:15:44.111365       1 main.go:256] Retreiving plugins.
I1113 09:15:44.112044       1 factory.go:107] Detected NVML platform: found NVML library
I1113 09:15:44.112072       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1113 09:15:44.127433       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I1113 09:15:44.127684       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1113 09:15:44.128750       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

My host runs Ubuntu 22.04:

NAME         STATUS   ROLES                  AGE    VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
devops-s2h   Ready    control-plane,master   138m   v1.27.7+k3s2   172.16.11.243   <none>        Ubuntu 22.04.3 LTS   6.2.0-36-generic   containerd://1.7.7-k3s1.27

@duhow

duhow commented Nov 29, 2023

The fix for me in k3s:

  1. Create the RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 
EOF
  2. Patch nvidia-device-plugin to use this RuntimeClass (modify according to your installation):
kubectl patch daemonset -n kube-system nvidia-device-plugin --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/runtimeClassName", "value": "nvidia"}]'

@vsndev3

vsndev3 commented Dec 12, 2023

For k3s, I fixed it like this (snippet for the impatient). More details: https://docs.k3s.io/advanced#configuring-containerd

sudo cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

Edit /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl to add 'default_runtime_name = "nvidia"' as below:

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"

sudo systemctl restart k3s

Check:
sudo kubectl logs gpu-feature-discovery-5kp4n | grep NVML
I1212 17:34:46.603358       1 factory.go:48] Detected NVML platform: found NVML library
I1212 17:34:46.603383       1 factory.go:64] Using NVML manager

sudo kubectl describe nodes | grep nvidia.com/gpu.count
                    nvidia.com/gpu.count=1

@klueska klueska added the triage label Jan 26, 2024
@ArangoGutierrez ArangoGutierrez added needs-triage issue or PR has not been assigned a priority-px label and removed triage labels Feb 22, 2024