
Getting nvidia-device-plugin container CrashLoopBackOff | version v0.14.0 | container runtime : containerd #406

Open
DineshwarSingh opened this issue May 24, 2023 · 33 comments

@DineshwarSingh

DineshwarSingh commented May 24, 2023

I am getting a CrashLoopBackOff error on the nvidia-device-plugin container. I am using k8s-device-plugin version v0.14.0 with containerd as the container runtime. The same setup works fine with dockerd as the runtime.

Pod ErrorLog:

I0524 08:28:03.907585       1 main.go:256] Retreiving plugins.
W0524 08:28:03.908010       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0524 08:28:03.908084       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0524 08:28:03.908113       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0524 08:28:03.908121       1 factory.go:115] Incompatible platform detected
E0524 08:28:03.908130       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0524 08:28:03.908136       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0524 08:28:03.908142       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0524 08:28:03.908149       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0524 08:28:03.915664       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

nvidia-smi output:

sh-4.2$ nvidia-smi
Wed May 24 08:57:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
@elezar
Member

elezar commented May 24, 2023

Hi @DineshwarSingh, could you comment on how the device plugin is configured / installed? Note that the device plugin also requires that the NVIDIA Container Toolkit be installed on the system and configured as a runtime in Containerd. Have you installed the toolkit and configured Containerd to use it as a runtime?
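
For reference, a rough sketch of the usual steps on a node where the toolkit is already installed (this assumes a standard containerd setup that reads /etc/containerd/config.toml):

# Register the nvidia runtime in the containerd config, then restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd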

@DineshwarSingh
Author

DineshwarSingh commented May 24, 2023

Hi @elezar,
Thanks for your response! We are using Amazon Linux 2 and the NVIDIA Container Toolkit is installed. Please see the details below:
sh-4.2$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.13.1

Regards,
Dinesh

@elezar
Member

elezar commented May 24, 2023

@DineshwarSingh how is the Device Plugin deployed?

What are the contents of your Containerd config.toml file?

@DineshwarSingh
Author

@elezar the device plugin is deployed using Helm, chart version v0.14.0.
The /etc/containerd/config.toml content is as below:
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

@elezar
Member

elezar commented May 24, 2023

Thanks for the information. I assume that containerd has been restarted since the config was changed?

Could you also provide the contents of your /etc/nvidia-container-runtime/config.toml file?

@DineshwarSingh
Author

DineshwarSingh commented May 24, 2023

Hi @elezar,
please find below the contents of the /etc/nvidia-container-runtime/config.toml file:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

#Specify the runtimes to consider. This list is processed in order and the PATH
#searched for matching executables unless the entry is an absolute path.
runtimes = [
"docker-runc",
"runc",
]

mode = "auto"

[nvidia-container-runtime.modes.csv]

mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

=========================================================================
Thanks,
Dinesh

@DineshwarSingh
Author

Hi @elezar,
Do you have any suggestions to fix this issue? Any help is highly appreciated!

Thanks,
Dinesh

@zachfi

zachfi commented Jun 2, 2023

I am also seeing this in my environment. My config and output look the same as above. Happy to provide more details.

@zachfi

zachfi commented Jun 2, 2023

The following packages are installed.

[root@k4 ~]# pacman -Q | grep nvidia
libnvidia-container 1.13.1-1
libnvidia-container-tools 1.13.1-1
nvidia 530.41.03-15
nvidia-container-runtime 3.13.1-1
nvidia-container-toolkit 1.13.1-1
nvidia-utils 530.41.03-1
opencl-nvidia 530.41.03-1

Containerd has been verified with the following.

[root@k4 ~]# ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.1.1-runtime-ubuntu20.04 n nvidia-smi
Fri Jun  2 17:01:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 970          Off| 00000000:01:00.0 Off |                  N/A |
| 11%   43C    P0               35W / 170W|      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The plugin was installed with the following manifest.

# https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
      annotations: {}
    spec:
      priorityClassName: system-node-critical
      securityContext: {}
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          imagePullPolicy: IfNotPresent
          name: nvidia-device-plugin-ctr
          env:
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ['nbody', '-gpu', '-benchmark']
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

I used the following information to get started: https://docs.k3s.io/advanced#nvidia-container-runtime-support

Let me know what other details are helpful.

@elezar
Member

elezar commented Jun 5, 2023

Note that using the ctr command ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.1.1-runtime-ubuntu20.04 n nvidia-smi is not sufficient in this case, since it uses a different mechanism for injecting the devices (adding the nvidia-container-runtime-hook directly) than the device plugin does.

Note that in an earlier comment you mention /etc/containerd/config.toml referencing the NVIDIA Container runtime, but the k3s documentation mentions:

Confirm that the nvidia container runtime has been found by k3s: grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

Could you confirm that the nvidia runtime is defined there too? If it is (but is not the default runtime), specifying the nvidia runtime class name for the device plugin pod too may address this.
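
A rough sketch of how that would look in the DaemonSet you posted (only the relevant part of the pod spec is shown; the nvidia RuntimeClass from your manifest must also be applied):

  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
    spec:
      # run the plugin's containers with the nvidia runtime defined in containerd
      runtimeClassName: nvidia
      priorityClassName: system-node-critical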

@zachfi

zachfi commented Jun 8, 2023

Good eyes. It looks like it is also detected here.

# grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

Here is the full config, but how do I know whether it is the default runtime or not?

version = 2
[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://registry.default.svc.cluster.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.default.svc.cluster.local:5000".tls]
  ca_file = "/etc/ssl/ca.pem"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

@zachfi

zachfi commented Jun 8, 2023

Note that the GPU benchmark pod also fails to schedule with runtimeClassName set, due to "Insufficient nvidia.com/gpu".

@elezar
Member

elezar commented Jun 8, 2023

@zachfi since there is no plugins."io.containerd.grpc.v1.cri".containerd.default_runtime_name = "nvidia" entry, the nvidia runtime is not the default runtime. As such, you would also need to launch the device plugin specifying runtimeClassName: nvidia. This ensures that the containers of the device plugin are started using the nvidia-container-runtime, which injects the required devices. (Please also confirm that NVIDIA_VISIBLE_DEVICES=all is set in this container too.)

The issue the benchmark container is seeing arises because the device plugin is not reporting any nvidia.com/gpu resources to the kubelet, and as such the pod cannot be scheduled.

@zachfi

zachfi commented Jun 8, 2023

NVIDIA_VISIBLE_DEVICES was set in the benchmark pod, but not in the daemonset pods. I've added it and it appears to fail in the same way.

❯ k -n kube-system logs nvidia-device-plugin-qxs5n
I0608 13:59:52.537232       1 main.go:154] Starting FS watcher.
I0608 13:59:52.537319       1 main.go:161] Starting OS watcher.
I0608 13:59:52.537643       1 main.go:176] Starting Plugins.
I0608 13:59:52.537659       1 main.go:234] Loading configuration.
I0608 13:59:52.537778       1 main.go:242] Updating config with default resource matching patterns.
I0608 13:59:52.537990       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0608 13:59:52.538003       1 main.go:256] Retreiving plugins.
W0608 13:59:52.538273       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0608 13:59:52.538319       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0608 13:59:52.538344       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0608 13:59:52.538352       1 factory.go:115] Incompatible platform detected
E0608 13:59:52.538361       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0608 13:59:52.538368       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0608 13:59:52.538373       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0608 13:59:52.538379       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0608 13:59:52.538500       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

@elezar
Member

elezar commented Jun 8, 2023

Are you ALSO specifying the nvidia runtime class for the device plugin containers?

@zachfi

zachfi commented Jun 9, 2023

Amazing! This was the missing piece. Once I added this, the plugin deployed and registered the GPU, and then the benchmark was able to run. Thank you for the assist, @elezar. The daemonset example I had pulled didn't have this setting, so perhaps it is also missing from the Helm chart, and adding it would resolve the issue for @DineshwarSingh as well.

@SergeSpinoza

Hi @zachfi! Which parameter did you end up adding, and where? I have exactly the same problem (I installed via Helm).

@elezar
Member

elezar commented Jun 9, 2023

@SergeSpinoza the device plugin needs to specify runtimeClassName: nvidia when being deployed in cases where the nvidia runtime is not the default.

The Helm daemonset template does define:

      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}

So deploying with --set runtimeClassName=nvidia should have the desired effect.
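
For example, assuming the chart repository was added as nvdp and the release/namespace names from the deployment instructions are used, something like:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set runtimeClassName=nvidia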

@SergeSpinoza

@elezar Thanks. After I add runtimeClassName: nvidia I get another error: Error creating: pods "nvdp-nvidia-device-plugin-" is forbidden: pod rejected: RuntimeClass "nvidia" not found

I have no RuntimeClasses:

# kubectl get runtimeclasses.node.k8s.io -A 
No resources found

I don't really understand what I did wrong. I followed these instructions: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

  1. I installed the CUDA driver on the node with the GPU.
  2. I installed nvidia-container-toolkit.
  3. I added the following to /etc/containerd/config.toml:
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            systemdCgroup = true
            binaryName = "/usr/bin/nvidia-container-runtime"

and restarted containerd

  4. When executing the command:
ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
    cuda-11.6.2-base-ubuntu20.04 nvidia-smi

I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A2           Off  | 00000000:AF:00.0 Off |                    0 |
|  0%   38C    P0    21W /  60W |   2308MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

What did I miss?

@zachfi

zachfi commented Jun 10, 2023

One thing I noticed today is that if you run nvidia-smi from the pod it won't print the processes, but if you run it from the host it will. I'm not sure of the reason.

What I notice in that output is that it does NOT say "No processes found", and I think you have a process running on the GPU currently.

@zachfi

zachfi commented Jun 10, 2023

Also, one of the versions of the daemonset I tried was generated from the Helm template, so presumably the runtimeClassName was not set by default, but I don't use Helm. I would think it would make a good default value if it's required to operate properly, but perhaps there are reasons not to enable it by default. The manifest I posted above has the runtimeClassName defined.

@simsicon

@SergeSpinoza I have the same experience. I think it would be better if there were a quick checklist so we can see what we are missing. @elezar

@simsicon

My environment and setup:

  1. k3s master node without a GPU, one agent node with two NVIDIA RTX 4090s, running containerd.
  2. $ nvidia-smi on the agent node:
Mon Jun 12 21:52:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:4B:00.0 Off |                  Off |
| 32%   32C    P0               51W / 450W|      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         Off| 00000000:B1:00.0 Off |                  Off |
| 30%   31C    P0               46W / 450W|      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  3. containerd config (GPU-related part): /var/lib/rancher/k3s/agent/etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

  4. How k8s-device-plugin is deployed: basic features manifest, following this link: https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes

  5. Problem:
I0612 14:04:07.967143       1 main.go:256] Retreiving plugins.
W0612 14:04:07.968294       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0612 14:04:07.968400       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0612 14:04:07.968457       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0612 14:04:07.968505       1 factory.go:115] Incompatible platform detected
E0612 14:04:07.968520       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0612 14:04:07.968530       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0612 14:04:07.968540       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0612 14:04:07.968550       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0612 14:04:07.968567       1 main.go:287] No devices found. Waiting indefinitely.
  6. ctr can boot nvidia containers.

I have to admit that I don't fully understand how k8s-device-plugin works, and I am very thankful for your hard work. I really need to figure out the root cause in my case; any ideas are appreciated.

@SergeSpinoza

Manually creating the RuntimeClass in the Kubernetes cluster helped me.

Manifest:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 

Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/

@simsicon

Manually creating the RuntimeClass in the Kubernetes cluster helped me.

Manifest:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 

Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/

I have this RuntimeClass, but it has no effect for me. Could you please share your k8s-device-plugin daemonset manifest?
I found that the daemonset manifest shared by @zachfi works for me, but I don't know why.

Thanks

@SergeSpinoza

@simsicon

My daemonset:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvdp-nvidia-device-plugin
  namespace: nvidia-device-plugin
  labels:
    app.kubernetes.io/instance: nvdp
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/version: 0.14.0
    helm.sh/chart: nvidia-device-plugin-0.14.0
  annotations:
    deprecated.daemonset.template.generation: '1'
    meta.helm.sh/release-name: nvdp
    meta.helm.sh/release-namespace: nvidia-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: nvdp
      app.kubernetes.io/name: nvidia-device-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: nvdp
        app.kubernetes.io/name: nvidia-device-plugin
    spec:
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
            type: ''
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          env:
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/gpu-node
                    operator: In
                    values:
                      - 'true'
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/gpu-node
          operator: Equal
          value: 'true'
          effect: NoSchedule
      priorityClassName: system-node-critical
      runtimeClassName: nvidia
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0

@matusnovak

matusnovak commented Sep 23, 2023

Any update on this issue?

I am having the same problem.

I have followed the documentation here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html to update the containerd config. I have run sudo nvidia-ctk runtime configure --runtime=containerd. Here are the contents:

$ cat /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

I can confirm that when k3s starts it recognizes the nvidia runtime:

$ grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

I have deployed the nvidia device plugin daemonset via Helm based on these instructions: https://github.com/NVIDIA/k8s-device-plugin/#deployment-via-helm. It has created a daemonset; the describe output is below:

$ kubectl --namespace nvidia-device-plugin describe daemonset
Name:           nvdp-nvidia-device-plugin
Selector:       app.kubernetes.io/instance=nvdp,app.kubernetes.io/name=nvidia-device-plugin
Node-Selector:  <none>
Labels:         app.kubernetes.io/instance=nvdp
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=nvidia-device-plugin
                app.kubernetes.io/version=0.14.1
                helm.sh/chart=nvidia-device-plugin-0.14.1
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: nvdp
                meta.helm.sh/release-namespace: nvidia-device-plugin
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/instance=nvdp
           app.kubernetes.io/name=nvidia-device-plugin
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.14.1
    Port:       <none>
    Host Port:  <none>
    Environment:
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:               HostPath (bare host directory volume)
    Path:               /var/lib/kubelet/device-plugins
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  59m   daemonset-controller  Created pod: nvdp-nvidia-device-plugin-v9qsp

But the pod is crashing:

$ kubectl --namespace nvidia-device-plugin logs nvdp-nvidia-device-plugin-v9qsp
I0923 14:40:25.663427       1 main.go:154] Starting FS watcher.
I0923 14:40:25.663467       1 main.go:161] Starting OS watcher.
I0923 14:40:25.663590       1 main.go:176] Starting Plugins.
I0923 14:40:25.663599       1 main.go:234] Loading configuration.
I0923 14:40:25.663661       1 main.go:242] Updating config with default resource matching patterns.
I0923 14:40:25.663779       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0923 14:40:25.663788       1 main.go:256] Retreiving plugins.
W0923 14:40:25.663933       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0923 14:40:25.663963       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0923 14:40:25.663979       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0923 14:40:25.663984       1 factory.go:115] Incompatible platform detected
E0923 14:40:25.663990       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0923 14:40:25.663995       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0923 14:40:25.664000       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0923 14:40:25.664005       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0923 14:40:25.682592       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

Running nvidia-smi on the host shows:

$ nvidia-smi
Sat Sep 23 14:42:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T400 4GB                Off | 00000000:01:00.0 Off |                  N/A |
| 38%   36C    P8              N/A /  31W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I am able to run nvidia-smi from a container via ctr, example below:

$ sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 cuda-12-base nvidia-smi
Sat Sep 23 14:44:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T400 4GB                Off | 00000000:01:00.0 Off |                  N/A |
| 38%   37C    P8              N/A /  31W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

What am I doing wrong?

Edit:
I have also tried driver versions 515 and 470, and I have the same issue.

@madhureddy143

(Quoting @simsicon's environment and setup from the comment above.)

@simsicon
Most of your setup is correct.

A few suggestions:

  1. If config.toml is modified directly, it will be overwritten when the k3s service restarts; to avoid this, use a config.toml.tmpl file.
    Sample URL: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10 (add default_runtime_name = "nvidia").
  2. As the device plugin is a daemonset, the logs you might be seeing are from the master node, where no GPU is present; check the pod on the agent node (see the sketch after this list).
  3. Check the containerd configuration on the agent node to confirm it points to nvidia-container-runtime.
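
For point 2, a quick way to pick out the pod running on the GPU agent node (the label assumes the static manifest's name=nvidia-device-plugin-ds selector; adjust to your deployment):

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
kubectl -n kube-system logs <name-of-the-pod-on-the-gpu-agent-node>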

@matusnovak

matusnovak commented Oct 4, 2023

I believe I have figured it out. At least in my case.

If config.toml is modified directly, it will be overwritten when the k3s service restarts; to avoid this, use a config.toml.tmpl file.
Sample URL: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10 (add default_runtime_name = "nvidia").

Most of the tutorials out there were suggesting a k3d template instead of the k3s template. I thought that was wrong, and I assumed that the k3s service would "detect" the nvidia container runtime. It does, but it does not make it the default one.

This template seems to work: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10

But a simpler solution, in case you don't want to force every pod to use the nvidia runtime, is to add runtimeClassName: nvidia to https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml, and after that everything starts to work just fine.

@xinmans

xinmans commented Oct 15, 2023

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # add runtimeClassName: nvidia to fix "No devices found. Waiting indefinitely." #406
      runtimeClassName: nvidia

@madeeldevops

madeeldevops commented Nov 13, 2023

I have used the k8s-device-plugin daemonset v0.14.0, and it is working fine with the version below.
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
Also use runc.v2, and don't forget to verify that you have nvidia-container-runtime on your system (usually at /usr/bin).
Both of these should be referenced in your /var/lib/rancher/k3s/agent/etc/containerd/config.toml file. Finally, restart k3s with systemctl and check the log of your daemonset pod. It should be registered now.
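
A minimal sketch of those checks (paths are the usual defaults; adjust if your install differs):

ls -l /usr/bin/nvidia-container-runtime
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
sudo systemctl restart k3s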
The daemonset pod log should then look like this:

k logs nvidia-device-plugin-daemonset-c8snh -n kube-system
I1113 09:15:44.110759       1 main.go:154] Starting FS watcher.
I1113 09:15:44.110825       1 main.go:161] Starting OS watcher.
I1113 09:15:44.111097       1 main.go:176] Starting Plugins.
I1113 09:15:44.111107       1 main.go:234] Loading configuration.
I1113 09:15:44.111224       1 main.go:242] Updating config with default resource matching patterns.
I1113 09:15:44.111357       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
   "sharing": {
    "timeSlicing": {}
  }
}
I1113 09:15:44.111365       1 main.go:256] Retreiving plugins.
I1113 09:15:44.112044       1 factory.go:107] Detected NVML platform: found NVML library
I1113 09:15:44.112072       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1113 09:15:44.127433       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I1113 09:15:44.127684       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1113 09:15:44.128750       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

My host runs Ubuntu 22.04:

NAME         STATUS   ROLES                  AGE    VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
devops-s2h   Ready    control-plane,master   138m   v1.27.7+k3s2   172.16.11.243   <none>        Ubuntu 22.04.3 LTS   6.2.0-36-generic   containerd://1.7.7-k3s1.27

@duhow

duhow commented Nov 29, 2023

The fix for me in k3s:

  1. Create the RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia 
handler: nvidia 
EOF
  2. Patch nvidia-device-plugin to use this RuntimeClass (modify according to your installation):
kubectl patch daemonset -n kube-system nvidia-device-plugin --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/runtimeClassName", "value": "nvidia"}]'

@vsndev3

vsndev3 commented Dec 12, 2023

For k3s, I fixed it like this (snippet for the impatient). More details: https://docs.k3s.io/advanced#configuring-containerd

sudo cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

Edit /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl to add 'default_runtime_name = "nvidia"' as below:

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"

sudo systemctl restart k3s

Check:
sudo kubectl logs gpu-feature-discovery-5kp4n | grep NVML
I1212 17:34:46.603358       1 factory.go:48] Detected NVML platform: found NVML library
I1212 17:34:46.603383       1 factory.go:64] Using NVML manager

sudo kubectl describe nodes | grep nvidia.com/gpu.count
                    nvidia.com/gpu.count=1

@klueska klueska added the triage label Jan 26, 2024
@ArangoGutierrez ArangoGutierrez added needs-triage issue or PR has not been assigned a priority-px label and removed triage labels Feb 22, 2024