1. Issue or feature description

nvidia-device-plugin sits using 100% CPU when a new Pod with a GPU requirement is scheduled. The Pod is stuck as 'Pending', with no further failure or error from either the container or Kubernetes itself.

Commands on the host such as nvidia-smi work prior to scheduling a Pod with a GPU requirement. Once this behaviour is triggered, I'm no longer able to run such commands until the host is rebooted.

2. Steps to reproduce the issue
The Kubernetes cluster is K3s, version v1.22.9+k3s1.

The cluster has seven nodes: three server nodes, three worker nodes, and a fourth worker with a pair of GPUs (A100s). All nodes are running Ubuntu 20.04 with kernel 5.4.0-109-generic. They are virtual machines, with the GPU VM being provided with the GPUs via PCI pass-through (see the nvidia-smi output below).

The GPU node has nvidia-container-toolkit version 1.9.0-1 installed, along with nvidia-driver-470-server version 470.103.01-0ubuntu0.20.04.1.

Once the cluster is up, NFD is deployed with:

kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.0"
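As a quick check (the node name here is a placeholder, not taken from the cluster above), the PCI labels that NFD publishes on the GPU node can be listed; the nodeSelector in the next step is based on one of these:

```sh
# List the NFD-generated PCI labels on the GPU node (10de is the NVIDIA PCI vendor ID).
# "gpu-worker-1" is a placeholder node name.
kubectl get node gpu-worker-1 --show-labels | tr ',' '\n' | grep 'pci-'
```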
With NFD in place, the device plugin is installed by templating the Helm install and adding a nodeSelector with the PCI device label corresponding to the node that has the GPUs; a rough sketch of this step follows below.

Once the plugin has deployed, the node is successfully updated to reflect the available GPUs (see the check after the sketch).
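The exact Helm invocation isn't reproduced here; as a rough sketch, with the chart version and the NFD PCI label (class 0302, vendor 10de) being assumptions rather than values taken from this cluster, it looks something like:

```sh
# Sketch only: the chart version and the nodeSelector label are assumptions.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

cat <<'EOF' > dp-values.yaml
nodeSelector:
  feature.node.kubernetes.io/pci-0302_10de.present: "true"
EOF

helm template nvidia-device-plugin nvdp/nvidia-device-plugin --version 0.11.0 \
  --namespace kube-system -f dp-values.yaml | kubectl apply -f -
```

That the GPUs are advertised can then be confirmed from the node's capacity and allocatable resources (placeholder node name again):

```sh
kubectl describe node gpu-worker-1 | grep nvidia.com/gpu
```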
Attempting to deploy a test Pod that targets this node then triggers the problem:
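The manifest used isn't shown here; a minimal sketch of the kind of Pod involved, with the Pod name and image tag being placeholders rather than the ones actually used, is:

```sh
# Minimal GPU test Pod sketch; the name and image tag are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```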
There are no additional logs from the nvidia-device-plugin container.

3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of nvidia-smi -a on your host
- Your docker configuration file (e.g. /etc/docker/daemon.json) - the equivalent here is the containerd configuration; see the K3s paths noted below
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
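For reference, on a K3s node the relevant paths differ from a stock kubelet/Docker setup; the following are from a default K3s install rather than anything specific to this report:

```sh
# K3s does not use /etc/docker/daemon.json; its generated containerd config lives here:
sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# The kubelet runs embedded in the K3s agent, so its logs are in the k3s-agent unit
# (or the k3s unit on server nodes) rather than a separate kubelet service:
sudo journalctl -r -u k3s-agent
```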