ImagePullBackOff caused by redundant information from the operator #647
Comments
This happens both with and without the
❯ k describe po nvidia-driver-daemonset-pmmz5
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22s default-scheduler Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-pmmz5 to rhode
Normal Pulled 21s kubelet Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5" already present on machine
Normal Created 21s kubelet Created container k8s-driver-manager
Normal Started 21s kubelet Started container k8s-driver-manager
Normal Pulling 8s (x2 over 19s) kubelet Pulling image "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1"
Warning Failed 6s (x2 over 18s) kubelet Failed to pull image "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1": failed to resolve reference "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1": nvcr.io/nvidia/driver:535.129.03-talosv1.6.1: not found
Warning Failed 6s (x2 over 18s) kubelet Error: ErrImagePull
I tried to set the https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags
Hi @uhthomas, Talos is not a supported distribution. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms
I am not familiar with Talos at all, but if you wanted to force the operator to pull the driver image for one of our supported distros, like
Note, it is likely that the ubuntu22.04 image will fail to install the driver successfully on a different distribution.
One way to proceed is to install the NVIDIA drivers following Talos's official guide: https://www.talos.dev/v1.6/talos-guides/configuration/nvidia-gpu/ and then install the GPU Operator with
I've been working on making GPU Operator and related components work out of the box on Talos. There is some work left to do. See #1007, NVIDIA/nvidia-container-toolkit#700.
For Talos, once some of the issues in NVIDIA components have been resolved, siderolabs/extensions#476 will provide a host driver installation compatible with the GPU Operator.
I've also talked with SideroLabs on supporting driver containers, but that would require some changes in Talos. For security, they remove
Additionally, if you use secure boot, then no pre-built driver container will work, because SideroLabs throws away the kernel module signing key after they build the kernel and kernel module packages. I've also talked with them about that, and there just isn't a clear solution that works for every customer and use case.
As with many things in the Linux world, if you're serious about security, your best option is likely to build some of Talos from source (kernel and kernel modules) and manage secure boot and kernel keys with your own key infrastructure. In such a scenario, either a host driver extension as linked above or driver containers will work, provided you retain the private keys needed to sign the kernel modules.
As for the redundant information, all the operator is doing is concatenating node labels:
On Talos, it just happens that both the kernel version and the OS release ID have "talos" in them.
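To make the concatenation concrete, here is a minimal sketch of how a tag like the one in the events above could be assembled. It is not the operator's actual code. The function name `driver_image` is hypothetical, the label keys are assumed to be the OS release labels that Node Feature Discovery publishes, and the Talos label values are inferred from the failing tag `535.129.03-talosv1.6.1` rather than read from a real node.

```python
# Hypothetical sketch, not the GPU Operator's source code.
# Assumption: the OS ID and version come from these Node Feature Discovery labels.
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def driver_image(repository: str, driver_version: str, node_labels: dict) -> str:
    """Build an image reference as <repository>:<driver version>-<OS ID><OS version>."""
    os_id = node_labels[OS_ID_LABEL]
    os_version = node_labels[OS_VERSION_LABEL]
    return f"{repository}:{driver_version}-{os_id}{os_version}"


# Assumed label values: Talos values are inferred from the failing tag in the events.
talos_labels = {OS_ID_LABEL: "talos", OS_VERSION_LABEL: "v1.6.1"}
ubuntu_labels = {OS_ID_LABEL: "ubuntu", OS_VERSION_LABEL: "22.04"}

print(driver_image("nvcr.io/nvidia/driver", "535.129.03", talos_labels))
print(driver_image("nvcr.io/nvidia/driver", "535.129.03", ubuntu_labels))
```

Under these assumptions the Talos node yields the reference from the pull error, while an Ubuntu 22.04 node yields a tag with the `ubuntu22.04` suffix mentioned earlier in the thread. The tag is well formed in both cases; the pull fails on Talos only because no image is published under that tag.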
1. Quick Debug Information
2. Issue or feature description
The operator tries to pull an image that does not exist, apparently because the tag it builds includes redundant information such as the kernel and OS.
3. Steps to reproduce the issue
Deploy the GPU operator with the default configuration on a Talos Kubernetes cluster.
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]