[k8s][docs] Clarify nvidia runtime is required for k8s (#2957)
* RKE2 instructions

* GPU test pod
romilbhardwaj authored Jan 9, 2024
1 parent a31fdd9 commit 1e53317
Showing 2 changed files with 29 additions and 1 deletion.
15 changes: 14 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-setup.rst
@@ -169,9 +169,22 @@ Setting up GPU support
~~~~~~~~~~~~~~~~~~~~~~
If your Kubernetes cluster has Nvidia GPUs, ensure that:

1. The Nvidia device plugin is installed (i.e., ``nvidia.com/gpu`` resource is available on each node).
1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100` (an example command is shown below).
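
Assuming a hypothetical node name (``my-node``) and V100 GPUs, the GPU resource can be checked and the label applied with:

.. code-block:: console

   $ kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
   # Shows the allocatable nvidia.com/gpu count per node
   $ kubectl label nodes my-node skypilot.co/accelerators=v100
   # Labels the (hypothetical) node my-node as having V100 GPUs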

.. tip::

    You can check if the GPU operator is installed and the ``nvidia`` runtime is set as default by running:

    .. code-block:: console

      $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
      $ watch kubectl get pods
      # If the pod status changes to completed after a few minutes, your Kubernetes environment is set up correctly.

.. note::

    If you are using RKE2, the GPU operator installation through Helm requires extra flags to set ``nvidia`` as the default runtime for containerd. Refer to the instructions on `Nvidia GPU Operator installation with Helm on RKE2 <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#custom-configuration-for-runtime-containerd>`_ for details.
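
    As an illustrative sketch only (values follow the Nvidia guide linked above and may differ across GPU operator and RKE2 versions), the Helm installation typically points the operator at RKE2's containerd paths and makes ``nvidia`` the default runtime:

    .. code-block:: console

      $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
      $ helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
          --set "toolkit.env[0].name=CONTAINERD_CONFIG" \
          --set "toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl" \
          --set "toolkit.env[1].name=CONTAINERD_SOCKET" \
          --set "toolkit.env[1].value=/run/k3s/containerd/containerd.sock" \
          --set "toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS" \
          --set "toolkit.env[2].value=nvidia" \
          --set "toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT" \
          --set-string "toolkit.env[3].value=true"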

We provide a convenience script that automatically detects GPU types and labels each node. You can run it with:

.. code-block:: console
15 changes: 15 additions & 0 deletions tests/kubernetes/gpu_test_pod.yaml
@@ -0,0 +1,15 @@
# Runs nvidia-smi in a pod to test that the GPU operator and nvidia runtime are set up correctly
# Run with: kubectl apply -f gpu_test_pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: nvidia-smi
spec:
restartPolicy: Never
containers:
- name: nvidia-smi
image: nvidia/cuda:12.3.1-devel-ubuntu20.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: "1"
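
As a usage sketch (these commands are not part of the committed file), once the pod completes you can inspect its output and clean it up with:

$ kubectl apply -f tests/kubernetes/gpu_test_pod.yaml
$ kubectl logs pod/nvidia-smi
# Should print the nvidia-smi table if the GPU operator and nvidia runtime are set up correctly
$ kubectl delete pod nvidia-smi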
