[k8s][docs] Clarify nvidia runtime is required for k8s (#2957)
* RKE2 instructions

* GPU test pod
romilbhardwaj authored Jan 9, 2024
1 parent a31fdd9 commit 1e53317
Showing 2 changed files with 29 additions and 1 deletion.
15 changes: 14 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-setup.rst
@@ -169,9 +169,22 @@ Setting up GPU support
~~~~~~~~~~~~~~~~~~~~~~
If your Kubernetes cluster has Nvidia GPUs, ensure that:

1. The Nvidia device plugin is installed (i.e., ``nvidia.com/gpu`` resource is available on each node).
1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100` (an example command is shown below).
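
Assuming a hypothetical node name (``my-node``) and V100 GPUs, the GPU resource can be checked and the label applied with:

.. code-block:: console

   $ kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
   # Shows the allocatable nvidia.com/gpu count per node
   $ kubectl label nodes my-node skypilot.co/accelerators=v100
   # Labels the (hypothetical) node my-node as having V100 GPUs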

.. tip::

    You can check if the GPU operator is installed and the ``nvidia`` runtime is set as default by running:

    .. code-block:: console

      $ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
      $ watch kubectl get pods
      # If the pod status changes to completed after a few minutes, your Kubernetes environment is set up correctly.

.. note::

    If you are using RKE2, the GPU operator installation through Helm requires extra flags to set ``nvidia`` as the default runtime for containerd. Refer to the instructions on `Nvidia GPU Operator installation with Helm on RKE2 <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#custom-configuration-for-runtime-containerd>`_ for details.
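
    As an illustrative sketch only (values follow the Nvidia guide linked above and may differ across GPU operator and RKE2 versions), the Helm installation typically points the operator at RKE2's containerd paths and makes ``nvidia`` the default runtime:

    .. code-block:: console

      $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
      $ helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
          --set "toolkit.env[0].name=CONTAINERD_CONFIG" \
          --set "toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl" \
          --set "toolkit.env[1].name=CONTAINERD_SOCKET" \
          --set "toolkit.env[1].value=/run/k3s/containerd/containerd.sock" \
          --set "toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS" \
          --set "toolkit.env[2].value=nvidia" \
          --set "toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT" \
          --set-string "toolkit.env[3].value=true"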

We provide a convenience script that automatically detects GPU types and labels each node. You can run it with:

.. code-block:: console
15 changes: 15 additions & 0 deletions tests/kubernetes/gpu_test_pod.yaml
@@ -0,0 +1,15 @@
# Runs nvidia-smi in a pod to test that the GPU operator and nvidia runtime are set up correctly
# Run with: kubectl apply -f gpu_test_pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: nvidia-smi
spec:
restartPolicy: Never
containers:
- name: nvidia-smi
image: nvidia/cuda:12.3.1-devel-ubuntu20.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: "1"
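
As a usage sketch (these commands are not part of the committed file), once the pod completes you can inspect its output and clean it up with:

$ kubectl apply -f tests/kubernetes/gpu_test_pod.yaml
$ kubectl logs pod/nvidia-smi
# Should print the nvidia-smi table if the GPU operator and nvidia runtime are set up correctly
$ kubectl delete pod nvidia-smi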
