
[Docs][k8s] Revamp k8s cluster admin docs (#3649)
* Revamp cluster admin docs

* fixes

* comments

* revert changes

* comments

* minor

* update default auth mechanism

* update ports
romilbhardwaj authored Jun 14, 2024
1 parent 5b2cedb commit a9f5db7
Showing 9 changed files with 581 additions and 348 deletions.
40 changes: 20 additions & 20 deletions docs/source/cloud-setup/cloud-permissions/kubernetes.rst
@@ -9,31 +9,15 @@ for authentication and creating resources on your Kubernetes cluster.
When running inside your Kubernetes cluster (e.g., as a Spot controller or Serve controller),
SkyPilot can operate using any of the following three authentication methods:

1. **Using your local kubeconfig file**: In this case, SkyPilot will
copy your local ``~/.kube/config`` file to the controller pod and use it for
authentication. This is the default method when running inside the cluster,
and no additional configuration is required.

.. note::

If your cluster uses exec based authentication in your ``~/.kube/config`` file
(e.g., GKE uses exec auth by default), SkyPilot may not be able to authenticate using this method. In this case,
consider using the service account methods below.

2. **Creating a service account**: SkyPilot can automatically create the service
1. **Automatically create a service account**: SkyPilot can automatically create the service
account and roles for itself to manage resources in the Kubernetes cluster.
To use this method, set ``remote_identity: SERVICE_ACCOUNT`` in your
Kubernetes configuration in the :ref:`~/.sky/config.yaml <config-yaml>` file:

.. code-block:: yaml

   kubernetes:
     remote_identity: SERVICE_ACCOUNT

This is the default method when running inside the cluster, and no
additional configuration is required.

For details on the permissions that are granted to the service account,
refer to the `Minimum Permissions Required for SkyPilot`_ section below.

3. **Using a custom service account**: If you have a custom service account
2. **Using a custom service account**: If you have a custom service account
with the `necessary permissions <k8s-permissions_>`__, you can configure
SkyPilot to use it by adding this to your :ref:`~/.sky/config.yaml <config-yaml>` file:

@@ -42,6 +26,22 @@ SkyPilot can operate using any of the following three authentication methods:
   kubernetes:
     remote_identity: your-service-account-name

3. **Using your local kubeconfig file**: In this case, SkyPilot will
copy your local ``~/.kube/config`` file to the controller pod and use it for
authentication. To use this method, set ``remote_identity: LOCAL_CREDENTIALS`` in your
Kubernetes configuration in the :ref:`~/.sky/config.yaml <config-yaml>` file:

.. code-block:: yaml

   kubernetes:
     remote_identity: LOCAL_CREDENTIALS

.. note::

If your cluster uses exec-based authentication in your ``~/.kube/config`` file
(e.g., GKE uses exec auth by default), SkyPilot may not be able to authenticate using this method. In this case,
consider using the service account methods above.

.. note::

Service account based authentication applies only when the remote SkyPilot
2 changes: 1 addition & 1 deletion docs/source/reference/kubernetes/index.rst
@@ -100,7 +100,7 @@ Table of Contents
.. toctree::
:hidden:

kubernetes-getting-started
Getting Started <kubernetes-getting-started>
kubernetes-setup
kubernetes-troubleshooting

270 changes: 270 additions & 0 deletions docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -0,0 +1,270 @@
.. _kubernetes-deployment:

Deployment Guides
-----------------
Below we include minimal guides to set up a new Kubernetes cluster in different environments, including hosted services on the cloud.

.. grid:: 2
:gutter: 3

.. grid-item-card:: Local Development Cluster
:link: kubernetes-setup-kind
:link-type: ref
:text-align: center

Run a local Kubernetes cluster on your laptop with ``sky local up``.

.. grid-item-card:: On-prem Clusters (RKE2, K3s, etc.)
:link: kubernetes-setup-onprem
:link-type: ref
:text-align: center

For on-prem deployments with kubeadm, RKE2, K3s or other distributions.

.. grid-item-card:: Google Cloud - GKE
:link: kubernetes-setup-gke
:link-type: ref
:text-align: center

Google's hosted Kubernetes service.

.. grid-item-card:: Amazon - EKS
:link: kubernetes-setup-eks
:link-type: ref
:text-align: center

Amazon's hosted Kubernetes service.

.. _kubernetes-setup-kind:


Deploying locally on your laptop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To try out SkyPilot on Kubernetes on your laptop or run SkyPilot
tasks locally without requiring any cloud access, we provide the
:code:`sky local up` CLI to create a 1-node Kubernetes cluster locally.

Under the hood, :code:`sky local up` uses `kind <https://kind.sigs.k8s.io/>`_,
a tool for creating a Kubernetes cluster on your local machine.
It runs a Kubernetes cluster inside a container, so no setup is required.

1. Install `Docker <https://docs.docker.com/engine/install/>`_ and `kind <https://kind.sigs.k8s.io/>`_.
2. Run :code:`sky local up` to launch a Kubernetes cluster and automatically configure your kubeconfig file:

.. code-block:: console

   $ sky local up

3. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. You can now run SkyPilot tasks on this locally hosted Kubernetes cluster using :code:`sky launch`.
4. After you are done using the cluster, you can remove it with :code:`sky local down`. This will destroy the local Kubernetes cluster and switch your kubeconfig back to its original context:

.. code-block:: console

   $ sky local down

.. note::
We recommend allocating at least 4 CPUs to your Docker runtime to
ensure kind has enough resources. See instructions to increase CPU allocation
`here <https://kind.sigs.k8s.io/docs/user/known-issues/#failure-to-build-node-image>`_.

.. note::
kind does not support multiple nodes or GPUs.
It is not recommended for use in a production environment.
If you want to run a private on-prem cluster, see the section on :ref:`on-prem deployment <kubernetes-setup-onprem>` for more.


.. _kubernetes-setup-gke:

Deploying on Google Cloud GKE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Create a GKE standard cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.

.. raw:: html

<details>

<summary>Example: create a GKE cluster with 2 nodes, each having 16 CPUs.</summary>

.. code-block:: bash

   PROJECT_ID=$(gcloud config get-value project)
   CLUSTER_NAME=testcluster
   gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.29.4-gke.1043002" --release-channel "regular" --machine-type "e2-standard-16" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "2" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"

.. raw:: html

</details>


2. Get the kubeconfig for your cluster. The following command will automatically update ``~/.kube/config`` with a new kubecontext for the GKE cluster:

.. code-block:: console

   $ gcloud container clusters get-credentials <cluster-name> --region <region>
   # Example:
   # gcloud container clusters get-credentials testcluster --region us-central1-c

3. [If using GPUs] If your GKE nodes have GPUs, you may need to
`manually install <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/>`_
Nvidia drivers. You can do so by deploying the daemonset
corresponding to the GPU and OS of your nodes:

.. code-block:: console

   # For Container Optimized OS (COS) based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
   $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

   # For Container Optimized OS (COS) based nodes with L4 GPUs:
   $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

   # For Ubuntu based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
   $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

   # For Ubuntu based nodes with L4 GPUs:
   $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml

To verify that GPU drivers are set up, run ``kubectl describe nodes`` and check that ``nvidia.com/gpu`` is listed under the ``Capacity`` section.
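
For instance, one quick way (among others) to surface the reported GPU capacity across nodes is:

.. code-block:: console

   $ # Filters the node descriptions down to hostnames and GPU counts.
   $ kubectl describe nodes | grep -E 'Hostname:|nvidia.com/gpu'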

4. Verify your Kubernetes cluster is correctly set up for SkyPilot by running :code:`sky check`:

.. code-block:: console

   $ sky check

5. [If using GPUs] Check the available GPUs in the Kubernetes cluster with :code:`sky show-gpus --cloud kubernetes`:

.. code-block:: console

   $ sky show-gpus --cloud kubernetes
   GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
   L4    1, 2, 3, 4    8           6
   A100  1, 2          4           2

.. note::
GKE Autopilot clusters are currently not supported. Only GKE Standard clusters are supported.


.. _kubernetes-setup-eks:

Deploying on Amazon EKS
^^^^^^^^^^^^^^^^^^^^^^^

1. Create an EKS cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
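
For example, a cluster matching these recommendations can be created with `eksctl <https://eksctl.io/>`_. This is a minimal sketch; the cluster name, region, instance type and node count below are placeholders to adjust for your setup:

.. code-block:: console

   $ # Assumes eksctl is installed and AWS credentials are configured.
   $ eksctl create cluster --name testcluster --region us-west-2 \
       --nodegroup-name ng0 --node-type m5.xlarge --nodes 2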

2. Get the kubeconfig for your cluster. The following command will automatically update ``~/.kube/config`` with a new kubecontext for the EKS cluster:

.. code-block:: console

   $ aws eks update-kubeconfig --name <cluster-name> --region <region>
   # Example:
   # aws eks update-kubeconfig --name testcluster --region us-west-2

3. [If using GPUs] EKS clusters already come with Nvidia drivers set up. However, you will need to label the nodes with the GPU type. Use the SkyPilot node labelling tool to do so:

.. code-block:: console

   $ python -m sky.utils.kubernetes.gpu_labeler

This will create a job on each node to read the GPU type from ``nvidia-smi`` and assign a ``skypilot.co/accelerator`` label to the node. You can check the status of these jobs by running:

.. code-block:: console

   $ kubectl get jobs -n kube-system

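Once the labeler jobs complete, you can optionally confirm the labels were applied (``-L`` adds the label's value as a column in the ``kubectl get nodes`` output):

.. code-block:: console

   $ kubectl get nodes -L skypilot.co/accelerator
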
4. Verify your Kubernetes cluster is correctly set up for SkyPilot by running :code:`sky check`:

.. code-block:: console

   $ sky check

5. [If using GPUs] Check the available GPUs in the Kubernetes cluster with :code:`sky show-gpus --cloud kubernetes`:

.. code-block:: console

   $ sky show-gpus --cloud kubernetes
   GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
   A100  1, 2          4           2

.. _kubernetes-setup-onprem:

Deploying on on-prem clusters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools,
such as `kubeadm <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/>`_,
`k3s <https://docs.k3s.io/quick-start>`_ or
`Rancher <https://ranchermanager.docs.rancher.com/v2.5/pages-for-subheaders/kubernetes-clusters-in-rancher-setup>`_.
Please follow their respective guides to deploy your Kubernetes cluster.
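
For instance, a single-node K3s cluster can be brought up with its quick-start script (a sketch following the K3s quick-start guide; review any script before piping it into a shell):

.. code-block:: console

   $ curl -sfL https://get.k3s.io | sh -
   $ # K3s writes its kubeconfig here; point kubectl at it:
   $ export KUBECONFIG=/etc/rancher/k3s/k3s.yaml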


.. _kubernetes-setup-onprem-distro-specific:

Notes for specific Kubernetes distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some Kubernetes distributions require additional steps to set up GPU support.
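
The Helm commands below install the ``nvidia/gpu-operator`` chart; if you have not yet added Nvidia's Helm repository, add it first (an assumed prerequisite that the distribution-specific snippets do not show):

.. code-block:: console

   $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update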

Rancher Kubernetes Engine 2 (RKE2)
**********************************

Installing the Nvidia GPU operator on RKE2 through Helm requires extra flags to set ``nvidia`` as the default runtime for containerd.

.. code-block:: console

   $ helm install gpu-operator -n gpu-operator --create-namespace \
     nvidia/gpu-operator $HELM_OPTIONS \
     --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
     --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
     --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
     --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
     --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
     --set 'toolkit.env[2].value=nvidia' \
     --set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
     --set-string 'toolkit.env[3].value=true'

Refer to instructions on `Nvidia GPU Operator installation with Helm on RKE2 <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2>`_ for details.

K3s
***

Installing the Nvidia GPU operator on K3s is similar to the `RKE2 instructions from Nvidia <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2>`_, but requires changing
the ``CONTAINERD_CONFIG`` variable to ``/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl``. Here is an example command to install the Nvidia GPU operator on K3s:

.. code-block:: console

   $ helm install gpu-operator -n gpu-operator --create-namespace \
     nvidia/gpu-operator $HELM_OPTIONS \
     --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
     --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
     --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
     --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
     --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
     --set 'toolkit.env[2].value=nvidia'

Check the status of the GPU operator installation by running ``kubectl get pods -n gpu-operator``. It takes a few minutes to install and some CrashLoopBackOff errors are expected during the installation process.
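
For example, to list the operator pods and their statuses:

.. code-block:: console

   $ kubectl get pods -n gpu-operator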

.. tip::

If your gpu-operator installation stays stuck in CrashLoopBackOff, you may need to create a symlink to the ``ldconfig`` binary to work around a `known issue <https://github.com/NVIDIA/nvidia-docker/issues/614#issuecomment-423991632>`_ with nvidia-docker runtime. Run the following command on your nodes:

.. code-block:: console

   $ ln -s /sbin/ldconfig /sbin/ldconfig.real

After the GPU operator is installed, create the nvidia RuntimeClass required by K3s. This runtime class will automatically be used by SkyPilot to schedule GPU pods:

.. code-block:: console

   $ kubectl apply -f - <<EOF
   apiVersion: node.k8s.io/v1
   kind: RuntimeClass
   metadata:
     name: nvidia
   handler: nvidia
   EOF

12 changes: 11 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -20,7 +20,7 @@ To connect and use a Kubernetes cluster, SkyPilot needs:
In a typical workflow:

1. A cluster administrator sets up a Kubernetes cluster. Refer to admin guides for
:ref:`Kubernetes cluster setup <kubernetes-setup>` for different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) and :ref:`required permissions <cloud-permissions-kubernetes>`.
:ref:`Kubernetes cluster setup <kubernetes-setup>` for different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging).

2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig
files containing their credentials (`kube-context <https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/#define-clusters-users-and-contexts>`_).
@@ -52,6 +52,16 @@ Once your cluster administrator has :ref:`setup a Kubernetes cluster <kubernetes
You can verify your credentials are set up correctly by running :code:`kubectl get pods`.

.. note::

If your cluster administrator has also provided you with a specific service account to use, set it in your ``~/.sky/config.yaml`` file:

.. code-block:: yaml

   kubernetes:
     remote_identity: your-service-account-name

2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot.

.. code-block:: console
