
Commit

Merge branch 'master' of https://github.com/skypilot-org/skypilot into k8s_multik8s_state2

# Conflicts:
#	sky/adaptors/kubernetes.py
romilbhardwaj committed Sep 10, 2024
2 parents ef9da04 + 2d4059a commit 438e5b8
Showing 22 changed files with 421 additions and 56 deletions.
2 changes: 1 addition & 1 deletion docs/source/examples/auto-failover.rst
@@ -206,7 +206,7 @@ If a task would like to specify multiple candidate resources (not only GPUs), th

The specified regions that do not have the accelerator will be ignored automatically.

- This will genereate the following output:
+ This will generate the following output:

.. code-block:: console
2 changes: 1 addition & 1 deletion docs/source/reference/config.rst
@@ -90,7 +90,7 @@ Available fields and semantics:
# Advanced AWS configurations (optional).
# Apply to all new instances but not existing ones.
aws:
-  # Tags to assign to all instances launched by SkyPilot (optional).
+  # Tags to assign to all instances and buckets created by SkyPilot (optional).
   #
   # Example use case: cost tracking by user/team/project.
   #
20 changes: 19 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -35,6 +35,13 @@ Below we include minimal guides to set up a new Kubernetes cluster in different

    Amazon's hosted Kubernetes service.

+   .. grid-item-card:: On-demand Cloud VMs
+       :link: kubernetes-setup-ondemand
+       :link-type: ref
+       :text-align: center
+
+       We provide scripts to deploy k8s on on-demand cloud VMs.

.. _kubernetes-setup-kind:


@@ -267,4 +274,15 @@ After the GPU operator is installed, create the nvidia RuntimeClass required by
   metadata:
     name: nvidia
   handler: nvidia
-  EOF
+  EOF
+
+.. _kubernetes-setup-ondemand:
+
+Deploying on cloud VMs
+^^^^^^^^^^^^^^^^^^^^^^
+
+You can also spin up on-demand cloud VMs and deploy Kubernetes on them.
+
+We provide scripts to take care of provisioning VMs, installing Kubernetes, setting up GPU support, and configuring your local kubeconfig.
+Refer to our `Deploying Kubernetes on VMs guide <https://github.com/skypilot-org/skypilot/tree/master/examples/k8s_cloud_deploy>`_ for more details.
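
For example, with the SkyPilot repository checked out, the deployment boils down to three commands (a sketch; see the guide above for details):

.. code-block:: console

   $ git clone https://github.com/skypilot-org/skypilot.git
   $ cd skypilot/examples/k8s_cloud_deploy
   $ ./launch_k8s.sh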
96 changes: 96 additions & 0 deletions examples/k8s_cloud_deploy/README.md
@@ -0,0 +1,96 @@
# Deploying a Kubernetes cluster on the cloud in 1-click with SkyPilot

This example demonstrates how to deploy a Kubernetes cluster on the cloud with SkyPilot. For the purposes of this guide, we will use Lambda Cloud as the cloud provider, but you can switch clouds by editing `cloud_k8s.yaml`.
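
For instance, deploying on AWS instead is a matter of editing the `resources` section of `cloud_k8s.yaml` (an illustrative sketch, not the shipped default; pick an accelerator the chosen cloud actually offers):

```yaml
# Illustrative edit to cloud_k8s.yaml
resources:
  cloud: aws            # or gcp, azure, ...
  accelerators: A10G:1  # AWS carries A10G rather than A10
  ports: 6443           # clouds that support ports can expose the API server directly

num_nodes: 2
```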

## Prerequisites
1. Latest SkyPilot nightly release:
```bash
pip install "skypilot-nightly[lambda,kubernetes]"
```

2. Use a cloud that supports opening ports on SkyPilot, or manually expose ports 6443 and 443 on the VMs. This is required to expose the k8s API server.

For example, if using Lambda Cloud, configure the firewall on the Lambda Cloud dashboard to allow inbound connections on ports `443` and `6443`.

<p align="center">
<img src="https://i.imgur.com/uSA7BMH.png" alt="firewall" width="500"/>
</p>

## Instructions

1. Edit `cloud_k8s.yaml` to set the desired number of workers and GPUs per node. If using GCP, AWS, or Azure, uncomment the `ports` line to allow inbound connections to the Kubernetes API server.
```yaml
resources:
  cloud: lambda
  accelerators: A10:1
  # ports: 6443

num_nodes: 2
```
2. Use the convenience script to launch the cluster:
```bash
./launch_k8s.sh
```

SkyPilot will do all the heavy lifting for you: provision the Lambda VMs, deploy the k8s cluster, fetch the kubeconfig, and set up your local kubectl to connect to the cluster.

3. You should now be able to run `kubectl` and `sky` commands to interact with the cluster:
```console
$ kubectl get nodes
NAME              STATUS   ROLES                  AGE   VERSION
129-80-133-44     Ready    <none>                 14m   v1.30.4+k3s1
150-230-191-161   Ready    control-plane,master   14m   v1.30.4+k3s1

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU  QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
A10  1             2           2

Kubernetes per node GPU availability
NODE_NAME        GPU_NAME  TOTAL_GPUS  FREE_GPUS
129-80-133-44    A10       1           1
150-230-191-161  A10       1           1
```

## Run AI workloads on your Kubernetes cluster with SkyPilot

### Development clusters
To launch a [GPU-enabled development cluster](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), run `sky launch -c mycluster --cloud kubernetes --gpus A10:1`.

SkyPilot will set up the SSH config for you (see the sketch after this list):
* [SSH access](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html#ssh): `ssh mycluster`
* [VSCode remote development](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html#vscode): `code --remote ssh-remote+mycluster "/"`
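
Under the hood, this works through an entry SkyPilot writes to your SSH config. It looks roughly like the following sketch (the IP, user, and key path are illustrative and vary by cloud):

```
Host mycluster
  HostName 150.230.191.161
  User ubuntu
  IdentityFile ~/.ssh/sky-key
```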


### Jobs
To run jobs, use `sky jobs launch --gpus A10:1 --cloud kubernetes -- 'nvidia-smi; sleep 600'`.
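
Equivalently, the job can be written as a task YAML and submitted with `sky jobs launch job.yaml` (the file name is hypothetical; the fields mirror those in `cloud_k8s.yaml` above):

```yaml
# job.yaml -- a minimal managed job definition (illustrative)
resources:
  cloud: kubernetes
  accelerators: A10:1

run: |
  nvidia-smi
  sleep 600
```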

You can submit multiple jobs and let SkyPilot handle queuing if the cluster runs out of resources:
```console
$ sky jobs queue
Fetching managed job statuses...
Managed jobs
In progress tasks: 2 RUNNING, 1 STARTING
ID  TASK  NAME      RESOURCES  SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
3   -     finetune  1x[A10:1]  24 secs ago  24s            -             0            STARTING
2   -     qlora     1x[A10:1]  2 min ago    2m 18s         12s           0            RUNNING
1   -     sky-cmd   1x[A10:1]  4 mins ago   4m 27s         3m 12s        0            RUNNING
```

You can also observe the pods created by SkyPilot with `kubectl get pods`:
```console
$ kubectl get pods
NAME                                     READY   STATUS    RESTARTS   AGE
qlora-2-2ea4-head                        1/1     Running   0          5m31s
sky-cmd-1-2ea4-head                      1/1     Running   0          8m36s
sky-jobs-controller-2ea485ea-2ea4-head   1/1     Running   0          10m
```

Refer to the [SkyPilot docs](https://skypilot.readthedocs.io/) for more details.

## Teardown
To tear down the Kubernetes cluster, run:
```bash
sky down k8s
```
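
Note: `launch_k8s.sh` backs up any pre-existing kubeconfig to `~/.kube/config.bak` before overwriting it. To restore your original kubeconfig after teardown:

```bash
mv ~/.kube/config.bak ~/.kube/config
```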
96 changes: 96 additions & 0 deletions examples/k8s_cloud_deploy/cloud_k8s.yaml
@@ -0,0 +1,96 @@
resources:
  cloud: lambda
  accelerators: A10:1
  # Uncomment the following line to expose ports on a different cloud
  # ports: 6443

num_nodes: 2

envs:
  SKY_K3S_TOKEN: mytoken  # Can be any string, used to join worker nodes to the cluster

run: |
  wait_for_gpu_operator_installation() {
    echo "Starting wait for GPU operator installation..."
    SECONDS=0
    TIMEOUT=600  # 10 minutes in seconds
    while true; do
      if kubectl describe nodes --kubeconfig ~/.kube/config | grep -q 'nvidia.com/gpu:'; then
        echo "GPU operator installed."
        break
      elif [ $SECONDS -ge $TIMEOUT ]; then
        echo "Timed out waiting for GPU operator installation."
        exit 1
      else
        echo "Waiting for GPU operator installation..."
        echo "To check status, see Nvidia GPU operator pods:"
        echo "kubectl get pods -n gpu-operator --kubeconfig ~/.kube/config"
        sleep 5
      fi
    done
  }
  if [ ${SKYPILOT_NODE_RANK} -ne 0 ]; then
    # Worker nodes
    MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
    echo "Worker joining k3s cluster @ ${MASTER_ADDR}"
    curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_ADDR}:6443 K3S_TOKEN=${SKY_K3S_TOKEN} sh -
    exit 0
  fi
  # Head node
  curl -sfL https://get.k3s.io | K3S_TOKEN=${SKY_K3S_TOKEN} sh -
  # Copy over kubeconfig file
  echo "Copying kubeconfig file"
  mkdir -p $HOME/.kube
  sudo cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
  # Wait for k3s to be ready
  echo "Waiting for k3s to be ready"
  sleep 5
  kubectl wait --for=condition=ready node --all --timeout=5m --kubeconfig ~/.kube/config
  # =========== GPU support ===========
  # Install helm
  echo "Installing helm"
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
  # Create namespace if it doesn't exist
  echo "Creating namespace gpu-operator"
  kubectl create namespace gpu-operator --kubeconfig ~/.kube/config || true
  # Patch ldconfig
  echo "Patching ldconfig"
  sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
  # Install GPU operator
  echo "Installing GPU operator"
  helm install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator $HELM_OPTIONS \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
    --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[2].value=nvidia'
  wait_for_gpu_operator_installation
  # Create RuntimeClass
  sleep 5
  echo "Creating RuntimeClass"
  kubectl apply --kubeconfig ~/.kube/config -f - <<EOF
  apiVersion: node.k8s.io/v1
  kind: RuntimeClass
  metadata:
    name: nvidia
  handler: nvidia
  EOF
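
While the GPU operator is installing, you can follow the hints the script itself prints and watch progress (run on the head node, or locally once the kubeconfig has been synced):

```bash
# NVIDIA GPU operator pods should eventually reach Running/Completed
kubectl get pods -n gpu-operator --kubeconfig ~/.kube/config
# GPUs become schedulable once 'nvidia.com/gpu' appears in node capacity
kubectl describe nodes --kubeconfig ~/.kube/config | grep 'nvidia.com/gpu:'
```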
87 changes: 87 additions & 0 deletions examples/k8s_cloud_deploy/launch_k8s.sh
@@ -0,0 +1,87 @@
#!/bin/bash
echo -e "\033[1m===== SkyPilot Kubernetes cluster deployment script =====\033[0m"
echo -e "This script will deploy a Kubernetes cluster with the cloud and GPUs specified in cloud_k8s.yaml.\n"

set -ex

# Read cluster name from environment variable if it exists, else use default value
CLUSTER_NAME=${CLUSTER_NAME:-k8s}

# Deploy the k8s cluster
sky launch -y -c ${CLUSTER_NAME} cloud_k8s.yaml

# Get the endpoint of the k8s cluster
# Attempt to get the primary endpoint and handle any errors
PRIMARY_ENDPOINT=""
SKY_STATUS_OUTPUT=$(SKYPILOT_DEBUG=0 sky status --endpoint 6443 ${CLUSTER_NAME} 2>&1) || true

# Check if the command was successful and if the output contains a valid IP address
if [[ "$SKY_STATUS_OUTPUT" != *"ValueError"* ]]; then
  PRIMARY_ENDPOINT="$SKY_STATUS_OUTPUT"
else
  echo "Primary endpoint retrieval failed or unsupported. Falling back to alternate method..."
fi

# If primary endpoint is empty or invalid, try to fetch from SSH config
if [[ -z "$PRIMARY_ENDPOINT" ]]; then
  echo "Using alternate method to fetch endpoint..."

  # Parse the HostName from the SSH config file
  SSH_CONFIG_FILE="$HOME/.sky/generated/ssh/${CLUSTER_NAME}"
  if [[ -f "$SSH_CONFIG_FILE" ]]; then
    ENDPOINT=$(awk '/^ *HostName / { print $2; exit}' "$SSH_CONFIG_FILE")
    ENDPOINT="${ENDPOINT}:6443"
  fi

  if [[ -z "$ENDPOINT" ]]; then
    echo "Failed to retrieve a valid endpoint. Exiting."
    exit 1
  fi
else
  ENDPOINT="$PRIMARY_ENDPOINT"
  echo "Using primary endpoint: $ENDPOINT"
fi

KUBECONFIG_FILE="$HOME/.kube/config"

# Back up the user's original kubeconfig, if one exists, before overwriting it
if [[ -f "$KUBECONFIG_FILE" ]]; then
  echo "Backing up kubeconfig file to ${KUBECONFIG_FILE}.bak"
  cp "$KUBECONFIG_FILE" "${KUBECONFIG_FILE}.bak"
fi

# Rsync the remote kubeconfig to the local machine
mkdir -p ~/.kube
rsync -av ${CLUSTER_NAME}:'~/.kube/config' ~/.kube/config

# Temporary file to hold the modified kubeconfig
TEMP_FILE=$(mktemp)

# Remove the certificate-authority-data, and replace the server address with the public endpoint
awk '
BEGIN { in_cluster = 0 }
/^clusters:/ { in_cluster = 1 }
/^users:/ { in_cluster = 0 }
in_cluster && /^ *certificate-authority-data:/ { next }
in_cluster && /^ *server:/ {
  print "    server: https://'${ENDPOINT}'"
  print "    insecure-skip-tls-verify: true"
  next
}
{ print }
' "$KUBECONFIG_FILE" > "$TEMP_FILE"

# Replace the original kubeconfig with the modified one
mv "$TEMP_FILE" "$KUBECONFIG_FILE"

echo "Updated kubeconfig file successfully."

sleep 5 # Wait for the cluster to be ready
sky check kubernetes

set +x
echo -e "\033[1m===== Kubernetes cluster deployment complete =====\033[0m"
echo -e "You can now access your k8s cluster with kubectl and skypilot.\n"
echo -e "• View the list of available GPUs on Kubernetes: \033[1msky show-gpus --cloud kubernetes\033[0m"
echo -e "• To launch a SkyPilot job running nvidia-smi on this cluster: \033[1msky launch --cloud kubernetes --gpus <GPU> -- nvidia-smi\033[0m"
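
For reference, after the script completes, the rewritten cluster entry in `~/.kube/config` should look roughly like this (endpoint illustrative; k3s names the cluster `default`):

```yaml
clusters:
- cluster:
    server: https://150.230.191.161:6443
    insecure-skip-tls-verify: true
  name: default
```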

2 changes: 1 addition & 1 deletion examples/resnet_distributed_torch.yaml
@@ -11,7 +11,7 @@ setup: |
git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
cd pytorch-distributed-resnet
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
- pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
+ pip3 install -r requirements.txt numpy==1.26.4 torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
mkdir -p data && mkdir -p saved_models && cd data && \
wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvzf cifar-10-python.tar.gz
7 changes: 4 additions & 3 deletions examples/tpu/tpuvm_mnist.yaml
@@ -14,10 +14,11 @@ setup: |
conda create -n flax python=3.10 -y
conda activate flax
# Make sure to install TPU related packages in a conda env to avoid package conflicts.
- pip install "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
- pip install clu
+ pip install \
+     -f https://storage.googleapis.com/jax-releases/libtpu_releases.html "jax[tpu]==0.4.25" \
+     clu \
+     tensorflow tensorflow-datasets
pip install -e flax
- pip install tensorflow tensorflow-datasets
fi
17 changes: 11 additions & 6 deletions sky/adaptors/azure.py
@@ -177,19 +177,24 @@ def get_client(name: str,
         container_client = blob.ContainerClient.from_container_url(
             container_url, credential)
         try:
+            # Suppress noisy logs from Azure SDK when attempting
+            # to run exists() on private container without access.
+            # Reference:
+            # https://github.com/Azure/azure-sdk-for-python/issues/9422
+            azure_logger = logging.getLogger('azure')
+            original_level = azure_logger.getEffectiveLevel()
+            azure_logger.setLevel(logging.CRITICAL)
             container_client.exists()
+            azure_logger.setLevel(original_level)
             return container_client
         except exceptions().ClientAuthenticationError as e:
             # Caught when user attempted to use private container
-            # without access rights.
+            # without access rights. The raised error is handled
+            # upstream.
             # Reference: https://learn.microsoft.com/en-us/troubleshoot/azure/entra/entra-id/app-integration/error-code-aadsts50020-user-account-identity-provider-does-not-exist # pylint: disable=line-too-long
             if 'ERROR: AADSTS50020' in str(e):
                 with ux_utils.print_exception_no_traceback():
                     raise sky_exceptions.StorageBucketGetError(
                         'Attempted to fetch a non-existent public '
                         'container name: '
                         f'{container_client.container_name}. '
                         'Please check if the name is correct.')
             raise e
         with ux_utils.print_exception_no_traceback():
             raise sky_exceptions.StorageBucketGetError(
                 'Failed to retrieve the container client for the '
1 change: 1 addition & 0 deletions sky/adaptors/kubernetes.py
@@ -56,6 +56,7 @@ def wrapped(*args, **kwargs):


 def _load_config(context: Optional[str] = None):
+    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
     try:
         # Load in-cluster config if running in a pod
         # Kubernetes-set environment variables for service discovery do not
