diff --git a/docs/source/_gallery_original/index.rst b/docs/source/_gallery_original/index.rst index 67f4eef11dc..e8a540c883c 100644 --- a/docs/source/_gallery_original/index.rst +++ b/docs/source/_gallery_original/index.rst @@ -1,3 +1,5 @@ +.. _ai-gallery: + AI Gallery ==================== diff --git a/docs/source/getting-started/installation.rst b/docs/source/getting-started/installation.rst index 42336bcd5cc..e5b318d4f87 100644 --- a/docs/source/getting-started/installation.rst +++ b/docs/source/getting-started/installation.rst @@ -312,9 +312,10 @@ Cudo Compute ~~~~~~~~~~~~~~~~~~ `Cudo Compute `__ GPU cloud provides low cost GPUs powered with green energy. - -1. Create an API Key by following `this guide `__. -2. Download and install the `cudoctl `__ command line tool +1. Create a billing account by following `this guide `__. +2. Create a project ``__. +3. Create an API Key by following `this guide `__. +3. Download and install the `cudoctl `__ command line tool 3. Run :code:`cudoctl init`: .. code-block:: shell @@ -326,7 +327,7 @@ Cudo Compute ✔ context: default config file saved ~/.config/cudo/cudo.yml - pip install "cudocompute>=0.1.8" + pip install "cudo-compute>=0.1.10" If you want to want to use skypilot with a different Cudo Compute account or project, just run :code:`cudoctl init`: again. diff --git a/docs/source/reference/kubernetes/index.rst b/docs/source/reference/kubernetes/index.rst index 1edfde01240..30b95b0d5ee 100644 --- a/docs/source/reference/kubernetes/index.rst +++ b/docs/source/reference/kubernetes/index.rst @@ -3,247 +3,107 @@ Running on Kubernetes ============================= -.. note:: - Kubernetes support is under active development. `Please share your feedback `_ - or `directly reach out to the development team `_ - for feature requests and more. - SkyPilot tasks can be run on your private on-prem or cloud Kubernetes clusters. 
The Kubernetes cluster gets added to the list of "clouds" in SkyPilot and SkyPilot tasks can be submitted to your Kubernetes cluster just like any other cloud provider. -**Benefits of using SkyPilot to run jobs on your Kubernetes cluster:** - -* Get SkyPilot features (setup management, job execution, queuing, logging, SSH access) on your Kubernetes resources -* Replace complex Kubernetes manifests with simple SkyPilot tasks -* Seamlessly "burst" jobs to the cloud if your Kubernetes cluster is congested -* Retain observability and control over your cluster with your existing Kubernetes tools - -**Supported Kubernetes deployments:** - -* Hosted Kubernetes services (EKS, GKE) -* On-prem clusters (Kubeadm, Rancher) -* Local development clusters (KinD, minikube) - - -Kubernetes Cluster Requirements +Why use SkyPilot on Kubernetes? ------------------------------- -To connect and use a Kubernetes cluster, SkyPilot needs: - -* An existing Kubernetes cluster running Kubernetes v1.20 or later. -* A `Kubeconfig `_ file containing access credentials and namespace to be used. - -In a typical workflow: - -1. A cluster administrator sets up a Kubernetes cluster. Detailed admin guides for - different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) are included in the :ref:`Kubernetes cluster setup guide `. - -2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig - files containing their credentials (`kube-context `_). - SkyPilot reads this Kubeconfig file to communicate with the cluster. - -Submitting SkyPilot tasks to Kubernetes Clusters ------------------------------------------------- -.. _kubernetes-instructions: - -Once your cluster administrator has :ref:`setup a Kubernetes cluster ` and provided you with a kubeconfig file: - -0. Make sure `kubectl `_, ``socat`` and ``nc`` (netcat) are installed on your local machine. - - .. 
code-block:: console - - $ # MacOS - $ brew install kubectl socat netcat - - $ # Linux (may have socat already installed) - $ sudo apt-get install kubectl socat netcat - - -1. Place your kubeconfig file at ``~/.kube/config``. - - .. code-block:: console - - $ mkdir -p ~/.kube - $ cp /path/to/kubeconfig ~/.kube/config - - You can verify your credentials are setup correctly by running :code:`kubectl get pods`. - -2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. - - .. code-block:: console - - $ sky check - - Checking credentials to enable clouds for SkyPilot. - ... - Kubernetes: enabled - ... - - - .. note:: - :code:`sky check` will also check if GPU support is available on your cluster. If GPU support is not available, it - will show the reason. - To setup GPU support on the cluster, refer to the :ref:`Kubernetes cluster setup guide `. - -4. You can now run any SkyPilot task on your Kubernetes cluster. - - .. code-block:: console +.. tab-set:: - $ sky launch --cpus 2+ task.yaml - == Optimizer == - Target: minimizing cost - Estimated cost: $0.0 / hour + .. tab-item:: For AI Developers + :sync: why-ai-devs-tab - Considered resources (1 node): - --------------------------------------------------------------------------------------------------- - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - --------------------------------------------------------------------------------------------------- - Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔ - AWS m6i.large 2 8 - us-east-1 0.10 - Azure Standard_D2s_v5 2 8 - eastus 0.10 - GCP n2-standard-2 2 8 - us-central1 0.10 - IBM bx2-8x32 8 32 - us-east 0.38 - Lambda gpu_1x_a10 30 200 A10:1 us-east-1 0.60 - ---------------------------------------------------------------------------------------------------. + .. grid:: 2 + :gutter: 3 + .. grid-item-card:: ✅ Ease of use + :text-align: center -.. 
note:: - SkyPilot will use the cluster and namespace set in the ``current-context`` in the - kubeconfig file. To manage your ``current-context``: + .. + TODO(romilb): We should have a comparison of a popular Kubernetes manifest vs a SkyPilot YAML in terms of LoC in a mini blog and link it here. - .. code-block:: console + No complex Kubernetes manifests - write a simple SkyPilot YAML and run with one command ``sky launch``. - $ # See current context - $ kubectl config current-context + .. grid-item-card:: 📋 Interactive development on Kubernetes + :text-align: center - $ # Switch current-context - $ kubectl config use-context mycontext + :ref:`SSH access to pods `, :ref:`VSCode integration `, :ref:`job management `, :ref:`autodown idle pods ` and more. - $ # Set a specific namespace to be used in the current-context - $ kubectl config set-context --current --namespace=mynamespace + .. grid-item-card:: ☁️ Burst to the cloud + :text-align: center + Kubernetes cluster is full? SkyPilot :ref:`seamlessly gets resources on the cloud ` to get your job running sooner. -Using Custom Images -------------------- -By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. + .. grid-item-card:: 🖼 Run popular models on Kubernetes + :text-align: center -To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. + Train and serve `Llama-3 `_, `Mixtral `_, and more on your Kubernetes cluster with ready-to-use recipes from the :ref:`AI gallery `. -.. code-block:: yaml - resources: - image_id: docker:myrepo/myimage:latest - ... + .. tab-item:: For Infrastructure Admins + :sync: why-admins-tab -Your image must satisfy the following requirements: + .. grid:: 2 + :gutter: 3 -* Image must be **debian-based** and must have the apt package manager installed. -* The default user in the image must have root privileges or passwordless sudo access. + .. 
grid-item-card:: ☁️ Unified platform for all Infrastructure + :text-align: center -.. note:: + Scale beyond your Kubernetes cluster to capacity :ref:`across clouds and regions ` without manual intervention. - If your cluster runs on non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at :code:`Start streaming logs ...`. See `GitHub issue `_ for more. + .. grid-item-card:: 🚯️ Minimize resource wastage + :text-align: center -Using Images from Private Repositories -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so: + SkyPilot can run with your custom pod scheduler and automatically terminate idle pods to free up resources for other users. -.. code-block:: yaml + .. grid-item-card:: 👀 Observability + :text-align: center - kubernetes: - pod_config: - spec: - imagePullSecrets: - - name: your-secret-here + Works with your existing observability and monitoring tools, such as the :ref:`Kubernetes Dashboard `. -.. tip:: + .. grid-item-card:: 🍽️ Self-serve infra for your teams + :text-align: center - If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using `k8s-ecr-login-renew `_ to automatically refresh your secrets. + Reduce operational overhead by letting your teams provision their own resources, while you retain control over the Kubernetes cluster. -Opening Ports -------------- +Table of Contents +----------------- -Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes: +.. grid:: 3 + :gutter: 3 -1. `LoadBalancer services `_ (default) -2. `Nginx IngressController `_ + .. 
grid-item-card:: 👋 Get Started + :link: kubernetes-getting-started + :link-type: ref + :text-align: center -One of these modes must be supported and configured on your cluster. Refer to the :ref:`setting up ports on Kubernetes guide ` on how to do this. + Already have a kubeconfig? Launch your first SkyPilot task on Kubernetes - it's as simple as ``sky launch``. -.. tip:: + .. grid-item-card:: ⚙️ Cluster Configuration + :link: kubernetes-setup + :link-type: ref + :text-align: center - On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed. + Are you a cluster admin? Find cluster deployment guides and setup instructions here. -Once your cluster is configured, launch a task which exposes services on a port by adding :code:`ports` to the :code:`resources` section of your task YAML. + .. grid-item-card:: 🔍️ Troubleshooting + :link: kubernetes-troubleshooting + :link-type: ref + :text-align: center -.. code-block:: yaml + Running into problems with SkyPilot on your Kubernetes cluster? Find common issues and solutions here. - # task.yaml - resources: - ports: 8888 - run: | - python -m http.server 8888 - -After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can get the URL to access the port using :code:`sky status --endpoints myclus`. - -.. code-block:: bash - - # List all ports exposed by the cluster - $ sky status --endpoints myclus - 8888: 34.173.13.241:8888 - - # curl a specific port's endpoint - $ curl $(sky status --endpoint 8888 myclus) - ... - -.. tip:: - - To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports `. - -FAQs ----- - -* **Are autoscaling Kubernetes clusters supported?** - - To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. 
We are working on a better interface to adjust this timeout - stay tuned! - -* **Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?** - - The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster. - -* **I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?** - - For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. SkyPilot will use the namespace set in the kubeconfig for running all tasks. - -* **How can I specify custom configuration for the pods created by SkyPilot?** - - You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`. - The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API `_. - - For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file: - - .. code-block:: yaml +.. toctree:: + :hidden: - kubernetes: - pod_config: - spec: - containers: - - env: - - name: MY_ENV_VAR - value: MY_ENV_VALUE - volumeMounts: # Custom volume mounts for the pod - - mountPath: /foo - name: example-volume - volumes: - - name: example-volume - hostPath: - path: /tmp - type: Directory + kubernetes-getting-started + kubernetes-setup + kubernetes-troubleshooting - For more details refer to :ref:`config-yaml`. Features and Roadmap -------------------- @@ -256,11 +116,4 @@ Kubernetes support is under active development. Some features are in progress an * Multi-node tasks - ✅ Available * Custom images - ✅ Available * Opening ports and exposing services - ✅ Available -* Multiple Kubernetes Clusters - 🚧 In progress - - -.. 
toctree:: - :hidden: - - kubernetes-setup - kubernetes-troubleshooting +* Multiple Kubernetes Clusters - 🚧 In progress \ No newline at end of file diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst new file mode 100644 index 00000000000..c2162da3779 --- /dev/null +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -0,0 +1,235 @@ +.. _kubernetes-getting-started: + +Getting Started on Kubernetes +============================= + +Prerequisites +------------- + +To connect to and use a Kubernetes cluster, SkyPilot needs: + +* An existing Kubernetes cluster running Kubernetes v1.20 or later. +* A `Kubeconfig `_ file containing access credentials and namespace to be used. + +**Supported Kubernetes deployments:** + +* Hosted Kubernetes services (EKS, GKE) +* On-prem clusters (Kubeadm, Rancher, K3s) +* Local development clusters (KinD, minikube) + +In a typical workflow: + +1. A cluster administrator sets up a Kubernetes cluster. Detailed admin guides for + different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) are included in the :ref:`Kubernetes cluster setup guide `. + +2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig + files containing their credentials (`kube-context `_). + SkyPilot reads this Kubeconfig file to communicate with the cluster. + +Launching your first task +------------------------- +.. _kubernetes-instructions: + +Once your cluster administrator has :ref:`set up a Kubernetes cluster ` and provided you with a kubeconfig file: + +0. Make sure `kubectl `_, ``socat`` and ``nc`` (netcat) are installed on your local machine. + + .. code-block:: console + + $ # macOS + $ brew install kubectl socat netcat + + $ # Linux (may have socat already installed) + $ sudo apt-get install kubectl socat netcat + + +1. Place your kubeconfig file at ``~/.kube/config``. + + .. 
code-block:: console + + $ mkdir -p ~/.kube + $ cp /path/to/kubeconfig ~/.kube/config + + You can verify your credentials are set up correctly by running :code:`kubectl get pods`. + +2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. + + .. code-block:: console + + $ sky check + + Checking credentials to enable clouds for SkyPilot. + ... + Kubernetes: enabled + ... + + + .. note:: + :code:`sky check` will also check if GPU support is available on your cluster. If GPU support is not available, it + will show the reason. + To set up GPU support on the cluster, refer to the :ref:`Kubernetes cluster setup guide `. + +.. _kubernetes-optimizer-table: + +3. You can now run any SkyPilot task on your Kubernetes cluster. + + .. code-block:: console + + $ sky launch --cpus 2+ task.yaml + == Optimizer == + Target: minimizing cost + Estimated cost: $0.0 / hour + + Considered resources (1 node): + --------------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + --------------------------------------------------------------------------------------------------- + Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔ + AWS m6i.large 2 8 - us-east-1 0.10 + Azure Standard_D2s_v5 2 8 - eastus 0.10 + GCP n2-standard-2 2 8 - us-central1 0.10 + IBM bx2-8x32 8 32 - us-east 0.38 + Lambda gpu_1x_a10 30 200 A10:1 us-east-1 0.60 + ---------------------------------------------------------------------------------------------------. + + +.. note:: + SkyPilot will use the cluster and namespace set in the ``current-context`` in the + kubeconfig file. To manage your ``current-context``: + + .. 
code-block:: console + + $ # See current context + $ kubectl config current-context + + $ # Switch current-context + $ kubectl config use-context mycontext + + $ # Set a specific namespace to be used in the current-context + $ kubectl config set-context --current --namespace=mynamespace + + +Using Custom Images +------------------- +By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. + +To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. + +.. code-block:: yaml + + resources: + image_id: docker:myrepo/myimage:latest + ... + +Your image must satisfy the following requirements: + +* Image must be **debian-based** and must have the apt package manager installed. +* The default user in the image must have root privileges or passwordless sudo access. + +.. note:: + + If your cluster runs on non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at :code:`Start streaming logs ...`. See `GitHub issue `_ for more. + +Using Images from Private Repositories +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so: + +.. code-block:: yaml + + kubernetes: + pod_config: + spec: + imagePullSecrets: + - name: your-secret-here + +.. tip:: + + If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using `k8s-ecr-login-renew `_ to automatically refresh your secrets. + + +Opening Ports +------------- + +Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes: + +1. `LoadBalancer services `_ (default) +2. `Nginx IngressController `_ + +One of these modes must be supported and configured on your cluster. 
Refer to the :ref:`setting up ports on Kubernetes guide ` on how to do this. + +.. tip:: + + On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed. + +Once your cluster is configured, launch a task which exposes services on a port by adding :code:`ports` to the :code:`resources` section of your task YAML. + +.. code-block:: yaml + + # task.yaml + resources: + ports: 8888 + + run: | + python -m http.server 8888 + +After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can get the URL to access the port using :code:`sky status --endpoints myclus`. + +.. code-block:: bash + + # List all ports exposed by the cluster + $ sky status --endpoints myclus + 8888: 34.173.13.241:8888 + + # curl a specific port's endpoint + $ curl $(sky status --endpoint 8888 myclus) + ... + +.. tip:: + + To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports `. + +FAQs +---- + +* **Are autoscaling Kubernetes clusters supported?** + + To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. We are working on a better interface to adjust this timeout - stay tuned! + +* **Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?** + + The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster. + +* **I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?** + + For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. 
SkyPilot will use the namespace set in the kubeconfig for running all tasks. + +* **How do I view the pods created by SkyPilot on my Kubernetes cluster?** + + You can use your existing observability tools to filter resources with the label :code:`parent=skypilot` (:code:`kubectl get pods -l 'parent=skypilot'`). As an example, follow the instructions :ref:`here ` to deploy the Kubernetes Dashboard on your cluster. + +* **How can I specify custom configuration for the pods created by SkyPilot?** + + You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`. + The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API `_. + + For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file: + + .. code-block:: yaml + + kubernetes: + pod_config: + spec: + containers: + - env: + - name: MY_ENV_VAR + value: MY_ENV_VALUE + volumeMounts: # Custom volume mounts for the pod + - mountPath: /foo + name: example-volume + volumes: + - name: example-volume + hostPath: + path: /tmp + type: Directory + + For more details refer to :ref:`config-yaml`. diff --git a/examples/stable_diffusion/README.md b/examples/stable_diffusion/README.md index 9bec3354848..2a4383f1347 100644 --- a/examples/stable_diffusion/README.md +++ b/examples/stable_diffusion/README.md @@ -8,7 +8,7 @@ 4. Run `ssh -L 7860:localhost:7860 stable-diffusion` -5. Open [`http://localhost:7860/`](http://localhost:7860/) in browser. +5. Open [`http://localhost:7860/`](http://localhost:7860/) in browser. If the page doesn't load, try again in a few minutes to allow the container to start. 6. Type in text prompt and click "Generate". 
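The README note above (the page may not load until the container has started) can be automated instead of retried by hand. The following is a minimal sketch and is not part of the SkyPilot example: the helper names `wait_for_http` and `_probe` and all of their parameters are hypothetical, and it assumes the SSH tunnel from step 4 is already forwarding port 7860.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def _probe(url: str) -> bool:
    """Return True if anything answers HTTP at `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        # The server replied with a non-2xx status, so it is up.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, or timed out: not ready yet.
        return False


def wait_for_http(url: str,
                  timeout_s: float = 300.0,
                  interval_s: float = 5.0,
                  probe: Callable[[str], bool] = _probe,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until it responds or `timeout_s` elapses.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a real server or real waiting.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_http('http://localhost:7860/')` after opening the tunnel would block until the web UI answers or five minutes pass; the timeout and polling interval are arbitrary defaults chosen for illustration.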
diff --git a/examples/stable_diffusion/stable_diffusion_docker.yaml b/examples/stable_diffusion/stable_diffusion_docker.yaml index 47499fa2ea4..1a830241f14 100644 --- a/examples/stable_diffusion/stable_diffusion_docker.yaml +++ b/examples/stable_diffusion/stable_diffusion_docker.yaml @@ -7,8 +7,6 @@ file_mounts: /stable_diffusion: . setup: | - sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose - sudo chmod +x /usr/local/bin/docker-compose cd stable-diffusion-webui-docker sudo rm -r stable-diffusion-webui-docker git clone https://github.com/AbdBarho/stable-diffusion-webui-docker.git @@ -17,9 +15,16 @@ setup: | wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt -P models mv models/sd-v1-4.ckpt models/model.ckpt docker pull berkeleyskypilot/stable-diffusion - rm docker-compose.yml - cp /stable_diffusion/docker-compose.yml . run: | cd stable-diffusion-webui-docker - docker-compose up + docker run -d \ + --name model \ + --restart on-failure \ + -p 7860:7860 \ + -v $(pwd)/cache:/cache \ + -v $(pwd)/output:/output \ + -v $(pwd)/models:/models \ + -e CLI_ARGS='--extra-models-cpu --optimized-turbo' \ + --gpus all \ + berkeleyskypilot/stable-diffusion diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 6933d3dfa0d..f4d141043bb 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -1,5 +1,4 @@ """Backend: runs on cloud virtual machines, managed by Ray.""" -import base64 import copy import enum import functools @@ -10,6 +9,7 @@ import os import pathlib import re +import shlex import signal import subprocess import sys @@ -141,9 +141,9 @@ # https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h # # If a user have very long run or setup commands, the generated command may -# exceed the limit, as we encode the script in base64 and directly 
include it in -# the job submission command. If the command is too long, we instead write it to -# a file, rsync and execute it. +# exceed the limit, as we directly include scripts in job submission commands. +# If the command is too long, we instead write it to a file, rsync and execute +# it. # # We use 120KB as a threshold to be safe for other arguments that # might be added during ssh. @@ -3149,8 +3149,7 @@ def _setup_node(node_id: int) -> None: runner = runners[node_id] setup_script = log_lib.make_task_bash_script(setup, env_vars=setup_envs) - encoded_script = base64.b64encode( - setup_script.encode('utf-8')).decode('utf-8') + encoded_script = shlex.quote(setup_script) if (detach_setup or len(encoded_script) > _MAX_INLINE_SCRIPT_LENGTH): with tempfile.NamedTemporaryFile('w', prefix='sky_setup_') as f: @@ -3163,9 +3162,8 @@ def _setup_node(node_id: int) -> None: stream_logs=False) create_script_code = 'true' else: - create_script_code = ( - f'{{ echo "{encoded_script}" | base64 --decode > ' - f'{remote_setup_file_name}; }}') + create_script_code = (f'{{ echo {encoded_script} > ' + f'{remote_setup_file_name}; }}') if detach_setup: return @@ -3251,10 +3249,8 @@ def _exec_code_on_head( mkdir_code = (f'{cd} && mkdir -p {remote_log_dir} && ' f'touch {remote_log_path}') - encoded_script = base64.b64encode( - codegen.encode('utf-8')).decode('utf-8') - create_script_code = (f'{{ echo "{encoded_script}" | base64 --decode > ' - f'{script_path}; }}') + encoded_script = shlex.quote(codegen) + create_script_code = (f'{{ echo {encoded_script} > {script_path}; }}') job_submit_cmd = ( f'RAY_DASHBOARD_PORT=$({constants.SKY_PYTHON_CMD} -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265);' # pylint: disable=line-too-long f'{cd} && {constants.SKY_RAY_CMD} job submit ' diff --git a/sky/cli.py b/sky/cli.py index 9a45a35ae55..0bcec3d2f4b 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -2966,6 +2966,15 @@ def show_gpus( To show all 
regions for a specified accelerator, use ``sky show-gpus --all-regions``. + If ``--region`` or ``--all-regions`` is not specified, the price displayed + for each instance type is the lowest across all regions for both on-demand + and spot instances. There may be multiple regions with the same lowest + price. + + If ``--cloud kubernetes`` is specified, it will show the maximum quantities + of the GPU available on a single node and the real-time availability of + the GPU across all nodes in the Kubernetes cluster. + Definitions of certain fields: * ``DEVICE_MEM``: Memory of a single device; does not depend on the device @@ -2973,10 +2982,15 @@ def show_gpus( * ``HOST_MEM``: Memory of the host instance (VM). - If ``--region`` or ``--all-regions`` is not specified, the price displayed - for each instance type is the lowest across all regions for both on-demand - and spot instances. There may be multiple regions with the same lowest - price. + * ``QTY_PER_NODE`` (Kubernetes only): GPU quantities that can be requested + on a single node. + + * ``TOTAL_GPUS`` (Kubernetes only): Total number of GPUs available in the + Kubernetes cluster. + + * ``TOTAL_FREE_GPUS`` (Kubernetes only): Number of currently free GPUs + in the Kubernetes cluster. This is fetched in real-time and may change + when other users are using the cluster. 
""" # validation for the --region flag if region is not None and cloud is None: @@ -2999,9 +3013,64 @@ def show_gpus( if show_all and accelerator_str is not None: raise click.UsageError('--all is only allowed without a GPU name.') + # Kubernetes specific bools + cloud_is_kubernetes = isinstance(cloud_obj, sky_clouds.Kubernetes) + kubernetes_autoscaling = kubernetes_utils.get_autoscaler_type() is not None + kubernetes_is_enabled = sky_clouds.cloud_in_iterable( + sky_clouds.Kubernetes(), global_user_state.get_cached_enabled_clouds()) + + if cloud_is_kubernetes and region is not None: + raise click.UsageError( + 'The --region flag cannot be set with --cloud kubernetes.') + def _list_to_str(lst): return ', '.join([str(e) for e in lst]) + def _get_kubernetes_realtime_gpu_table( + name_filter: Optional[str] = None, + quantity_filter: Optional[int] = None): + if quantity_filter: + qty_header = 'QTY_FILTER' + free_header = 'FILTERED_FREE_GPUS' + else: + qty_header = 'QTY_PER_NODE' + free_header = 'TOTAL_FREE_GPUS' + realtime_gpu_table = log_utils.create_table( + ['GPU', qty_header, 'TOTAL_GPUS', free_header]) + counts, capacity, available = service_catalog.list_accelerator_realtime( + gpus_only=True, + clouds='kubernetes', + name_filter=name_filter, + region_filter=region, + quantity_filter=quantity_filter, + case_sensitive=False) + assert (set(counts.keys()) == set(capacity.keys()) == set( + available.keys())), (f'Keys of counts ({list(counts.keys())}), ' + f'capacity ({list(capacity.keys())}), ' + f'and available ({list(available.keys())}) ' + 'must be same.') + if len(counts) == 0: + err_msg = 'No GPUs found in Kubernetes cluster. ' + debug_msg = 'To further debug, run: sky check ' + if name_filter is not None: + gpu_info_msg = f' {name_filter!r}' + if quantity_filter is not None: + gpu_info_msg += (' with requested quantity' + f' {quantity_filter}') + err_msg = (f'Resources{gpu_info_msg} not found ' + 'in Kubernetes cluster. 
') + debug_msg = ('To show available accelerators on kubernetes,' + ' run: sky show-gpus --cloud kubernetes ') + full_err_msg = (err_msg + kubernetes_utils.NO_GPU_HELP_MESSAGE + + debug_msg) + raise ValueError(full_err_msg) + for gpu, _ in sorted(counts.items()): + realtime_gpu_table.add_row([ + gpu, + _list_to_str(counts.pop(gpu)), capacity[gpu], available[gpu] + ]) + return realtime_gpu_table + def _output(): gpu_table = log_utils.create_table( ['COMMON_GPU', 'AVAILABLE_QUANTITIES']) @@ -3012,35 +3081,69 @@ def _output(): name, quantity = None, None - # Kubernetes specific bools - cloud_is_kubernetes = isinstance(cloud_obj, sky_clouds.Kubernetes) - kubernetes_autoscaling = kubernetes_utils.get_autoscaler_type( - ) is not None + # Optimization - do not poll for Kubernetes API for fetching + # common GPUs because that will be fetched later for the table after + # common GPUs. + clouds_to_list = cloud + if cloud is None: + clouds_to_list = [ + c for c in service_catalog.ALL_CLOUDS if c != 'kubernetes' + ] + k8s_messages = '' if accelerator_str is None: + # Collect k8s related messages in k8s_messages and print them at end + print_section_titles = False + # If cloud is kubernetes, we want to show real-time capacity + if kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes): + try: + # If --cloud kubernetes is not specified, we want to catch + # the case where no GPUs are available on the cluster and + # print the warning at the end. 
+ k8s_realtime_table = _get_kubernetes_realtime_gpu_table() + except ValueError as e: + if not cloud_is_kubernetes: + # Make it a note if cloud is not kubernetes + k8s_messages += 'Note: ' + k8s_messages += str(e) + else: + print_section_titles = True + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + yield from k8s_realtime_table.get_string() + if kubernetes_autoscaling: + k8s_messages += ( + '\n' + kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE) + if cloud_is_kubernetes: + # Do not show clouds if --cloud kubernetes is specified + if not kubernetes_is_enabled: + yield ('Kubernetes is not enabled. To fix, run: ' + 'sky check kubernetes ') + yield k8s_messages + return + + # For show_all, show the k8s message at the start since output is + # long and the user may not scroll to the end. + if show_all and k8s_messages: + yield k8s_messages + yield '\n\n' + result = service_catalog.list_accelerator_counts( gpus_only=True, - clouds=cloud, + clouds=clouds_to_list, region_filter=region, ) - if len(result) == 0 and cloud_is_kubernetes: - yield kubernetes_utils.NO_GPU_ERROR_MESSAGE - if kubernetes_autoscaling: - yield '\n' - yield kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE - return + if print_section_titles: + # If section titles were printed above, print again here + yield '\n\n' + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Cloud GPUs{colorama.Style.RESET_ALL}\n') # "Common" GPUs - # If cloud is kubernetes, we want to show all GPUs here, even if - # they are not listed as common in SkyPilot. 
- if cloud_is_kubernetes: - for gpu, _ in sorted(result.items()): + for gpu in service_catalog.get_common_gpus(): + if gpu in result: gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))]) - else: - for gpu in service_catalog.get_common_gpus(): - if gpu in result: - gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))]) yield from gpu_table.get_string() # Google TPUs @@ -3058,16 +3161,12 @@ def _output(): other_table.add_row([gpu, _list_to_str(qty)]) yield from other_table.get_string() yield '\n\n' - if (cloud_is_kubernetes or - cloud is None) and kubernetes_autoscaling: - yield kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE - yield '\n\n' else: yield ('\n\nHint: use -a/--all to see all accelerators ' '(including non-common ones) and pricing.') - if (cloud_is_kubernetes or - cloud is None) and kubernetes_autoscaling: - yield kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE + if k8s_messages: + yield '\n' + yield k8s_messages return else: # Parse accelerator string @@ -3091,12 +3190,40 @@ def _output(): else: name, quantity = accelerator_str, None + print_section_titles = False + if (kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes) and + not show_all): + # Print section title if not showing all and instead a specific + # accelerator is requested + print_section_titles = True + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + try: + k8s_realtime_table = _get_kubernetes_realtime_gpu_table( + name_filter=name, quantity_filter=quantity) + yield from k8s_realtime_table.get_string() + except ValueError as e: + # In the case of a specific accelerator, show the error message + # immediately (e.g., "Resources H100 not found ...") + yield str(e) + if kubernetes_autoscaling: + k8s_messages += ('\n' + + kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE) + yield k8s_messages + if cloud_is_kubernetes: + # Do not show clouds if --cloud kubernetes is specified + if not kubernetes_is_enabled: + yield ('Kubernetes is not 
enabled. To fix, run: ' + 'sky check kubernetes ') + return + + # For clouds other than Kubernetes, get the accelerator details # Case-sensitive result = service_catalog.list_accelerators(gpus_only=True, name_filter=name, quantity_filter=quantity, region_filter=region, - clouds=cloud, + clouds=clouds_to_list, case_sensitive=False, all_regions=all_regions) # Import here to save module load speed. @@ -3128,16 +3255,17 @@ def _output(): new_result[gpu] = sorted_dataclasses result = new_result - if len(result) == 0: - if cloud == 'kubernetes': - yield kubernetes_utils.NO_GPU_ERROR_MESSAGE - return + if print_section_titles and not show_all: + yield '\n\n' + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Cloud GPUs{colorama.Style.RESET_ALL}\n') + if len(result) == 0: quantity_str = (f' with requested quantity {quantity}' if quantity else '') - yield f'Resources \'{name}\'{quantity_str} not found. ' - yield 'Try \'sky show-gpus --all\' ' - yield 'to show available accelerators.' + cloud_str = f' on {cloud_obj}.' if cloud else ' in cloud catalogs.' 
+ yield f'Resources \'{name}\'{quantity_str} not found{cloud_str} ' + yield 'To show available accelerators, run: sky show-gpus --all' return for i, (gpu, items) in enumerate(result.items()): diff --git a/sky/clouds/cudo.py b/sky/clouds/cudo.py index 7b4c13699d1..1a32bb0bd2c 100644 --- a/sky/clouds/cudo.py +++ b/sky/clouds/cudo.py @@ -46,6 +46,15 @@ class Cudo(clouds.Cloud): 'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html' ) + _PROJECT_HINT = ( + 'Create a project and then set it as the default project:\n' + f'{_INDENT_PREFIX} $ cudoctl projects create my-project-name\n' + f'{_INDENT_PREFIX} $ cudoctl init\n' + f'{_INDENT_PREFIX}For more info: ' + # pylint: disable=line-too-long + 'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html' + ) + _CLOUD_UNSUPPORTED_FEATURES = { clouds.CloudImplementationFeatures.STOP: 'Stopping not supported.', clouds.CloudImplementationFeatures.SPOT_INSTANCE: @@ -289,7 +298,9 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]: project_id, error = cudo_api.get_project_id() if error is not None: return False, ( - f'Error getting project ' + f'Default project is not set. ' + f'{cls._PROJECT_HINT}\n' + f'{cls._INDENT_PREFIX}' f'{common_utils.format_exception(error, use_bracket=True)}') try: api = cudo_api.virtual_machines() diff --git a/sky/clouds/service_catalog/__init__.py b/sky/clouds/service_catalog/__init__.py index d380cce6757..7479cd77cf7 100644 --- a/sky/clouds/service_catalog/__init__.py +++ b/sky/clouds/service_catalog/__init__.py @@ -117,6 +117,46 @@ def list_accelerator_counts( return ret +def list_accelerator_realtime( + gpus_only: bool = True, + name_filter: Optional[str] = None, + region_filter: Optional[str] = None, + quantity_filter: Optional[int] = None, + clouds: CloudFilter = None, + case_sensitive: bool = True, +) -> Tuple[Dict[str, List[int]], Dict[str, int], Dict[str, int]]: + """List all accelerators offered by Sky with their realtime availability.
+ + Realtime availability is the total number of accelerators in the cluster + and number of accelerators available at the time of the call. + + Used for fixed size cluster settings, such as Kubernetes. + + Returns: + A tuple of three dictionaries mapping canonical accelerator names to: + - A list of available counts. (e.g., [1, 2, 4]) + - Total number of accelerators in the cluster (capacity). + - Number of accelerators available at the time of call (availability). + """ + qtys_map, total_accelerators_capacity, total_accelerators_available = ( + _map_clouds_catalog(clouds, + 'list_accelerators_realtime', + gpus_only, + name_filter, + region_filter, + quantity_filter, + case_sensitive=case_sensitive, + all_regions=False, + require_price=False)) + accelerator_counts: Dict[str, List[int]] = collections.defaultdict(list) + for gpu, items in qtys_map.items(): + for item in items: + accelerator_counts[gpu].append(item.accelerator_count) + accelerator_counts[gpu] = sorted(accelerator_counts[gpu]) + return (accelerator_counts, total_accelerators_capacity, + total_accelerators_available) + + def instance_type_exists(instance_type: str, clouds: CloudFilter = None) -> bool: """Check the existence of a instance type.""" diff --git a/sky/clouds/service_catalog/kubernetes_catalog.py b/sky/clouds/service_catalog/kubernetes_catalog.py index bd44847016e..602e19b5ff0 100644 --- a/sky/clouds/service_catalog/kubernetes_catalog.py +++ b/sky/clouds/service_catalog/kubernetes_catalog.py @@ -3,6 +3,7 @@ Kubernetes does not require a catalog of instances, but we need an image catalog mapping SkyPilot image tags to corresponding container image tags. 
""" +import re import typing from typing import Dict, List, Optional, Set, Tuple @@ -46,38 +47,109 @@ def list_accelerators( case_sensitive: bool = True, all_regions: bool = False, require_price: bool = True) -> Dict[str, List[common.InstanceTypeInfo]]: + # TODO(romilb): We should consider putting a lru_cache() with TTL to + # avoid multiple calls to kubernetes API in a short period of time (e.g., + # from the optimizer). + return list_accelerators_realtime(gpus_only, name_filter, region_filter, + quantity_filter, case_sensitive, + all_regions, require_price)[0] + + +def list_accelerators_realtime( + gpus_only: bool, + name_filter: Optional[str], + region_filter: Optional[str], + quantity_filter: Optional[int], + case_sensitive: bool = True, + all_regions: bool = False, + require_price: bool = True +) -> Tuple[Dict[str, List[common.InstanceTypeInfo]], Dict[str, int], Dict[str, + int]]: del all_regions, require_price # Unused. k8s_cloud = Kubernetes() if not any( map(k8s_cloud.is_same_cloud, sky_check.get_cached_enabled_clouds_or_refresh()) ) or not kubernetes_utils.check_credentials()[0]: - return {} + return {}, {}, {} has_gpu = kubernetes_utils.detect_gpu_resource() if not has_gpu: - return {} + return {}, {}, {} label_formatter, _ = kubernetes_utils.detect_gpu_label_formatter() if not label_formatter: - return {} + return {}, {}, {} - accelerators: Set[Tuple[str, int]] = set() + accelerators_qtys: Set[Tuple[str, int]] = set() key = label_formatter.get_label_key() nodes = kubernetes_utils.get_kubernetes_nodes() + # Get the pods to get the real-time GPU usage + pods = kubernetes_utils.get_kubernetes_pods() + # Total number of GPUs in the cluster + total_accelerators_capacity: Dict[str, int] = {} + # Total number of GPUs currently available in the cluster + total_accelerators_available: Dict[str, int] = {} + min_quantity_filter = quantity_filter if quantity_filter else 1 + for node in nodes: if key in node.metadata.labels: + allocated_qty = 0 accelerator_name = 
label_formatter.get_accelerator_from_label_value( node.metadata.labels.get(key)) + + # Check if name_filter regex matches the accelerator_name + regex_flags = 0 if case_sensitive else re.IGNORECASE + if name_filter and not re.match( + name_filter, accelerator_name, flags=regex_flags): + continue + accelerator_count = int( node.status.allocatable.get('nvidia.com/gpu', 0)) + # Generate the GPU quantities for the accelerators if accelerator_name and accelerator_count > 0: for count in range(1, accelerator_count + 1): - accelerators.add((accelerator_name, count)) + accelerators_qtys.add((accelerator_name, count)) + + for pod in pods: + # Get all the pods running on the node + if (pod.spec.node_name == node.metadata.name and + pod.status.phase in ['Running', 'Pending']): + # Iterate over all the containers in the pod and sum the + # GPU requests + for container in pod.spec.containers: + if container.resources.requests: + allocated_qty += int( + container.resources.requests.get( + 'nvidia.com/gpu', 0)) + + accelerators_available = accelerator_count - allocated_qty + + if accelerator_count >= min_quantity_filter: + quantized_count = (min_quantity_filter * + (accelerator_count // min_quantity_filter)) + if accelerator_name not in total_accelerators_capacity: + total_accelerators_capacity[ + accelerator_name] = quantized_count + else: + total_accelerators_capacity[ + accelerator_name] += quantized_count + + if accelerators_available >= min_quantity_filter: + quantized_availability = min_quantity_filter * ( + accelerators_available // min_quantity_filter) + if accelerator_name not in total_accelerators_available: + total_accelerators_available[ + accelerator_name] = quantized_availability + else: + total_accelerators_available[ + accelerator_name] += quantized_availability result = [] - for accelerator_name, accelerator_count in accelerators: + + # Generate dataframe for common.list_accelerators_impl + for accelerator_name, accelerator_count in accelerators_qtys: 
result.append( common.InstanceTypeInfo(cloud='Kubernetes', instance_type=None, @@ -98,9 +170,13 @@ def list_accelerators( ]) df['GpuInfo'] = True - return common.list_accelerators_impl('Kubernetes', df, gpus_only, - name_filter, region_filter, - quantity_filter, case_sensitive) + # Use common.list_accelerators_impl to get InstanceTypeInfo objects used + # by sky show-gpus when cloud is not specified. + qtys_map = common.list_accelerators_impl('Kubernetes', df, gpus_only, + name_filter, region_filter, + quantity_filter, case_sensitive) + + return qtys_map, total_accelerators_capacity, total_accelerators_available def validate_region_zone( diff --git a/sky/provision/cudo/cudo_wrapper.py b/sky/provision/cudo/cudo_wrapper.py index b60db7fd625..691c69bda8c 100644 --- a/sky/provision/cudo/cudo_wrapper.py +++ b/sky/provision/cudo/cudo_wrapper.py @@ -30,7 +30,7 @@ def launch(name: str, data_center_id: str, ssh_key: str, machine_type: str, try: api = cudo.cudo.cudo_api.virtual_machines() - vm = api.create_vm(cudo.cudo.cudo_api.project_id(), request) + vm = api.create_vm(cudo.cudo.cudo_api.project_id_throwable(), request) return vm.to_dict()['id'] except cudo.cudo.rest.ApiException as e: raise e @@ -52,7 +52,7 @@ def remove(instance_id: str): retry_interval = 5 retry_count = 0 state = 'unknown' - project_id = cudo.cudo.cudo_api.project_id() + project_id = cudo.cudo.cudo_api.project_id_throwable() while retry_count < max_retries: try: vm = api.get_vm(project_id, instance_id) @@ -91,7 +91,7 @@ def set_tags(instance_id: str, tags: Dict): def get_instance(vm_id): try: api = cudo.cudo.cudo_api.virtual_machines() - vm = api.get_vm(cudo.cudo.cudo_api.project_id(), vm_id) + vm = api.get_vm(cudo.cudo.cudo_api.project_id_throwable(), vm_id) vm_dict = vm.to_dict() return vm_dict except cudo.cudo.rest.ApiException as e: @@ -101,7 +101,7 @@ def get_instance(vm_id): def list_instances(): try: api = cudo.cudo.cudo_api.virtual_machines() - vms = api.list_vms(cudo.cudo.cudo_api.project_id()) 
+ vms = api.list_vms(cudo.cudo.cudo_api.project_id_throwable()) instances = {} for vm in vms.to_dict()['vms']: ex_ip = vm['external_ip_address'] diff --git a/sky/provision/kubernetes/network.py b/sky/provision/kubernetes/network.py index 61870cb9119..875547e7677 100644 --- a/sky/provision/kubernetes/network.py +++ b/sky/provision/kubernetes/network.py @@ -8,8 +8,8 @@ from sky.utils import kubernetes_enums from sky.utils.resources_utils import port_ranges_to_set -_PATH_PREFIX = '/skypilot/{cluster_name_on_cloud}/{port}' -_LOADBALANCER_SERVICE_NAME = '{cluster_name_on_cloud}-skypilot-loadbalancer' +_PATH_PREFIX = '/skypilot/{namespace}/{cluster_name_on_cloud}/{port}' +_LOADBALANCER_SERVICE_NAME = '{cluster_name_on_cloud}--skypilot-lb' def open_ports( @@ -73,13 +73,14 @@ def _open_ports_using_ingress( 'https://github.com/kubernetes/ingress-nginx/blob/main/docs/deploy/index.md.' # pylint: disable=line-too-long ) - # Prepare service names, ports, for template rendering - service_details = [ - (f'{cluster_name_on_cloud}-skypilot-service--{port}', port, - _PATH_PREFIX.format(cluster_name_on_cloud=cluster_name_on_cloud, - port=port).rstrip('/').lstrip('/')) - for port in ports - ] + # Prepare service names, ports, for template rendering + service_details = [(f'{cluster_name_on_cloud}--skypilot-svc--{port}', port, + _PATH_PREFIX.format( + cluster_name_on_cloud=cluster_name_on_cloud, + port=port, + namespace=kubernetes_utils. 
+ get_current_kube_config_context_namespace()).rstrip( + '/').lstrip('/')) for port in ports] # Generate ingress and services specs # We batch ingress rule creation because each rule triggers a hot reload of @@ -160,7 +161,7 @@ def _cleanup_ports_for_ingress( ) -> None: # Delete services for each port for port in ports: - service_name = f'{cluster_name_on_cloud}-skypilot-service--{port}' + service_name = f'{cluster_name_on_cloud}--skypilot-svc--{port}' network_utils.delete_namespaced_service( namespace=provider_config.get('namespace', 'default'), service_name=service_name, @@ -245,7 +246,10 @@ def _query_ports_for_ingress( result: Dict[int, List[common.Endpoint]] = {} for port in ports: path_prefix = _PATH_PREFIX.format( - cluster_name_on_cloud=cluster_name_on_cloud, port=port) + cluster_name_on_cloud=cluster_name_on_cloud, + port=port, + namespace=kubernetes_utils. + get_current_kube_config_context_namespace()) http_port, https_port = external_ports \ if external_ports is not None else (None, None) diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index d5140d8846b..d5f91f639f6 100644 --- a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -35,10 +35,10 @@ 'T': 2**40, 'P': 2**50, } -NO_GPU_ERROR_MESSAGE = 'No GPUs found in Kubernetes cluster. \ -If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs \ -(e.g., skypilot.co/accelerator) are setup correctly. \ -To further debug, run: sky check.' +NO_GPU_HELP_MESSAGE = ('If your cluster contains GPUs, make sure ' + 'nvidia.com/gpu resource is available on the nodes and ' + 'the node labels for identifying GPUs ' + '(e.g., skypilot.co/accelerator) are setup correctly. ') KUBERNETES_AUTOSCALER_NOTE = ( 'Note: Kubernetes cluster autoscaling is enabled. 
' @@ -280,6 +280,18 @@ def get_kubernetes_nodes() -> List[Any]: return nodes +def get_kubernetes_pods() -> List[Any]: + try: + ns = get_current_kube_config_context_namespace() + pods = kubernetes.core_api().list_namespaced_pod( + ns, _request_timeout=kubernetes.API_TIMEOUT).items + except kubernetes.max_retry_error(): + raise exceptions.ResourcesUnavailableError( + 'Timed out when trying to get pod info from Kubernetes cluster. ' + 'Please check if the cluster is healthy and retry.') from None + return pods + + def check_instance_fits(instance: str) -> Tuple[bool, Optional[str]]: """Checks if the instance fits on the Kubernetes cluster. diff --git a/sky/setup_files/setup.py b/sky/setup_files/setup.py index adde7d6ab84..a1327cee622 100644 --- a/sky/setup_files/setup.py +++ b/sky/setup_files/setup.py @@ -235,7 +235,7 @@ def parse_readme(readme: str) -> str: 'remote': remote, 'runpod': ['runpod>=1.5.1'], 'fluidstack': [], # No dependencies needed for fluidstack - 'cudo': ['cudo-compute>=0.1.8'], + 'cudo': ['cudo-compute>=0.1.10'], 'paperspace': [], # No dependencies needed for paperspace 'vsphere': [ 'pyvmomi==8.0.1.0.2',