diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index ae0a28b52ed..c7c31960482 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -11,4 +11,4 @@ Tested (run the relevant ones): - [ ] Any manual or new tests for this PR (please specify below) - [ ] All smoke tests: `pytest tests/test_smoke.py` - [ ] Relevant individual smoke tests: `pytest tests/test_smoke.py::test_fill_in_the_name` -- [ ] Backward compatibility tests: `bash tests/backward_comaptibility_tests.sh` +- [ ] Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh` diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml index 87d8cea9f16..bf84bea4d50 100644 --- a/.github/workflows/pytest.yml +++ b/.github/workflows/pytest.yml @@ -27,7 +27,7 @@ jobs: - tests/test_optimizer_random_dag.py - tests/test_storage.py - tests/test_wheels.py - - tests/test_spot_serve.py + - tests/test_jobs_and_serve.py - tests/test_yaml_parser.py runs-on: ubuntu-latest steps: diff --git a/.gitignore b/.gitignore index f1dbf59a52f..efa74dd744b 100644 --- a/.gitignore +++ b/.gitignore @@ -11,4 +11,4 @@ sky_logs/ sky/clouds/service_catalog/data_fetchers/*.csv .vscode/ .idea/ - +.env diff --git a/README.md b/README.md index bff69464175..a2704df3643 100644 --- a/README.md +++ b/README.md @@ -27,22 +27,25 @@ ---- :fire: *News* :fire: +- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) +- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/) +- [Apr, 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/) - [Apr, 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) -- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) - [Feb, 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/) -- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) - [Feb, 2024] Serving [**Code Llama 70B**](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/) -- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) - [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) - [Nov, 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) -- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! 
Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) - [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) -- [Aug, 2023] Cookbook: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) +- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) - [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
Archived +- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) +- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) +- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) +- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) - [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/) - [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command! @@ -151,6 +154,9 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest Runnable examples: - LLMs on SkyPilot + - [GPT-2 via `llm.c`](./llm/gpt-2/) + - [Llama 3](./llm/llama-3/) + - [Qwen](./llm/qwen/) - [Databricks DBRX](./llm/dbrx/) - [Gemma](./llm/gemma/) - [Mixtral 8x7B](./llm/mixtral/); [Mistral 7B](https://docs.mistral.ai/self-deployment/skypilot/) (from official Mistral team) @@ -168,7 +174,7 @@ Runnable examples: - [LocalGPT](./llm/localgpt) - [Falcon](./llm/falcon) - Add yours here & see more in [`llm/`](./llm)! -- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama) and [many more (`examples/`)](./examples). 
+- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2) and [many more (`examples/`)](./examples). Follow updates: - [Twitter](https://twitter.com/skypilot_org) @@ -179,6 +185,7 @@ Read the research: - [SkyPilot paper](https://www.usenix.org/system/files/nsdi23-yang-zongheng.pdf) and [talk](https://www.usenix.org/conference/nsdi23/presentation/yang-zongheng) (NSDI 2023) - [Sky Computing whitepaper](https://arxiv.org/abs/2205.07147) - [Sky Computing vision paper](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s02-stoica.pdf) (HotOS 2021) +- [Policy for Managed Spot Jobs](https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao) (NSDI 2024) ## Support and Questions We are excited to hear your feedback! diff --git a/docs/source/_gallery_original/index.rst b/docs/source/_gallery_original/index.rst index b38d53cad52..e8a540c883c 100644 --- a/docs/source/_gallery_original/index.rst +++ b/docs/source/_gallery_original/index.rst @@ -1,3 +1,5 @@ +.. 
_ai-gallery: + AI Gallery ==================== @@ -36,6 +38,8 @@ Contents Mistral 7B (Mistral AI) DBRX (Databricks) Llama-2 (Meta) + Llama-3 (Meta) + Qwen (Alibaba) CodeLlama (Meta) Gemma (Google) diff --git a/docs/source/_gallery_original/llms/llama-3.md b/docs/source/_gallery_original/llms/llama-3.md new file mode 120000 index 00000000000..54133a94f28 --- /dev/null +++ b/docs/source/_gallery_original/llms/llama-3.md @@ -0,0 +1 @@ +../../../../llm/llama-3/README.md \ No newline at end of file diff --git a/docs/source/_gallery_original/llms/qwen.md b/docs/source/_gallery_original/llms/qwen.md new file mode 120000 index 00000000000..2a6d513503f --- /dev/null +++ b/docs/source/_gallery_original/llms/qwen.md @@ -0,0 +1 @@ +../../../../llm/qwen/README.md \ No newline at end of file diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js index 56bae76ec61..5630793d8ff 100644 --- a/docs/source/_static/custom.js +++ b/docs/source/_static/custom.js @@ -5,7 +5,7 @@ document.addEventListener('DOMContentLoaded', function () { script.setAttribute('data-project-name', 'SkyPilot'); script.setAttribute('data-project-color', '#4C4C4D'); script.setAttribute('data-project-logo', 'https://avatars.githubusercontent.com/u/109387420?s=100&v=4'); - script.setAttribute('data-modal-disclaimer', 'Results are automatically generated and may be inaccurate or contain inappropriate information. Do not include any sensitive information in your query.'); + script.setAttribute('data-modal-disclaimer', 'Results are automatically generated and may be inaccurate or contain inappropriate information. Do not include any sensitive information in your query.\n**To get further assistance, you can chat directly with the development team** by joining the [SkyPilot Slack](https://slack.skypilot.co/).'); script.setAttribute('data-modal-title', 'SkyPilot Docs AI - Ask a Question.'); script.setAttribute('data-button-position-bottom', '85px'); script.async = true; @@ -26,9 +26,11 @@ document.addEventListener('DOMContentLoaded', () => { // New items: const newItems = [ { selector: '.caption-text', text: 'SkyServe: Model Serving' }, + { selector: '.toctree-l1 > a', text: 'Managed Jobs' }, { selector: '.toctree-l1 > a', text: 'Running on Kubernetes' }, - { selector: '.toctree-l1 > a', text: 'DBRX (Databricks)' }, { selector: '.toctree-l1 > a', text: 'Ollama' }, + { selector: '.toctree-l1 > a', text: 'Llama-3 (Meta)' }, + { selector: '.toctree-l1 > a', text: 'Qwen (Alibaba)' }, ]; newItems.forEach(({ selector, text }) => { document.querySelectorAll(selector).forEach((el) => { diff --git a/docs/source/cloud-setup/cloud-permissions/index.rst b/docs/source/cloud-setup/cloud-permissions/index.rst index 873cbf339fc..e2a1aaf16ae 100644 --- a/docs/source/cloud-setup/cloud-permissions/index.rst +++ b/docs/source/cloud-setup/cloud-permissions/index.rst @@ -20,3 +20,4 @@ Table of Contents aws gcp vsphere + kubernetes diff --git a/docs/source/cloud-setup/cloud-permissions/kubernetes.rst b/docs/source/cloud-setup/cloud-permissions/kubernetes.rst new file mode 100644 index 00000000000..34d29b49f0a --- /dev/null +++ b/docs/source/cloud-setup/cloud-permissions/kubernetes.rst @@ -0,0 +1,333 @@ +.. _cloud-permissions-kubernetes: + +Kubernetes +========== + +When running outside your Kubernetes cluster, SkyPilot uses your local ``~/.kube/config`` file +for authentication and creating resources on your Kubernetes cluster. 
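+
+As a quick sanity check, you can confirm which context and cluster SkyPilot will pick up from your kubeconfig using standard ``kubectl`` commands (nothing SkyPilot-specific; adjust as needed for your setup):
+
+.. code-block:: console
+
+   $ # Show the context SkyPilot will use by default
+   $ kubectl config current-context
+   $ # Confirm connectivity and basic permissions on the cluster
+   $ kubectl get nodes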
+
+When running inside your Kubernetes cluster (e.g., as a Spot controller or Serve controller),
+SkyPilot can operate using any of the following three authentication methods:
+
+1. **Using your local kubeconfig file**: In this case, SkyPilot will
+   copy your local ``~/.kube/config`` file to the controller pod and use it for
+   authentication. This is the default method when running inside the cluster,
+   and no additional configuration is required.
+
+   .. note::
+
+       If your cluster uses exec-based authentication in your ``~/.kube/config`` file
+       (e.g., GKE uses exec auth by default), SkyPilot may not be able to authenticate using this method. In this case,
+       consider using the service account methods below.
+
+2. **Creating a service account**: SkyPilot can automatically create the service
+   account and roles for itself to manage resources in the Kubernetes cluster.
+   To use this method, set ``remote_identity: SERVICE_ACCOUNT`` in your
+   Kubernetes configuration in the :ref:`~/.sky/config.yaml ` file:
+
+   .. code-block:: yaml
+
+       kubernetes:
+         remote_identity: SERVICE_ACCOUNT
+
+   For details on the permissions that are granted to the service account,
+   refer to the `Minimum Permissions Required for SkyPilot`_ section below.
+
+3. **Using a custom service account**: If you have a custom service account
+   with the `necessary permissions `__, you can configure
+   SkyPilot to use it by adding this to your :ref:`~/.sky/config.yaml ` file:
+
+   .. code-block:: yaml
+
+       kubernetes:
+         remote_identity: your-service-account-name
+
+.. note::
+
+    Service account-based authentication applies only when the remote SkyPilot
+    cluster (including the spot and serve controllers) is launched inside the
+    Kubernetes cluster. When running outside the cluster (e.g., on AWS),
+    SkyPilot will use the local ``~/.kube/config`` file for authentication.
+
+Below are the permissions required by SkyPilot and an example service account YAML that you can use to create a service account with the necessary permissions.
+
+.. _k8s-permissions:
+
+Minimum Permissions Required for SkyPilot
+-----------------------------------------
+
+SkyPilot requires permissions equivalent to the following roles to be able to manage the resources in the Kubernetes cluster:
+
+.. code-block:: yaml
+
+    # Namespaced role for the service account
+    # Required for creating pods, services and other necessary resources in the namespace.
+    # Note these permissions only apply in the namespace where SkyPilot is deployed, and the namespace can be changed below.
+    kind: Role
+    apiVersion: rbac.authorization.k8s.io/v1
+    metadata:
+      name: sky-sa-role  # Can be changed if needed
+      namespace: default  # Change to your namespace if using a different one.
+    rules:
+      - apiGroups: ["*"]
+        resources: ["*"]
+        verbs: ["*"]
+    ---
+    # ClusterRole for accessing cluster-wide resources. Details for each resource below:
+    kind: ClusterRole
+    apiVersion: rbac.authorization.k8s.io/v1
+    metadata:
+      name: sky-sa-cluster-role  # Can be changed if needed
+      namespace: default  # Change to your namespace if using a different one.
+      labels:
+        parent: skypilot
+    rules:
+      - apiGroups: [""]
+        resources: ["nodes"]  # Required for getting node resources.
+        verbs: ["get", "list", "watch"]
+      - apiGroups: ["node.k8s.io"]
+        resources: ["runtimeclasses"]  # Required for autodetecting the runtime class of the nodes.
+        verbs: ["get", "list", "watch"]
+
+
+.. tip::
+
+    If you are using a different namespace than ``default``, make sure to change the namespace in the above manifests.
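+
+For example, once the roles above are bound, you can sanity-check the grants with ``kubectl auth can-i`` by impersonating the account. This is a minimal sketch that assumes a service account named ``sky-sa`` in the ``default`` namespace (adjust the names to your setup):
+
+.. code-block:: console
+
+   $ # Namespaced permissions (pods, services, etc.)
+   $ kubectl auth can-i create pods --as=system:serviceaccount:default:sky-sa -n default
+   $ # Cluster-wide read access to nodes
+   $ kubectl auth can-i list nodes --as=system:serviceaccount:default:sky-sa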
+ +These roles must apply to both the user account configured in the kubeconfig file and the service account used by SkyPilot (if configured). + +If your tasks use object store mounting or require access to ingress resources, you will need to grant additional permissions as described below. + +Permissions for Object Store Mounting +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If your tasks use object store mounting (e.g., S3, GCS, etc.), SkyPilot will need to run a DaemonSet to expose the FUSE device as a Kubernetes resource to SkyPilot pods. + +To allow this, you will need to also create a ``skypilot-system`` namespace which will run the DaemonSet and grant the necessary permissions to the service account in that namespace. + + +.. code-block:: yaml + + # Required only if using object store mounting + # Create namespace for SkyPilot system + apiVersion: v1 + kind: Namespace + metadata: + name: skypilot-system # Do not change this + labels: + parent: skypilot + --- + # Role for the skypilot-system namespace to create FUSE device manager and + # any other system components required by SkyPilot. + # This role must be bound in the skypilot-system namespace to the service account used for SkyPilot. + kind: Role + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: skypilot-system-service-account-role # Can be changed if needed + namespace: skypilot-system # Do not change this namespace + labels: + parent: skypilot + rules: + - apiGroups: ["*"] + resources: ["*"] + verbs: ["*"] + + +Permissions for using Ingress +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If your tasks use :ref:`Ingress ` for exposing ports, you will need to grant the necessary permissions to the service account in the ``ingress-nginx`` namespace. + +.. code-block:: yaml + + # Required only if using ingresses + # Role for accessing ingress service IP + apiVersion: rbac.authorization.k8s.io/v1 + kind: Role + metadata: + namespace: ingress-nginx # Do not change this + name: sky-sa-role-ingress-nginx # Can be changed if needed + rules: + - apiGroups: [""] + resources: ["services"] + verbs: ["list", "get"] + + +.. _k8s-sa-example: + +Example using Custom Service Account +------------------------------------ + +To create a service account that has all necessary permissions for SkyPilot (including for accessing object stores), you can use the following YAML. + +.. tip:: + + In this example, the service account is named ``sky-sa`` and is created in the ``default`` namespace. + Change the namespace and service account name as needed. + + +.. code-block:: yaml + :linenos: + + # create-sky-sa.yaml + kind: ServiceAccount + apiVersion: v1 + metadata: + name: sky-sa # Change to your service account name + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + --- + # Role for the service account + kind: Role + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: sky-sa-role # Can be changed if needed + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + rules: + - apiGroups: ["*"] # Required for creating pods, services, secrets and other necessary resources in the namespace. + resources: ["*"] + verbs: ["*"] + --- + # RoleBinding for the service account + kind: RoleBinding + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: sky-sa-rb # Can be changed if needed + namespace: default # Change to your namespace if using a different one. 
+ labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + roleRef: + kind: Role + name: sky-sa-role # Use the same name as the role at line 14 + apiGroup: rbac.authorization.k8s.io + --- + # ClusterRole for the service account + kind: ClusterRole + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: sky-sa-cluster-role # Can be changed if needed + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + rules: + - apiGroups: [""] + resources: ["nodes"] # Required for getting node resources. + verbs: ["get", "list", "watch"] + - apiGroups: ["node.k8s.io"] + resources: ["runtimeclasses"] # Required for autodetecting the runtime class of the nodes. + verbs: ["get", "list", "watch"] + - apiGroups: ["networking.k8s.io"] # Required for exposing services through ingresses + resources: ["ingressclasses"] + verbs: ["get", "list", "watch"] + --- + # ClusterRoleBinding for the service account + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRoleBinding + metadata: + name: sky-sa-cluster-role-binding # Can be changed if needed + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + namespace: default # Change to your namespace if using a different one. + roleRef: + kind: ClusterRole + name: sky-sa-cluster-role # Use the same name as the cluster role at line 43 + apiGroup: rbac.authorization.k8s.io + --- + # Optional: If using object store mounting, create the skypilot-system namespace + apiVersion: v1 + kind: Namespace + metadata: + name: skypilot-system # Do not change this + labels: + parent: skypilot + --- + # Optional: If using object store mounting, create role in the skypilot-system + # namespace to create FUSE device manager. + kind: Role + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: skypilot-system-service-account-role # Can be changed if needed + namespace: skypilot-system # Do not change this namespace + labels: + parent: skypilot + rules: + - apiGroups: ["*"] + resources: ["*"] + verbs: ["*"] + --- + # Optional: If using object store mounting, create rolebinding in the skypilot-system + # namespace to create FUSE device manager. 
+ apiVersion: rbac.authorization.k8s.io/v1 + kind: RoleBinding + metadata: + name: sky-sa-skypilot-system-role-binding + namespace: skypilot-system # Do not change this namespace + labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + namespace: default # Change this to the namespace where the service account is created + roleRef: + kind: Role + name: skypilot-system-service-account-role # Use the same name as the role at line 88 + apiGroup: rbac.authorization.k8s.io + --- + # Optional: Role for accessing ingress resources + apiVersion: rbac.authorization.k8s.io/v1 + kind: Role + metadata: + name: sky-sa-role-ingress-nginx # Can be changed if needed + namespace: ingress-nginx # Do not change this namespace + labels: + parent: skypilot + rules: + - apiGroups: [""] + resources: ["services"] + verbs: ["list", "get", "watch"] + - apiGroups: ["rbac.authorization.k8s.io"] + resources: ["roles", "rolebindings"] + verbs: ["list", "get", "watch"] + --- + # Optional: RoleBinding for accessing ingress resources + apiVersion: rbac.authorization.k8s.io/v1 + kind: RoleBinding + metadata: + name: sky-sa-rolebinding-ingress-nginx # Can be changed if needed + namespace: ingress-nginx # Do not change this namespace + labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + namespace: default # Change this to the namespace where the service account is created + roleRef: + kind: Role + name: sky-sa-role-ingress-nginx # Use the same name as the role at line 119 + apiGroup: rbac.authorization.k8s.io + +Create the service account using the following command: + +.. code-block:: bash + + $ kubectl apply -f create-sky-sa.yaml + +After creating the service account, the cluster admin may distribute kubeconfigs with the ``sky-sa`` service account to users who need to access the cluster. + +Users should also configure SkyPilot to use the ``sky-sa`` service account through ``~/.sky/config.yaml``: + +.. code-block:: yaml + + # ~/.sky/config.yaml + kubernetes: + remote_identity: sky-sa # Or your service account name diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index d0608f4c0e9..57efa26acbc 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -69,6 +69,9 @@ Runnable examples: * **LLMs on SkyPilot** + * `GPT-2 via llm.c `_ + * `Llama 3 `_ + * `Qwen `_ * `Databricks DBRX `_ * `Gemma `_ * `Mixtral 8x7B `_; `Mistral 7B `_ (from official Mistral team) @@ -87,7 +90,7 @@ Runnable examples: * `Falcon `_ * Add yours here & see more in `llm/ `_! -* Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_ and `many more `_. +* Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_, `llm.c `__ and `many more `_. Follow updates: @@ -112,14 +115,14 @@ Contents ../getting-started/installation ../getting-started/quickstart ../getting-started/tutorial - ../examples/gpu-jupyter + ../examples/interactive-development .. 
toctree:: :maxdepth: 1 :caption: Running Jobs - ../examples/spot-jobs + ../examples/managed-jobs ../reference/job-queue ../examples/auto-failover ../reference/kubernetes/index @@ -137,7 +140,7 @@ Contents :maxdepth: 1 :caption: Cutting Cloud Costs - ../examples/spot-jobs + Managed Spot Jobs <../examples/spot-jobs> ../reference/auto-stop ../reference/benchmark/index diff --git a/docs/source/examples/docker-containers.rst b/docs/source/examples/docker-containers.rst index 9fe835d6b9a..8bc7ae16837 100644 --- a/docs/source/examples/docker-containers.rst +++ b/docs/source/examples/docker-containers.rst @@ -8,6 +8,10 @@ SkyPilot can run a container either as a task, or as the runtime environment of * If the container image is invocable / has an entrypoint: run it :ref:`as a task `. * If the container image is to be used as a runtime environment (e.g., ``ubuntu``, ``nvcr.io/nvidia/pytorch:23.10-py3``, etc.) and if you have extra commands to run inside the container: run it :ref:`as a runtime environment `. +.. note:: + + Running docker containers is `not supported on RunPod `_. To use RunPod, use ``setup`` and ``run`` to configure your environment. See `GitHub issue `_ for more. + .. _docker-containers-as-tasks: Running Containers as Tasks diff --git a/docs/source/examples/gpu-jupyter.rst b/docs/source/examples/gpu-jupyter.rst deleted file mode 100644 index 7f4c593e02f..00000000000 --- a/docs/source/examples/gpu-jupyter.rst +++ /dev/null @@ -1,98 +0,0 @@ -GPU-backed Jupyter Notebooks -============================ - -Jupyter notebooks are a useful tool for interactive development, debugging, and -visualization. SkyPilot makes the process of running a GPU-backed Jupyter notebook -simple by automatically managing provisioning and port forwarding. - -To get a machine with a GPU attached, use: - -.. code-block:: bash - - # Launch a VM with 1 NVIDIA GPU and forward port 8888 to localhost - sky launch -c jupyter-vm --gpus K80:1 - ssh -L 8888:localhost:8888 jupyter-vm - -.. note:: - - View the supported GPUs with the :code:`sky show-gpus` command. - -Use ``ssh jupyter-vm`` to SSH into the VM. Inside the VM, you can run the -following commands to start a Jupyter session: - -.. code-block:: bash - - pip install jupyter - jupyter notebook - -In your local browser, you should now be able to access :code:`localhost:8888` and see the following screen: - -.. image:: ../images/jupyter-auth.png - :width: 100% - :alt: Jupyter authentication window - -Enter the password or token and you will be directed to a page where you can create a new notebook. - -.. image:: ../images/jupyter-create.png - :width: 100% - :alt: Create a new Jupyter notebook - -You can verify that this notebook is running on the GPU-backed instance using :code:`nvidia-smi`. - -.. image:: ../images/jupyter-gpu.png - :width: 100% - :alt: nvidia-smi in notebook - -The GPU node is a normal SkyPilot cluster, so you can use the usual CLI commands on it. For example, run ``sky down/stop`` to terminate or stop it, and ``sky exec`` to execute a task. - -Notebooks in SkyPilot tasks ---------------------------- -Jupyter notebooks can also be used in SkyPilot tasks, allowing access to the full -range of SkyPilot's features including :ref:`mounted storage ` and -:ref:`autostop `. - -The following :code:`jupyter.yaml` is an example of a task specification that can launch notebooks with SkyPilot. - -.. 
code:: yaml
-
-   # jupyter.yaml
-
-   name: jupyter
-
-   resources:
-     accelerators: K80:1
-
-   file_mounts:
-     /covid:
-       source: s3://fah-public-data-covid19-cryptic-pockets
-       mode: MOUNT
-
-   setup: |
-     pip install --upgrade pip
-     conda init bash
-     conda create -n jupyter python=3.9 -y
-     conda activate jupyter
-     pip install jupyter
-
-   run: |
-     cd ~/sky_workdir
-     conda activate jupyter
-     jupyter notebook --port 8888
-
-Launch the GPU-backed Jupyter notebook:
-
-.. code:: bash
-
-   sky launch -c jupyter jupyter.yaml
-
-To access the notebook locally, use SSH port forwarding.
-
-.. code:: bash
-
-   ssh -L 8888:localhost:8888 jupyter
-
-You can verify that this notebook has access to the mounted storage bucket.
-
-.. image:: ../images/jupyter-covid.png
-   :width: 100%
-   :alt: accessing covid data from notebook
diff --git a/docs/source/examples/interactive-development.rst b/docs/source/examples/interactive-development.rst
new file mode 100644
index 00000000000..f9c4b5c4803
--- /dev/null
+++ b/docs/source/examples/interactive-development.rst
@@ -0,0 +1,209 @@
+Start a Development Cluster
+===========================
+
+
+SkyPilot makes interactive development easy on Kubernetes or cloud VMs. It helps you:
+
+#. :ref:`Launch `: Quickly get a cluster with a GPU or other required resources with a single command.
+#. :ref:`Autostop `: Automatically stop the cluster after some idle time for cost savings.
+#. :ref:`Connect `: Easily connect to the cluster using the cluster name:
+
+   - :ref:`SSH `
+   - :ref:`VSCode `
+   - :ref:`Jupyter Notebooks `
+
+.. _dev-launch:
+
+Launch
+------
+
+To launch a cluster with a cheap GPU for development:
+
+.. code-block:: bash
+
+   # Launch a cluster with 1 NVIDIA GPU and sync the local working directory to the
+   # cluster.
+   sky launch -c dev --gpus T4 --workdir .
+
+This can be launched as a pod in your Kubernetes cluster or a VM on any cloud.
+
+.. tab-set::
+
+    .. tab-item:: Kubernetes
+
+        .. figure:: ../images/k8s-pod.png
+            :align: center
+            :width: 80%
+            :alt: Launch a cluster as a pod in Kubernetes
+
+            Launch a cluster as a pod in Kubernetes
+
+    .. tab-item:: Cloud
+
+        .. figure:: ../images/gcp-vm.png
+            :align: center
+            :width: 80%
+            :alt: Launch a cluster as a VM on GCP
+
+            Launch a cluster as a VM on GCP
+
+.. note::
+
+   View the supported GPUs with the :code:`sky show-gpus` command.
+
+.. _dev-autostop:
+
+Autostop
+--------
+
+SkyPilot allows you to automatically stop the cluster after a period of idle time to save costs. You can set the autostop time with a single command:
+
+.. code-block:: bash
+
+   # Autostop the cluster after 5 hours (300 minutes) of idleness
+   sky autostop -i 300 dev
+
+Or add the :code:`-i` flag during launch:
+
+.. code-block:: bash
+
+   # Launch a cluster that autostops after 5 idle hours
+   sky launch -c dev --gpus T4 --workdir . -i 300
+
+For more details on autostop, see :ref:`auto-stop`. This feature prevents idle
+clusters from incurring unnecessary costs, ensuring your cluster
+stops automatically, whether overnight or over the weekend.
+
+
+.. _dev-connect:
+
+Connect
+-------
+
+You can easily connect to the cluster using your method of choice:
+
+.. _dev-ssh:
+
+SSH
+~~~
+
+SkyPilot automatically configures SSH settings for a cluster, so you can connect to it with just the cluster name:
+
+.. code-block:: bash
+
+   ssh dev
+
+
+.. 
_dev-vscode: + +VSCode +~~~~~~ + +A common use case for interactive development is to connect a local IDE to a remote cluster and directly edit code that lives on the cluster. +This is supported by simply connecting VSCode to the cluster with the cluster name: + +#. Click on the top bar, type: :code:`> remote-ssh`, and select :code:`Remote-SSH: Connect Current Window to Host...` +#. Select the cluster name (e.g., ``dev``) from the list of hosts. +#. Open folder: :code:`sky_workdir` and find your code. + +For more details, please refer to the `VSCode documentation `__. + +.. image:: https://imgur.com/8mKfsET.gif + :align: center + :alt: Connect to the cluster with VSCode + +.. _dev-notebooks: + +Jupyter Notebooks +~~~~~~~~~~~~~~~~~ + +Jupyter notebooks are a useful tool for interactive development, debugging, and +visualization. + +Connect to the machine and forward the port used by jupyter notebook: + +.. code-block:: bash + + ssh -L 8888:localhost:8888 dev + +Inside the cluster, you can run the following commands to start a Jupyter session: + +.. code-block:: bash + + pip install jupyter + jupyter notebook + +In your local browser, you should now be able to access :code:`localhost:8888` and see the following screen: + +.. image:: ../images/jupyter-auth.png + :width: 100% + :alt: Jupyter authentication window + +Enter the password or token and you will be directed to a page where you can create a new notebook. + +.. image:: ../images/jupyter-create.png + :width: 100% + :alt: Create a new Jupyter notebook + +You can verify that this notebook is running on the GPU-backed instance using :code:`nvidia-smi`. + +.. image:: ../images/jupyter-gpu.png + :width: 100% + :alt: nvidia-smi in notebook + +The GPU node is a normal SkyPilot cluster, so you can use the usual CLI commands on it. For example, run ``sky down/stop`` to terminate or stop it, and ``sky exec`` to execute a task. + +Notebooks in SkyPilot tasks +^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Jupyter notebooks can also be used in SkyPilot tasks, allowing access to the full +range of SkyPilot's features including :ref:`mounted storage ` and +:ref:`autostop `. + +The following :code:`jupyter.yaml` is an example of a task specification that can launch notebooks with SkyPilot. + +.. code:: yaml + + # jupyter.yaml + + name: jupyter + + resources: + accelerators: T4:1 + + file_mounts: + /covid: + source: s3://fah-public-data-covid19-cryptic-pockets + mode: MOUNT + + setup: | + pip install --upgrade pip + conda init bash + conda create -n jupyter python=3.9 -y + conda activate jupyter + pip install jupyter + + run: | + cd ~/sky_workdir + conda activate jupyter + jupyter notebook --port 8888 & + +Launch the GPU-backed Jupyter notebook: + +.. code:: bash + + sky launch -c jupyter jupyter.yaml + +To access the notebook locally, use SSH port forwarding. + +.. code:: bash + + ssh -L 8888:localhost:8888 jupyter + +You can verify that this notebook has access to the mounted storage bucket. + +.. image:: ../images/jupyter-covid.png + :width: 100% + :alt: accessing covid data from notebook + + + diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst new file mode 100644 index 00000000000..a47b4345b9f --- /dev/null +++ b/docs/source/examples/managed-jobs.rst @@ -0,0 +1,485 @@ +.. _managed-jobs: + +Managed Jobs +============ + +.. tip:: + + This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines). 
+
+SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
+It can be used in three modes:
+
+#. :ref:`Managed Spot Jobs `: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
+#. :ref:`On-demand `: Jobs run on auto-recovering on-demand instances. This is useful for jobs that require guaranteed resources.
+#. :ref:`Pipelines `: Run pipelines that contain multiple tasks (which can have different resource requirements and ``setup``/``run`` commands). This is useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it.
+
+
+.. _spot-jobs:
+
+Managed Spot Jobs
+-----------------
+
+In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
+Any spot preemptions are automatically handled by SkyPilot without user intervention.
+
+
+A quick comparison of *unmanaged spot clusters* vs. *managed spot jobs*:
+
+.. list-table::
+   :widths: 30 18 12 35
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky launch --use-spot`
+     - Unmanaged spot cluster
+     - Yes
+     - Interactive dev on spot instances (especially for hardware with low preemption rates)
+   * - :code:`sky jobs launch --use-spot`
+     - Managed spot job (auto-recovery)
+     - No
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
+
+Here is an example of a BERT training job failing over different regions across AWS and GCP.
+
+.. image:: https://i.imgur.com/Vteg3fK.gif
+   :width: 600
+   :alt: GIF for BERT training on Spot V100
+
+.. image:: ../images/spot-training.png
+   :width: 600
+   :alt: Static plot, BERT training on Spot V100
+
+To use managed spot jobs, there are two requirements:
+
+#. :ref:`Job YAML `: Managed Spot requires a YAML describing the job, which should first be tested with :code:`sky launch`.
+#. :ref:`Checkpointing ` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket `. The program can reload the latest checkpoint when restarted.
+
+
+.. _job-yaml:

+
+Job YAML
+~~~~~~~~
+
+To launch a managed job, you can simply reuse your job YAML (recommended to test it with :code:`sky launch` first).
+For example, say we have verified that the BERT fine-tuning YAML below works with :code:`sky launch`, and now want to
+launch it with SkyPilot managed spot jobs.
+
+We can launch it with the following:
+
+.. code-block:: console
+
+   $ sky jobs launch -n bert-qa bert_qa.yaml
+
+
+.. code-block:: yaml
+
+   # bert_qa.yaml
+   name: bert-qa
+
+   resources:
+     accelerators: V100:1
+     # Use spot instances to save cost.
+     use_spot: true
+
+   # Assume your working directory is under `~/transformers`.
+   # To make this example work, please run the following command:
+   # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
+   workdir: ~/transformers
+
+   setup: |
+     # Fill in your wandb key: copy from https://wandb.ai/authorize
+     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
+     # to pass the key in the command line, during `sky jobs launch`.
+     echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+
+     pip install -e .
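+     # Move into the QA example and install its dependencies (including a pinned CUDA 11.3 torch build).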
+     cd examples/pytorch/question-answering/
+     pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
+     pip install wandb
+
+   run: |
+     cd ./examples/pytorch/question-answering/
+     python run_qa.py \
+     --model_name_or_path bert-base-uncased \
+     --dataset_name squad \
+     --do_train \
+     --do_eval \
+     --per_device_train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 50 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --report_to wandb
+
+
+.. note::
+
+   :ref:`workdir ` and :ref:`file mounts with local files ` will be automatically uploaded to a
+   :ref:`cloud bucket `. The bucket will be created while the job is running, and cleaned up after the job
+   finishes.
+
+SkyPilot will launch and start monitoring the job. When a spot preemption or any machine failure happens, SkyPilot will automatically
+search for resources across regions and clouds to re-launch the job.
+
+In this example, the job will be restarted from scratch after each preemption recovery.
+To resume the job from previous states, the user's application needs to implement checkpointing and recovery.
+
+
+.. _checkpointing:
+
+Checkpointing and Recovery
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To allow job recovery, a cloud bucket is typically needed to store the job's states (e.g., model checkpoints).
+Below is an example of mounting a bucket to :code:`/checkpoint`.
+
+.. code-block:: yaml
+
+   file_mounts:
+     /checkpoint:
+       name: # NOTE: Fill in your bucket name
+       mode: MOUNT
+
+The :code:`MOUNT` mode in :ref:`SkyPilot bucket mounting ` ensures the checkpoints written to :code:`/checkpoint` are automatically synced to a persistent bucket.
+Note that the application code should save program checkpoints periodically and reload those states when the job is restarted.
+This is typically achieved by reloading the latest checkpoint at the beginning of your program.
+
+.. _spot-jobs-end-to-end:
+
+An End-to-End Example
+~~~~~~~~~~~~~~~~~~~~~
+
+Below we show an `example `_ for fine-tuning a BERT model on a question-answering task with HuggingFace.
+
+.. code-block:: yaml
+   :emphasize-lines: 13-16,43-46
+
+   # bert_qa.yaml
+   name: bert-qa
+
+   resources:
+     accelerators: V100:1
+     use_spot: true
+
+   # Assume your working directory is under `~/transformers`.
+   # To make this example work, please run the following command:
+   # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
+   workdir: ~/transformers
+
+   file_mounts:
+     /checkpoint:
+       name: # NOTE: Fill in your bucket name
+       mode: MOUNT
+
+   setup: |
+     # Fill in your wandb key: copy from https://wandb.ai/authorize
+     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
+     # to pass the key in the command line, during `sky jobs launch`.
+     echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+
+     pip install -e .
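+     # Move into the QA example and install its dependencies, plus wandb for tracking.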
+     cd examples/pytorch/question-answering/
+     pip install -r requirements.txt
+     pip install wandb
+
+   run: |
+     cd ./examples/pytorch/question-answering/
+     python run_qa.py \
+     --model_name_or_path bert-base-uncased \
+     --dataset_name squad \
+     --do_train \
+     --do_eval \
+     --per_device_train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 50 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --report_to wandb \
+     --run_name $SKYPILOT_TASK_ID \
+     --output_dir /checkpoint/bert_qa/ \
+     --save_total_limit 10 \
+     --save_steps 1000
+
+
+
+As HuggingFace has built-in support for periodic checkpointing, we only need to pass the highlighted arguments to set up
+the output directory and frequency of checkpointing (see more
+on the `HuggingFace API `_).
+You may also refer to another example `here `__ for periodically checkpointing with PyTorch.
+
+We also set :code:`--run_name` to :code:`$SKYPILOT_TASK_ID` so that the logs for all recoveries of the same job will be saved
+to the same run in Weights & Biases.
+
+.. note::
+   The environment variable :code:`$SKYPILOT_TASK_ID` (example: "sky-managed-2022-10-06-05-17-09-750781_bert-qa_8-0") can be used to identify the same job, i.e., it is kept identical across all
+   recoveries of the job.
+   It can be accessed in the task's :code:`run` commands or directly in the program itself (e.g., access
+   via :code:`os.environ` and pass to Weights & Biases for tracking purposes in your training script). It is made available to
+   the task whenever it is invoked.
+
+With the highlighted changes, the managed spot job can now resume training after preemption! We can enjoy the benefits of
+cost savings from spot instances without worrying about preemption or losing progress.
+
+.. code-block:: console
+
+   $ sky jobs launch -n bert-qa bert_qa.yaml
+
+.. tip::
+
+   Try copy-pasting this example and adapting it to your own job.
+
+
+
+Real-World Examples
+~~~~~~~~~~~~~~~~~~~
+
+* `Vicuna `_ LLM chatbot: `instructions `_, `YAML `__
+* BERT (shown above): `YAML `__
+* PyTorch DDP, ResNet: `YAML `__
+* PyTorch Lightning DDP, CIFAR-10: `YAML `__
+
+
+.. _on-demand:
+
+Using On-Demand Instances
+-------------------------
+
+The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
+on-demand instances. This is useful when you want SkyPilot to monitor any underlying
+machine failures and transparently recover the job.
+
+To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI.
+
+.. code-block:: console
+
+   $ sky jobs launch -n bert-qa bert_qa.yaml --use-spot false
+
+.. tip::
+
+   It is useful to think of ``sky jobs launch`` as a "serverless" managed job
+   interface, while ``sky launch`` is a cluster interface (that you can launch
+   tasks on, albeit not managed).
+
+Either Spot or On-Demand
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can use ``any_of`` to specify either spot or on-demand instances as
+candidate resources for a job. See documentation :ref:`here
+` for more details.
+
+.. code-block:: yaml
+
+   resources:
+     accelerators: A100:8
+     any_of:
+       - use_spot: true
+       - use_spot: false
+
+In this example, SkyPilot will perform cost optimizations to select the resource to use, which will almost certainly
+be spot instances. If spot instances are not available, SkyPilot will fall back to launching on-demand instances.
+
+More advanced policies for resource selection, such as the `Can't Be Late
+`__ (NSDI'24)
+paper, may be supported in the future.
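+
+For example, assuming the ``any_of`` snippet above is saved into a job YAML (a hypothetical ``train.yaml``), it is launched like any other managed job:
+
+.. code-block:: console
+
+   $ # SkyPilot tries spot first and falls back to on-demand if spot is unavailable.
+   $ sky jobs launch -n train-any train.yaml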
+ +Useful CLIs +----------- + +Here are some commands for managed jobs. Check :code:`sky jobs --help` and :ref:`CLI reference ` for more details. + +See all managed jobs: + +.. code-block:: console + + $ sky jobs queue + +.. code-block:: console + + Fetching managed job statuses... + Managed jobs: + ID NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 2 roberta 1x [A100:8][Spot] 2 hrs ago 2h 47m 18s 2h 36m 18s 0 RUNNING + 1 bert-qa 1x [V100:1][Spot] 4 hrs ago 4h 24m 26s 4h 17m 54s 0 RUNNING + +Stream the logs of a running managed job: + +.. code-block:: console + + $ sky jobs logs -n bert-qa # by name + $ sky jobs logs 2 # by job ID + +Cancel a managed job: + +.. code-block:: console + + $ sky jobs cancel -n bert-qa # by name + $ sky jobs cancel 2 # by job ID + +.. note:: + If any failure happens for a managed job, you can check :code:`sky jobs queue -a` for the brief reason + of the failure. For more details, it would be helpful to check :code:`sky jobs logs --controller `. + + +.. _pipeline: + +Job Pipelines +------------- + +A pipeline is a managed job that contains a sequence of tasks running one after another. + +This is useful for running a sequence of tasks that depend on each other, e.g., training a model and then running inference on it. +Different tasks can have different resource requirements to use appropriate per-task resources, which saves costs, while keeping the burden of managing the tasks off the user. + +.. note:: + In other words, a managed job is either a single task or a pipeline of tasks. All managed jobs are submitted by :code:`sky jobs launch`. + +To run a pipeline, specify the sequence of tasks in a YAML file. Here is an example: + +.. code-block:: yaml + + name: pipeline + + --- + + name: train + + resources: + accelerators: V100:8 + any_of: + - use_spot: true + - use_spot: false + + file_mounts: + /checkpoint: + name: train-eval # NOTE: Fill in your bucket name + mode: MOUNT + + setup: | + echo setup for training + + run: | + echo run for training + echo save checkpoints to /checkpoint + + --- + + name: eval + + resources: + accelerators: T4:1 + use_spot: false + + file_mounts: + /checkpoint: + name: train-eval # NOTE: Fill in your bucket name + mode: MOUNT + + setup: | + echo setup for eval + + run: | + echo load trained model from /checkpoint + echo eval model on test set + + +The YAML above defines a pipeline with two tasks. The first :code:`name: +pipeline` names the pipeline. The first task has name :code:`train` and the +second task has name :code:`eval`. The tasks are separated by a line with three +dashes :code:`---`. Each task has its own :code:`resources`, :code:`setup`, and +:code:`run` sections. Tasks are executed sequentially. + +To submit the pipeline, the same command :code:`sky jobs launch` is used. The pipeline will be automatically launched and monitored by SkyPilot. You can check the status of the pipeline with :code:`sky jobs queue` or :code:`sky jobs dashboard`. + +.. code-block:: console + + $ sky jobs launch -n pipeline pipeline.yaml + $ sky jobs queue + Fetching managed job statuses... + Managed jobs + In progress jobs: 1 RECOVERING + ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 8 pipeline - 50 mins ago 47m 45s - 1 RECOVERING + ↳ 0 train 1x [V100:8][Spot|On-demand] 50 mins ago 47m 45s - 1 RECOVERING + ↳ 1 eval 1x [T4:1] - - - 0 PENDING + +.. note:: + + The :code:`$SKYPILOT_TASK_ID` environment variable is also available in the :code:`run` section of each task. 
It is unique for each task in the pipeline.
+   For example, the :code:`$SKYPILOT_TASK_ID` for the :code:`eval` task above is:
+   "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1".
+
+
+
+Dashboard
+---------
+
+Use ``sky jobs dashboard`` to open a dashboard to see all jobs:
+
+.. code-block:: console
+
+   $ sky jobs dashboard
+
+This automatically opens a browser tab to show the dashboard:
+
+.. image:: ../images/job-dashboard.png
+
+The UI shows the same information as the CLI ``sky jobs queue -a``. The UI is
+especially useful when there are many in-progress jobs to monitor, which the
+terminal-based CLI may need more than one page to display.
+
+
+Concept: Jobs Controller
+------------------------
+
+The jobs controller is a small on-demand CPU VM running in the cloud that manages all jobs of a user.
+It is automatically launched when the first managed job is submitted, and it is autostopped after it has been idle for 10 minutes (i.e., after all managed jobs finish and no new managed job is submitted in that duration).
+Thus, **no user action is needed** to manage its lifecycle.
+
+You can see the controller with :code:`sky status` and refresh its status by using the :code:`-r/--refresh` flag.
+
+While the cost of the jobs controller is negligible (~$0.4/hour when running and less than $0.004/hour when stopped),
+you can still tear it down manually with
+:code:`sky down `, where the ```` can be found in the output of :code:`sky status`.
+
+.. note::
+   Tearing down the jobs controller loses all logs and status information for the finished managed jobs. It is only allowed when there are no in-progress managed jobs to ensure no resource leakage.
+
+Customizing Job Controller Resources
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You may want to customize the resources of the jobs controller for several reasons:
+
+#. Changing the maximum number of jobs that can be run concurrently, which is 2x the vCPUs of the controller. (Default: 16)
+#. Using a lower-cost controller (if you have a low number of concurrent managed jobs).
+#. Running the jobs controller in a specific location. (Default: cheapest location)
+#. Changing the disk_size of the jobs controller to store more logs. (Default: 50GB)
+
+To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields:
+
+.. code-block:: yaml
+
+   jobs:
+     # NOTE: these settings only take effect for a new jobs controller, not if
+     # you have an existing one.
+     controller:
+       resources:
+         # All configs below are optional.
+         # Specify the location of the jobs controller.
+         cloud: gcp
+         region: us-central1
+         # Specify the maximum number of managed jobs that can be run concurrently.
+         cpus: 4+  # number of vCPUs, max concurrent jobs = 2 * cpus
+         # Specify the disk_size in GB of the jobs controller.
+         disk_size: 100
+
+The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.
+
+.. note::
+   These settings will not take effect if you have an existing controller (either
+   stopped or live). For them to take effect, tear down the existing controller
+   first, which requires all in-progress jobs to finish or be canceled.
+
diff --git a/docs/source/examples/spot-jobs.rst b/docs/source/examples/spot-jobs.rst
index 5940e404bb3..2b3df600425 100644
--- a/docs/source/examples/spot-jobs.rst
+++ b/docs/source/examples/spot-jobs.rst
@@ -1,389 +1,23 @@
-.. _spot-jobs:
-
 Managed Spot Jobs
-================================================
-
-.. 
tip:: - - This feature is great for scaling out: running a single job for long durations, or running many jobs. - -SkyPilot supports managed spot jobs that can **automatically recover from preemptions**. -This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances practical for long-running jobs. - -SkyPilot automatically finds available spot resources across regions and clouds to maximize availability. -Here is an example of a BERT training job failing over different regions across AWS and GCP. - -.. image:: https://i.imgur.com/Vteg3fK.gif - :width: 600 - :alt: GIF for BERT training on Spot V100 - -.. image:: ../images/spot-training.png - :width: 600 - :alt: Static plot, BERT training on Spot V100 - -To use managed spot jobs, there are two requirements: - -#. **Task YAML**: Managed Spot requires a YAML to describe the job, tested with :code:`sky launch`. -#. **Checkpointing** (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket `. The program can reload the latest checkpoint when restarted. - - -Task YAML ---------- - -To launch a spot job, you can simply reuse your task YAML (recommended to test it with :code:`sky launch` first). -For example, we found the BERT fine-tuning YAML works with :code:`sky launch`, and want to -launch it with SkyPilot managed spot jobs. - -We can launch it with the following: - -.. code-block:: console - - $ sky spot launch -n bert-qa bert_qa.yaml - - -.. code-block:: yaml - - # bert_qa.yaml - name: bert-qa - - resources: - accelerators: V100:1 - - # Assume your working directory is under `~/transformers`. - # To make this example work, please run the following command: - # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1 - workdir: ~/transformers - - setup: | - # Fill in your wandb key: copy from https://wandb.ai/authorize - # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY` - # to pass the key in the command line, during `sky spot launch`. - echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc - - pip install -e . - cd examples/pytorch/question-answering/ - pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 - pip install wandb - - run: | - cd ./examples/pytorch/question-answering/ - python run_qa.py \ - --model_name_or_path bert-base-uncased \ - --dataset_name squad \ - --do_train \ - --do_eval \ - --per_device_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 50 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --report_to wandb - - -.. note:: - - :ref:`workdir ` and :ref:`file mounts with local files ` will be automatically uploaded to a - :ref:`cloud bucket `. The bucket will be created during the job running time, and cleaned up after the job - finishes. - -SkyPilot will launch and start monitoring the spot job. When a preemption happens, SkyPilot will automatically -search for resources across regions and clouds to re-launch the job. - -In this example, the job will be restarted from scratch after each preemption recovery. -To resume the job from previous states, user's application needs to implement checkpointing and recovery. - - -Checkpointing and recovery --------------------------- - -To allow spot recovery, a cloud bucket is typically needed to store the job's states (e.g., model checkpoints). -Below is an example of mounting a bucket to :code:`/checkpoint`. - -.. 
code-block:: yaml - - file_mounts: - /checkpoint: - name: # NOTE: Fill in your bucket name - mode: MOUNT - -The :code:`MOUNT` mode in :ref:`SkyPilot bucket mounting ` ensures the checkpoints outputted to :code:`/checkpoint` are automatically synced to a persistent bucket. -Note that the application code should save program checkpoints periodically and reload those states when the job is restarted. -This is typically achieved by reloading the latest checkpoint at the beginning of your program. - -.. _spot-jobs-end-to-end: - -An end-to-end example ---------------------- - -Below we show an `example `_ for fine-tuning a BERT model on a question-answering task with HuggingFace. - -.. code-block:: yaml - :emphasize-lines: 12-15,41-44 - - # bert_qa.yaml - name: bert-qa - - resources: - accelerators: V100:1 - - # Assume your working directory is under `~/transformers`. - # To make this example work, please run the following command: - # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1 - workdir: ~/transformers - - file_mounts: - /checkpoint: - name: # NOTE: Fill in your bucket name - mode: MOUNT - - setup: | - # Fill in your wandb key: copy from https://wandb.ai/authorize - # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY` - # to pass the key in the command line, during `sky spot launch`. - echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc - - pip install -e . - cd examples/pytorch/question-answering/ - pip install -r requirements.txt - pip install wandb - - run: | - cd ./examples/pytorch/question-answering/ - python run_qa.py \ - --model_name_or_path bert-base-uncased \ - --dataset_name squad \ - --do_train \ - --do_eval \ - --per_device_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 50 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --report_to wandb \ - --run_name $SKYPILOT_TASK_ID \ - --output_dir /checkpoint/bert_qa/ \ - --save_total_limit 10 \ - --save_steps 1000 - - - -As HuggingFace has built-in support for periodically checkpointing, we only need to pass the highlighted arguments for setting up -the output directory and frequency of checkpointing (see more -on `Huggingface API `_). -You may also refer to another example `here `__ for periodically checkpointing with PyTorch. - -We also set :code:`--run_name` to :code:`$SKYPILOT_TASK_ID` so that the logs for all recoveries of the same job will be saved -to the same run in Weights & Biases. - -.. note:: - The environment variable :code:`$SKYPILOT_TASK_ID` (example: "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1") can be used to identify the same job, i.e., it is kept identical across all - recoveries of the job. - It can be accessed in the task's :code:`run` commands or directly in the program itself (e.g., access - via :code:`os.environ` and pass to Weights & Biases for tracking purposes in your training script). It is made available to - the task whenever it is invoked. - -With the highlighted changes, the managed spot job can now resume training after preemption with ``sky spot launch``! We can enjoy the benefits of -cost savings from spot instances without worrying about preemption or losing progress. - -.. code-block:: console - - $ sky spot launch -n bert-qa bert_qa.yaml - -.. tip:: - - Try copy-paste this example and adapt it to your own job. - - -Useful CLIs ------------ - -Here are some commands for managed spot jobs. Check :code:`sky spot --help` for more details. - -See all spot jobs: - -.. 
code-block:: console - - $ sky spot queue - -.. code-block:: console - - Fetching managed spot job statuses... - Managed spot jobs: - ID NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS - 2 roberta 1x [A100:8] 2 hrs ago 2h 47m 18s 2h 36m 18s 0 RUNNING - 1 bert-qa 1x [V100:1] 4 hrs ago 4h 24m 26s 4h 17m 54s 0 RUNNING - -Stream the logs of a running spot job: - -.. code-block:: console - - $ sky spot logs -n bert-qa # by name - $ sky spot logs 2 # by job ID - -Cancel a spot job: - -.. code-block:: console - - $ sky spot cancel -n bert-qa # by name - $ sky spot cancel 2 # by job ID - -.. note:: - If any failure happens for a spot job, you can check :code:`sky spot queue -a` for the brief reason - of the failure. For more details, it would be helpful to check :code:`sky spot logs --controller `. - -Dashboard ------------ - -Use ``sky spot dashboard`` to open a dashboard to see all jobs: - -.. code-block:: console - - $ sky spot dashboard - -This automatically opens a browser tab to show the dashboard: - -.. image:: ../images/spot-dashboard.png - -The UI shows the same information as the CLI ``sky spot queue -a``. The UI is -especially useful when there are many in-progress jobs to monitor, which the -terminal-based CLI may need more than one page to display. - -Real-world examples -------------------------- - -* `Vicuna `_ LLM chatbot: `instructions `_, `YAML `__ -* BERT (shown above): `YAML `__ -* PyTorch DDP, ResNet: `YAML `__ -* PyTorch Lightning DDP, CIFAR-10: `YAML `__ - -Spot controller -------------------------------- - -The spot controller is a small on-demand CPU VM running in the cloud that manages all spot jobs of a user. -It is automatically launched when the first managed spot job is submitted, and it is autostopped after it has been idle for 10 minutes (i.e., after all spot jobs finish and no new spot job is submitted in that duration). -Thus, **no user action is needed** to manage its lifecycle. - -You can see the controller with :code:`sky status` and refresh its status by using the :code:`-r/--refresh` flag. - -While the cost of the spot controller is negligible (~$0.4/hour when running and less than $0.004/hour when stopped), -you can still tear it down manually with -:code:`sky down `, where the ```` can be found in the output of :code:`sky status`. - -.. note:: - Tearing down the spot controller loses all logs and status information for the finished spot jobs. It is only allowed when there are no in-progress spot jobs to ensure no resource leakage. - -Customizing spot controller resources -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You may want to customize the resources of the spot controller for several reasons: - -1. Use a lower-cost controller (if you have a low number of concurrent spot jobs). -2. Enforcing the spot controller to run on a specific location. (Default: cheapest location) -3. Changing the maximum number of spot jobs that can be run concurrently, which is 2x the vCPUs of the controller. (Default: 16) -4. Changing the disk_size of the spot controller to store more logs. (Default: 50GB) - -To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields: - -.. code-block:: yaml - - spot: - # NOTE: these settings only take effect for a new spot controller, not if - # you have an existing one. - controller: - resources: - # All configs below are optional. - # Specify the location of the spot controller. 
- cloud: gcp - region: us-central1 - # Specify the maximum number of spot jobs that can be run concurrently. - cpus: 4+ # number of vCPUs, max concurrent spot jobs = 2 * cpus - # Specify the disk_size in GB of the spot controller. - disk_size: 100 - -The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__. - -.. note:: - These settings will not take effect if you have an existing controller (either - stopped or live). For them to take effect, tear down the existing controller - first, which requires all in-progress spot jobs to finish or be canceled. - - -Spot Pipeline -------------------------- - -Spot Pipeline is a feature that allows you to submit a spot job that contains a sequence of spot tasks running one after another. -This is useful for running a sequence of jobs that depend on each other, e.g., training a model and then running inference on it. -This allows the multiple tasks to have different resource requirements to fully utilize the resources and save cost, while keeping the burden of managing the tasks off the user. - -.. note:: - A spot job is either a single task or a pipeline of tasks. A spot job is submitted by :code:`sky spot launch`. - - All tasks in a pipeline will be run on spot instances. - -To use Spot Pipeline, you can specify the sequence of jobs in a YAML file. Here is an example: - -.. code-block:: yaml - - name: pipeline - - --- - - name: train - - resources: - accelerators: V100:8 - - file_mounts: - /checkpoint: - name: train-eval # NOTE: Fill in your bucket name - mode: MOUNT - - setup: | - echo setup for training - - run: | - echo run for training - echo save checkpoints to /checkpoint - - --- - - name: eval - - resources: - accelerators: T4:1 - - file_mounts: - /checkpoint: - name: train-eval # NOTE: Fill in your bucket name - mode: MOUNT - - setup: | - echo setup for eval - - run: | - echo load trained model from /checkpoint - echo eval model on test set - - -The above YAML file defines a pipeline with two tasks. The first :code:`name: pipeline` names the pipeline. The first task has name :code:`train` and the second task has name :code:`eval`. The tasks are separated by a line with three dashes :code:`---`. Each task has its own :code:`resources`, :code:`setup`, and :code:`run` sections. The :code:`setup` and :code:`run` sections are executed sequentially. - -To submit the pipeline, the same command :code:`sky spot launch` is used. The pipeline will be automatically launched and monitored by SkyPilot. You can check the status of the pipeline with :code:`sky spot queue` or :code:`sky spot dashboard`. - -.. note:: - - The :code:`$SKYPILOT_TASK_ID` environment variable is also available in the :code:`run` section of each task. It is unique for each task in the pipeline. - For example, the :code:`$SKYPILOT_TASK_ID` for the :code:`eval` task above is: - "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1". - -.. code-block:: console - - $ sky spot launch -n pipeline pipeline.yaml - $ sky spot queue - Fetching managed spot job statuses... - Managed spot jobs - In progress tasks: 1 PENDING, 1 RECOVERING - ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS - 8 pipeline - 50 mins ago 47m 45s - 1 RECOVERING - ↳ 0 train 1x [V100:8] 50 mins ago 47m 45s - 1 RECOVERING - ↳ 1 eval 1x [T4:1] - - - 0 PENDING - +================== + +.. 
 raw:: html + + diff --git a/docs/source/getting-started/installation.rst b/docs/source/getting-started/installation.rst index 490d2d76311..e5b318d4f87 100644 --- a/docs/source/getting-started/installation.rst +++ b/docs/source/getting-started/installation.rst @@ -164,6 +164,10 @@ section :ref:`below `. If your clouds show ``enabled`` --- |:tada:| |:tada:| **Congratulations!** |:tada:| |:tada:| You can now head over to :ref:`Quickstart ` to get started with SkyPilot. +.. tip:: + + To check credentials only for specific clouds, pass the clouds as arguments: :code:`sky check aws gcp` + .. _cloud-account-setup: Cloud account setup @@ -308,9 +312,10 @@ Cudo Compute ~~~~~~~~~~~~~~~~~~ `Cudo Compute `__ GPU cloud provides low-cost GPUs powered by green energy. - -1. Create an API Key by following `this guide `__. -2. Download and install the `cudoctl `__ command line tool +1. Create a billing account by following `this guide `__. +2. Create a project ``__. +3. Create an API Key by following `this guide `__. +4. Download and install the `cudoctl `__ command line tool -3. Run :code:`cudoctl init`: +5. Run :code:`cudoctl init`: .. code-block:: shell @@ -322,7 +327,7 @@ Cudo Compute ✔ context: default config file saved ~/.config/cudo/cudo.yml - pip install "cudocompute>=0.1.8" + pip install "cudo-compute>=0.1.10" If you want to use SkyPilot with a different Cudo Compute account or project, just run :code:`cudoctl init` again. diff --git a/docs/source/images/gcp-vm.png b/docs/source/images/gcp-vm.png new file mode 100644 index 00000000000..262a47318fd Binary files /dev/null and b/docs/source/images/gcp-vm.png differ diff --git a/docs/source/images/job-dashboard.png b/docs/source/images/job-dashboard.png new file mode 100644 index 00000000000..6c25d7d83bc Binary files /dev/null and b/docs/source/images/job-dashboard.png differ diff --git a/docs/source/images/k8s-pod.png b/docs/source/images/k8s-pod.png new file mode 100644 index 00000000000..d87eb168d5b Binary files /dev/null and b/docs/source/images/k8s-pod.png differ diff --git a/docs/source/images/managed-jobs-arch.png b/docs/source/images/managed-jobs-arch.png new file mode 100644 index 00000000000..c78f0680331 Binary files /dev/null and b/docs/source/images/managed-jobs-arch.png differ diff --git a/docs/source/images/spot-controller.png b/docs/source/images/spot-controller.png deleted file mode 100644 index fce9a1d8cc2..00000000000 Binary files a/docs/source/images/spot-controller.png and /dev/null differ diff --git a/docs/source/images/spot-dashboard.png b/docs/source/images/spot-dashboard.png deleted file mode 100644 index 2b322418350..00000000000 Binary files a/docs/source/images/spot-dashboard.png and /dev/null differ diff --git a/docs/source/reference/cli.rst b/docs/source/reference/cli.rst index d4c45ce7b82..985f63482b6 100644 --- a/docs/source/reference/cli.rst +++ b/docs/source/reference/cli.rst @@ -3,8 +3,8 @@ Command Line Interface ====================== -Core CLI ---------- +Cluster CLI +----------- .. _sky-launch: .. click:: sky.cli:launch :prog: sky launch :nested: full @@ -41,9 +41,6 @@ Core CLI :prog: sky autostop :nested: full -Job Queue CLI --------------- - .. _sky-queue: .. click:: sky.cli:queue :prog: sky queue :nested: full @@ -59,7 +56,31 @@ Job Queue CLI :prog: sky cancel :nested: full -Sky Serve CLI +Managed (Spot) Jobs CLI +--------------------------- + +.. _sky-job-launch: +.. click:: sky.cli:jobs_launch + :prog: sky jobs launch + :nested: full + +.. _sky-job-queue: +.. click:: sky.cli:jobs_queue + :prog: sky jobs queue + :nested: full + +.. _sky-job-cancel: +.. 
click:: sky.cli:jobs_cancel + :prog: sky jobs cancel + :nested: full + +.. _sky-job-logs: +.. click:: sky.cli:jobs_logs + :prog: sky jobs logs + :nested: full + + +SkyServe CLI ------------- .. click:: sky.cli:serve_up @@ -82,28 +103,6 @@ Sky Serve CLI :prog: sky serve update :nested: full -Managed Spot Jobs CLI ---------------------------- - -.. _sky-spot-launch: -.. click:: sky.cli:spot_launch - :prog: sky spot launch - :nested: full - -.. _sky-spot-queue: -.. click:: sky.cli:spot_queue - :prog: sky spot queue - :nested: full - -.. _sky-spot-cancel: -.. click:: sky.cli:spot_cancel - :prog: sky spot cancel - :nested: full - -.. _sky-spot-logs: -.. click:: sky.cli:spot_logs - :prog: sky spot logs - :nested: full Storage CLI ------------ diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst index 3e1214d958d..74cd2c01092 100644 --- a/docs/source/reference/config.rst +++ b/docs/source/reference/config.rst @@ -14,12 +14,12 @@ Available fields and semantics: .. code-block:: yaml - # Custom spot controller resources (optional). + # Custom managed jobs controller resources (optional). # - # These take effects only when a spot controller does not already exist. + # These take effects only when a managed jobs controller does not already exist. # - # Ref: https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html#customizing-spot-controller-resources - spot: + # Ref: https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html#customizing-job-controller-resources + jobs: controller: resources: # same spec as 'resources' in a task YAML cloud: gcp @@ -27,6 +27,19 @@ Available fields and semantics: cpus: 4+ # number of vCPUs, max concurrent spot jobs = 2 * cpus disk_size: 100 + # Allow list for clouds to be used in `sky check` + # + # This field is used to restrict the clouds that SkyPilot will check and use + # when running `sky check`. Any cloud already enabled but not specified here + # will be disabled on the next `sky check` run. + # If this field is not set, SkyPilot will check and use all supported clouds. + # + # Default: null (use all supported clouds). + allowed_clouds: + - aws + - gcp + - kubernetes + # Advanced AWS configurations (optional). # Apply to all new instances but not existing ones. aws: @@ -36,7 +49,7 @@ Available fields and semantics: # # Users should guarantee that these key-values are valid AWS tags, otherwise # errors from the cloud provider will be surfaced. - instance_tags: + labels: # (Example) AWS Migration Acceleration Program (MAP). This tag enables the # program's discounts. # Ref: https://docs.aws.amazon.com/mgn/latest/ug/map-program-tagging.html @@ -109,24 +122,39 @@ Available fields and semantics: # permission to create a security group. security_group_name: my-security-group - # Identity to use for all AWS instances (optional). + # Identity to use for AWS instances (optional). # # LOCAL_CREDENTIALS: The user's local credential files will be uploaded to # AWS instances created by SkyPilot. They are used for accessing cloud # resources (e.g., private buckets) or launching new instances (e.g., for - # spot/serve controllers). + # jobs/serve controllers). # # SERVICE_ACCOUNT: Local credential files are not uploaded to AWS # instances. SkyPilot will auto-create and reuse a service account (IAM # role) for AWS instances. # + # Customized service account (IAM role): or + # - : apply the service account with the specified name to all instances. 
+ # Example: + # remote_identity: my-service-account-name + # - : A list of single-element dict mapping from the cluster name (pattern) + # to the service account name to use. The matching of the cluster name is done in the same order + # as the list. + # NOTE: If none of the wildcard expressions in the dict match the cluster name, LOCAL_CREDENTIALS will be used. + # To specify your default, use "*" as the wildcard expression. + # Example: + # remote_identity: + # - my-cluster-name: my-service-account-1 + # - sky-serve-controller-*: my-service-account-2 + # - "*": my-default-service-account + # # Two caveats of SERVICE_ACCOUNT for multicloud users: # # - This only affects AWS instances. Local AWS credentials will still be # uploaded to non-AWS instances (since those instances may need to access # AWS resources). - # - If the SkyPilot spot/serve controller is on AWS, this setting will make - # non-AWS managed spot jobs / non-AWS service replicas fail to access any + # - If the SkyPilot jobs/serve controller is on AWS, this setting will make + # non-AWS managed jobs / non-AWS service replicas fail to access any # resources on AWS (since the controllers don't have AWS credential # files to assign to these non-AWS instances). # @@ -142,9 +170,9 @@ Available fields and semantics: # # Users should guarantee that these key-values are valid GCP labels, otherwise # errors from the cloud provider will be surfaced. - instance_tags: + labels: Owner: user-unique-name - my-tag: my-value + my-label: my-value # VPC to use (optional). # @@ -190,21 +218,21 @@ Available fields and semantics: # Reserved capacity (optional). - # + # # Whether to prioritize reserved instance types/locations (considered as 0 # cost) in the optimizer. - # + # # If you have "automatically consumed" reservations in your GCP project: # Setting this to true guarantees the optimizer will pick any matching # reservation and GCP will auto consume your reservation, and setting to # false means optimizer uses regular, non-zero pricing in optimization (if # by chance any matching reservation is selected, GCP still auto consumes # the reservation). - # + # # If you have "specifically targeted" reservations (set by the # `specific_reservations` field below): This field will automatically be set # to true. - # + # # Default: false. prioritize_reservations: false # @@ -248,7 +276,7 @@ Available fields and semantics: # LOCAL_CREDENTIALS: The user's local credential files will be uploaded to # GCP instances created by SkyPilot. They are used for accessing cloud # resources (e.g., private buckets) or launching new instances (e.g., for - # spot/serve controllers). + # jobs/serve controllers). # # SERVICE_ACCOUNT: Local credential files are not uploaded to GCP # instances. SkyPilot will auto-create and reuse a service account for GCP @@ -259,8 +287,8 @@ Available fields and semantics: # - This only affects GCP instances. Local GCP credentials will still be # uploaded to non-GCP instances (since those instances may need to access # GCP resources). - # - If the SkyPilot spot/serve controller is on GCP, this setting will make - # non-GCP managed spot jobs / non-GCP service replicas fail to access any + # - If the SkyPilot jobs/serve controller is on GCP, this setting will make + # non-GCP managed jobs / non-GCP service replicas fail to access any # resources on GCP (since the controllers don't have GCP credential # files to assign to these non-GCP instances). 
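 + # + # For example, to avoid uploading local GCP credentials and rely on an + # auto-created service account instead (illustrative): + # remote_identity: SERVICE_ACCOUNT 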
# @@ -286,8 +314,7 @@ Available fields and semantics: # The mode to use for opening ports on Kubernetes # - # This must be either: 'ingress' or 'loadbalancer'. If not specified, - # defaults to 'loadbalancer'. + # This must be either: 'loadbalancer', 'ingress' or 'podip'. # # loadbalancer: Creates services of type `LoadBalancer` to expose ports. # See https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service. @@ -298,8 +325,40 @@ Available fields and semantics: # Requires an Nginx ingress controller to be configured on the Kubernetes cluster. # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress # for details on deploying the NGINX ingress controller. + # + # podip: Directly returns the IP address of the pod. This mode does not + # create any Kubernetes services and is a lightweight way to expose ports. + # NOTE - ports exposed with podip mode are not accessible from outside the + # Kubernetes cluster. This mode is useful for hosting internal services + # that need to be accessed only by other pods in the same cluster. + # + # Default: loadbalancer ports: loadbalancer + # Identity to use for all Kubernetes pods (optional). + # + # LOCAL_CREDENTIALS: The user's local ~/.kube/config will be uploaded to the + # Kubernetes pods created by SkyPilot. They are used for authenticating with + # the Kubernetes API server and launching new pods (e.g., for + # spot/serve controllers). + # + # SERVICE_ACCOUNT: Local ~/.kube/config is not uploaded to Kubernetes pods. + # SkyPilot will auto-create and reuse a service account with necessary roles + # in the user's namespace. + # + # : The name of a service account to use for all Kubernetes pods. + # This service account must exist in the user's namespace and have all + # necessary permissions. Refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html + # for details on the roles required by the service account. + # + # Using SERVICE_ACCOUNT or a custom service account only affects Kubernetes + # instances. Local ~/.kube/config will still be uploaded to non-Kubernetes + # instances (e.g., a serve controller on GCP or AWS may need to provision + # Kubernetes resources). + # + # Default: 'SERVICE_ACCOUNT'. + remote_identity: my-k8s-service-account + # Attach custom metadata to Kubernetes objects created by SkyPilot # # Uses the same schema as Kubernetes metadata object: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#objectmeta-v1-meta @@ -330,6 +389,25 @@ Available fields and semantics: # Default: 10 seconds provision_timeout: 10 + # Autoscaler configured in the Kubernetes cluster (optional) + # + # This field informs SkyPilot about the cluster autoscaler used in the + # Kubernetes cluster. Setting this field disables pre-launch checks for + # GPU capacity in the cluster and SkyPilot relies on the autoscaler to + # provision nodes with the required GPU capacity. + # + # Remember to set provision_timeout accordingly when using an autoscaler. 
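 + # + # For example, an autoscaled GKE cluster might pair this hint with a + # longer provisioning timeout (both values below are illustrative): + # autoscaler: gke + # provision_timeout: 900 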
+ # + # Supported values: gke, karpenter, generic + # gke: uses cloud.google.com/gke-accelerator label to identify GPUs on nodes + # karpenter: uses karpenter.k8s.aws/instance-gpu-name label to identify GPUs on nodes + # generic: uses skypilot.co/accelerator labels to identify GPUs on nodes + # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#setting-up-gpu-support + # for more details on setting up labels for GPU support. + # + # Default: null (no autoscaler, autodetect label format for GPU nodes) + autoscaler: gke + # Additional fields to override the pod fields used by SkyPilot (optional) # # Any key:value pairs added here would get added to the pod spec used to diff --git a/docs/source/reference/job-queue.rst b/docs/source/reference/job-queue.rst index 47be2012365..6397c7bbbb6 100644 --- a/docs/source/reference/job-queue.rst +++ b/docs/source/reference/job-queue.rst @@ -1,7 +1,7 @@ .. _job-queue: -Job Queue -========= +Cluster Job Queue +================= SkyPilot's **job queue** allows multiple jobs to be scheduled on a cluster. diff --git a/docs/source/reference/kubernetes/index.rst b/docs/source/reference/kubernetes/index.rst index 1edfde01240..4087fd968be 100644 --- a/docs/source/reference/kubernetes/index.rst +++ b/docs/source/reference/kubernetes/index.rst @@ -3,247 +3,107 @@ Running on Kubernetes ============================= -.. note:: - Kubernetes support is under active development. `Please share your feedback `_ - or `directly reach out to the development team `_ - for feature requests and more. - SkyPilot tasks can be run on your private on-prem or cloud Kubernetes clusters. The Kubernetes cluster gets added to the list of "clouds" in SkyPilot and SkyPilot tasks can be submitted to your Kubernetes cluster just like any other cloud provider. -**Benefits of using SkyPilot to run jobs on your Kubernetes cluster:** - -* Get SkyPilot features (setup management, job execution, queuing, logging, SSH access) on your Kubernetes resources -* Replace complex Kubernetes manifests with simple SkyPilot tasks -* Seamlessly "burst" jobs to the cloud if your Kubernetes cluster is congested -* Retain observability and control over your cluster with your existing Kubernetes tools - -**Supported Kubernetes deployments:** - -* Hosted Kubernetes services (EKS, GKE) -* On-prem clusters (Kubeadm, Rancher) -* Local development clusters (KinD, minikube) - - -Kubernetes Cluster Requirements +Why use SkyPilot on Kubernetes? ------------------------------- -To connect and use a Kubernetes cluster, SkyPilot needs: - -* An existing Kubernetes cluster running Kubernetes v1.20 or later. -* A `Kubeconfig `_ file containing access credentials and namespace to be used. - -In a typical workflow: - -1. A cluster administrator sets up a Kubernetes cluster. Detailed admin guides for - different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) are included in the :ref:`Kubernetes cluster setup guide `. - -2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig - files containing their credentials (`kube-context `_). - SkyPilot reads this Kubeconfig file to communicate with the cluster. - -Submitting SkyPilot tasks to Kubernetes Clusters ------------------------------------------------- -.. _kubernetes-instructions: - -Once your cluster administrator has :ref:`setup a Kubernetes cluster ` and provided you with a kubeconfig file: - -0. 
Make sure `kubectl `_, ``socat`` and ``nc`` (netcat) are installed on your local machine. - - .. code-block:: console - - $ # MacOS - $ brew install kubectl socat netcat - - $ # Linux (may have socat already installed) - $ sudo apt-get install kubectl socat netcat - - -1. Place your kubeconfig file at ``~/.kube/config``. - - .. code-block:: console - - $ mkdir -p ~/.kube - $ cp /path/to/kubeconfig ~/.kube/config - - You can verify your credentials are setup correctly by running :code:`kubectl get pods`. - -2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. - - .. code-block:: console - - $ sky check - - Checking credentials to enable clouds for SkyPilot. - ... - Kubernetes: enabled - ... - - - .. note:: - :code:`sky check` will also check if GPU support is available on your cluster. If GPU support is not available, it - will show the reason. - To setup GPU support on the cluster, refer to the :ref:`Kubernetes cluster setup guide `. - -4. You can now run any SkyPilot task on your Kubernetes cluster. - - .. code-block:: console +.. tab-set:: - $ sky launch --cpus 2+ task.yaml - == Optimizer == - Target: minimizing cost - Estimated cost: $0.0 / hour + .. tab-item:: For AI Developers + :sync: why-ai-devs-tab - Considered resources (1 node): - --------------------------------------------------------------------------------------------------- - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - --------------------------------------------------------------------------------------------------- - Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔ - AWS m6i.large 2 8 - us-east-1 0.10 - Azure Standard_D2s_v5 2 8 - eastus 0.10 - GCP n2-standard-2 2 8 - us-central1 0.10 - IBM bx2-8x32 8 32 - us-east 0.38 - Lambda gpu_1x_a10 30 200 A10:1 us-east-1 0.60 - ---------------------------------------------------------------------------------------------------. + .. grid:: 2 + :gutter: 3 + .. grid-item-card:: ✅ Ease of use + :text-align: center -.. note:: - SkyPilot will use the cluster and namespace set in the ``current-context`` in the - kubeconfig file. To manage your ``current-context``: + .. + TODO(romilb): We should have a comparison of a popular Kubernetes manifest vs a SkyPilot YAML in terms of LoC in a mini blog and link it here. - .. code-block:: console + No complex kubernetes manifests - write a simple SkyPilot YAML and run with one command ``sky launch``. - $ # See current context - $ kubectl config current-context + .. grid-item-card:: 📋 Interactive development on Kubernetes + :text-align: center - $ # Switch current-context - $ kubectl config use-context mycontext + :ref:`SSH access to pods `, :ref:`VSCode integration `, :ref:`job management `, :ref:`autodown idle pods ` and more. - $ # Set a specific namespace to be used in the current-context - $ kubectl config set-context --current --namespace=mynamespace + .. grid-item-card:: ☁️ Burst to the cloud + :text-align: center + Kubernetes cluster is full? SkyPilot :ref:`seamlessly gets resources on the cloud ` to get your job running sooner. -Using Custom Images -------------------- -By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. + .. grid-item-card:: 🖼 Run popular models on Kubernetes + :text-align: center -To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. + Train and serve `Llama-3 `_, `Mixtral `_, and more on your Kubernetes with ready-to-use recipes from the :ref:`AI gallery `. -.. 
code-block:: yaml - resources: - image_id: docker:myrepo/myimage:latest - ... + .. tab-item:: For Infrastructure Admins + :sync: why-admins-tab -Your image must satisfy the following requirements: + .. grid:: 2 + :gutter: 3 -* Image must be **debian-based** and must have the apt package manager installed. -* The default user in the image must have root privileges or passwordless sudo access. + .. grid-item-card:: ☁️ Unified platform for all Infrastructure + :text-align: center -.. note:: + Scale beyond your Kubernetes cluster to capacity on :ref:`across clouds and regions ` without manual intervention. - If your cluster runs on non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at :code:`Start streaming logs ...`. See `GitHub issue `_ for more. + .. grid-item-card:: 🚯️ Minimize resource wastage + :text-align: center -Using Images from Private Repositories -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so: + SkyPilot can run with your custom pod scheduler and automatically terminate idle pods to free up resources for other users. -.. code-block:: yaml + .. grid-item-card:: 👀 Observability + :text-align: center - kubernetes: - pod_config: - spec: - imagePullSecrets: - - name: your-secret-here + Works with your existing observability and monitoring tools, such as the :ref:`Kubernetes Dashboard `. -.. tip:: + .. grid-item-card:: 🍽️ Self-serve infra for your teams + :text-align: center - If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using `k8s-ecr-login-renew `_ to automatically refresh your secrets. + Reduce operational overhead by letting your teams provision their own resources, while you retain control over the Kubernetes cluster. -Opening Ports -------------- +Table of Contents +----------------- -Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes: +.. grid:: 1 1 3 3 + :gutter: 3 -1. `LoadBalancer services `_ (default) -2. `Nginx IngressController `_ + .. grid-item-card:: 👋 Get Started + :link: kubernetes-getting-started + :link-type: ref + :text-align: center -One of these modes must be supported and configured on your cluster. Refer to the :ref:`setting up ports on Kubernetes guide ` on how to do this. + Already have a kubeconfig? Launch your first SkyPilot task on Kubernetes - it's as simple as ``sky launch``. -.. tip:: + .. grid-item-card:: ⚙️ Cluster Configuration + :link: kubernetes-setup + :link-type: ref + :text-align: center - On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed. + Are you a cluster admin? Find cluster deployment guides and setup instructions here. -Once your cluster is configured, launch a task which exposes services on a port by adding :code:`ports` to the :code:`resources` section of your task YAML. + .. grid-item-card:: 🔍️ Troubleshooting + :link: kubernetes-troubleshooting + :link-type: ref + :text-align: center -.. code-block:: yaml + Running into problems with SkyPilot on your Kubernetes cluster? Find common issues and solutions here. 
- # task.yaml - resources: - ports: 8888 - run: | - python -m http.server 8888 - -After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can get the URL to access the port using :code:`sky status --endpoints myclus`. - -.. code-block:: bash - - # List all ports exposed by the cluster - $ sky status --endpoints myclus - 8888: 34.173.13.241:8888 - - # curl a specific port's endpoint - $ curl $(sky status --endpoint 8888 myclus) - ... - -.. tip:: - - To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports `. - -FAQs ----- - -* **Are autoscaling Kubernetes clusters supported?** - - To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. We are working on a better interface to adjust this timeout - stay tuned! - -* **Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?** - - The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster. - -* **I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?** - - For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. SkyPilot will use the namespace set in the kubeconfig for running all tasks. - -* **How can I specify custom configuration for the pods created by SkyPilot?** - - You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`. - The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API `_. - - For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file: - - .. code-block:: yaml +.. toctree:: + :hidden: - kubernetes: - pod_config: - spec: - containers: - - env: - - name: MY_ENV_VAR - value: MY_ENV_VALUE - volumeMounts: # Custom volume mounts for the pod - - mountPath: /foo - name: example-volume - volumes: - - name: example-volume - hostPath: - path: /tmp - type: Directory + kubernetes-getting-started + kubernetes-setup + kubernetes-troubleshooting - For more details refer to :ref:`config-yaml`. Features and Roadmap -------------------- @@ -256,11 +116,4 @@ Kubernetes support is under active development. Some features are in progress an * Multi-node tasks - ✅ Available * Custom images - ✅ Available * Opening ports and exposing services - ✅ Available -* Multiple Kubernetes Clusters - 🚧 In progress - - -.. toctree:: - :hidden: - - kubernetes-setup - kubernetes-troubleshooting +* Multiple Kubernetes Clusters - 🚧 In progress \ No newline at end of file diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst new file mode 100644 index 00000000000..99a777ce1c0 --- /dev/null +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -0,0 +1,235 @@ +.. _kubernetes-getting-started: + +Getting Started on Kubernetes +============================= + +Prerequisites +------------- + +To connect and use a Kubernetes cluster, SkyPilot needs: + +* An existing Kubernetes cluster running Kubernetes v1.20 or later. 
+* A `Kubeconfig `_ file containing access credentials and namespace to be used. + +**Supported Kubernetes deployments:** + +* Hosted Kubernetes services (EKS, GKE) +* On-prem clusters (Kubeadm, Rancher, K3s) +* Local development clusters (KinD, minikube) + +In a typical workflow: + +1. A cluster administrator sets up a Kubernetes cluster. Refer to admin guides for + :ref:`Kubernetes cluster setup ` for different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) and :ref:`required permissions `. + +2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig + files containing their credentials (`kube-context `_). + SkyPilot reads this Kubeconfig file to communicate with the cluster. + +Launching your first task +------------------------- +.. _kubernetes-instructions: + +Once your cluster administrator has :ref:`setup a Kubernetes cluster ` and provided you with a kubeconfig file: + +0. Make sure `kubectl `_, ``socat`` and ``nc`` (netcat) are installed on your local machine. + + .. code-block:: console + + $ # MacOS + $ brew install kubectl socat netcat + + $ # Linux (may have socat already installed) + $ sudo apt-get install kubectl socat netcat + + +1. Place your kubeconfig file at ``~/.kube/config``. + + .. code-block:: console + + $ mkdir -p ~/.kube + $ cp /path/to/kubeconfig ~/.kube/config + + You can verify your credentials are setup correctly by running :code:`kubectl get pods`. + +2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. + + .. code-block:: console + + $ sky check + + Checking credentials to enable clouds for SkyPilot. + ... + Kubernetes: enabled + ... + + + .. note:: + :code:`sky check` will also check if GPU support is available on your cluster. If GPU support is not available, it + will show the reason. + To setup GPU support on the cluster, refer to the :ref:`Kubernetes cluster setup guide `. + +.. _kubernetes-optimizer-table: + +4. You can now run any SkyPilot task on your Kubernetes cluster. + + .. code-block:: console + + $ sky launch --cpus 2+ task.yaml + == Optimizer == + Target: minimizing cost + Estimated cost: $0.0 / hour + + Considered resources (1 node): + --------------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + --------------------------------------------------------------------------------------------------- + Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔ + AWS m6i.large 2 8 - us-east-1 0.10 + Azure Standard_D2s_v5 2 8 - eastus 0.10 + GCP n2-standard-2 2 8 - us-central1 0.10 + IBM bx2-8x32 8 32 - us-east 0.38 + Lambda gpu_1x_a10 30 200 A10:1 us-east-1 0.60 + ---------------------------------------------------------------------------------------------------. + + +.. note:: + SkyPilot will use the cluster and namespace set in the ``current-context`` in the + kubeconfig file. To manage your ``current-context``: + + .. code-block:: console + + $ # See current context + $ kubectl config current-context + + $ # Switch current-context + $ kubectl config use-context mycontext + + $ # Set a specific namespace to be used in the current-context + $ kubectl config set-context --current --namespace=mynamespace + + +Using Custom Images +------------------- +By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. + +To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. + +.. 
code-block:: yaml + + resources: + image_id: docker:myrepo/myimage:latest + ... + +Your image must satisfy the following requirements: + +* Image must be **debian-based** and must have the apt package manager installed. +* The default user in the image must have root privileges or passwordless sudo access. + +.. note:: + + If your cluster runs on non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at :code:`Start streaming logs ...`. See `GitHub issue `_ for more. + +Using Images from Private Repositories +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so: + +.. code-block:: yaml + + kubernetes: + pod_config: + spec: + imagePullSecrets: + - name: your-secret-here + +.. tip:: + + If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using `k8s-ecr-login-renew `_ to automatically refresh your secrets. + + +Opening Ports +------------- + +Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes: + +1. `LoadBalancer services `_ (default) +2. `Nginx IngressController `_ + +One of these modes must be supported and configured on your cluster. Refer to the :ref:`setting up ports on Kubernetes guide ` on how to do this. + +.. tip:: + + On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed. + +Once your cluster is configured, launch a task which exposes services on a port by adding :code:`ports` to the :code:`resources` section of your task YAML. + +.. code-block:: yaml + + # task.yaml + resources: + ports: 8888 + + run: | + python -m http.server 8888 + +After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can get the URL to access the port using :code:`sky status --endpoints myclus`. + +.. code-block:: bash + + # List all ports exposed by the cluster + $ sky status --endpoints myclus + 8888: 34.173.13.241:8888 + + # curl a specific port's endpoint + $ curl $(sky status --endpoint 8888 myclus) + ... + +.. tip:: + + To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports `. + +FAQs +---- + +* **Are autoscaling Kubernetes clusters supported?** + + To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. We are working on a better interface to adjust this timeout - stay tuned! + +* **Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?** + + The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster. + +* **I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?** + + For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. SkyPilot will use the namespace set in the kubeconfig for running all tasks. 
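 + + For example, a cluster admin could create one namespace per team and issue + kubeconfigs scoped to it (the namespace name below is illustrative): + + .. code-block:: console + + $ # Create an isolated namespace for one team + $ kubectl create namespace team-a + + $ # Pin the team's kubeconfig context to that namespace + $ kubectl config set-context --current --namespace=team-a 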
+ +* **How do I view the pods created by SkyPilot on my Kubernetes cluster?** + + You can use your existing observability tools to filter resources with the label :code:`parent=skypilot` (:code:`kubectl get pods -l 'parent=skypilot'`). As an example, follow the instructions :ref:`here ` to deploy the Kubernetes Dashboard on your cluster. + +* **How can I specify custom configuration for the pods created by SkyPilot?** + + You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`. + The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API `_. + + For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file: + + .. code-block:: yaml + + kubernetes: + pod_config: + spec: + containers: + - env: + - name: MY_ENV_VAR + value: MY_ENV_VALUE + volumeMounts: # Custom volume mounts for the pod + - mountPath: /foo + name: example-volume + volumes: + - name: example-volume + hostPath: + path: /tmp + type: Directory + + For more details refer to :ref:`config-yaml`. diff --git a/docs/source/reference/kubernetes/kubernetes-setup.rst b/docs/source/reference/kubernetes/kubernetes-setup.rst index fca4d539327..4acf271bdca 100644 --- a/docs/source/reference/kubernetes/kubernetes-setup.rst +++ b/docs/source/reference/kubernetes/kubernetes-setup.rst @@ -18,7 +18,7 @@ SkyPilot's Kubernetes support is designed to work with most Kubernetes distribut To connect to a Kubernetes cluster, SkyPilot needs: * An existing Kubernetes cluster running Kubernetes v1.20 or later. -* A `Kubeconfig `_ file containing access credentials and namespace to be used. +* A `Kubeconfig `_ file containing access credentials and namespace to be used. To reduce the permissions for a user, check :ref:`required permissions guide`. Deployment Guides @@ -382,7 +382,7 @@ To use this mode: # ingress-nginx-controller LoadBalancer 10.24.4.254 35.202.58.117 80:31253/TCP,443:32699/TCP .. note:: - If the ``EXTERNAL-IP`` field is ````, you must manually assign an External IP. + If the ``EXTERNAL-IP`` field is ````, you may manually assign it an External IP. This can be done by patching the service with an IP that can be accessed from outside the cluster. If the service type is ``NodePort``, you can set the ``EXTERNAL-IP`` to any node's IP address: @@ -395,6 +395,22 @@ To use this mode: If the ``EXTERNAL-IP`` field is left as ````, SkyPilot will use ``localhost`` as the external IP for the Ingress, and the endpoint may not be accessible from outside the cluster. +.. note:: + If you cannot update the ``EXTERNAL-IP`` field of the service, you can also + specify the Ingress IP or hostname through the ``skypilot.co/external-ip`` + annotation on the ``ingress-nginx-controller`` service. In this case, + having a valid ``EXTERNAL-IP`` field is not required. + + For example, if your ``ingress-nginx-controller`` service is ``NodePort``: + + .. code-block:: bash + + # Add skypilot.co/external-ip annotation to the nginx ingress service. + # Replace in the following command with the IP you select. + # Can be any node's IP if using NodePort service type. + $ kubectl annotate service ingress-nginx-controller skypilot.co/external-ip= -n ingress-nginx + + 3. Update the :ref:`SkyPilot config ` at :code:`~/.sky/config` to use the ingress mode. .. 
code-block:: yaml diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst index 782fd9fd208..1e56240989c 100644 --- a/docs/source/reference/yaml-spec.rst +++ b/docs/source/reference/yaml-spec.rst @@ -92,10 +92,21 @@ Available fields: # If unspecified, defaults to False (on-demand instances). use_spot: False - # The recovery strategy for spot jobs (optional). - # `use_spot` must be True for this to have any effect. For now, only - # `FAILOVER` strategy is supported. - spot_recovery: none + # The recovery strategy for managed jobs (optional). + # + # In effect for managed jobs. Possible values are `FAILOVER` and `EAGER_NEXT_REGION`. + # + # If `FAILOVER` is specified, the job will be restarted in the same region + # if the node fails, and go to the next region if no available resources + # are found in the same region. + # + # If `EAGER_NEXT_REGION` is specified, the job will go to the next region + # directly if the node fails. This is useful for spot instances, as in + # practice, preemptions in a region usually indicate a shortage of resources + # in that region. + # + # default: EAGER_NEXT_REGION + job_recovery: none # Disk size in GB to allocate for OS (mounted at /). Increase this if you # have a large working directory or tasks that write out large outputs. @@ -208,6 +219,20 @@ Available fields: # To use a more limited but easier to manage tool: # https://github.com/IBM/vpc-img-inst + # Labels to apply to the instances (optional). + # + # If specified, these labels will be applied to the VMs or pods created + # by SkyPilot. These are useful for assigning metadata that may be + # used by external tools. Implementation depends on the chosen cloud - + # On AWS, labels map to instance tags. On GCP, labels map to instance + # labels. On Kubernetes, labels map to pod labels. On other clouds, + # labels are not supported and will be ignored. + # + # Note: Labels are applied only on the first launch of the cluster. They + # are not updated on subsequent launches. + labels: + my-label: my-value + # Candidate resources (optional). If specified, SkyPilot will only use # these candidate resources to launch the cluster. The fields specified # outside of `any_of`, `ordered` will be used as the default values for diff --git a/docs/source/running-jobs/environment-variables.rst b/docs/source/running-jobs/environment-variables.rst index 16502f70818..2f3427c1bf5 100644 --- a/docs/source/running-jobs/environment-variables.rst +++ b/docs/source/running-jobs/environment-variables.rst @@ -12,6 +12,20 @@ You can specify environment variables to be made available to a task in two ways - The ``envs`` field (dict) in a :ref:`task YAML ` - The ``--env`` flag in the ``sky launch/exec`` :ref:`CLI ` (takes precedence over the above) +.. tip:: + + If an environment variable is required to be specified with `--env` during + ``sky launch/exec``, you can set it to ``null`` in task YAML to raise an + error when it is forgotten to be specified. For example, the ``WANDB_API_KEY`` + and ``HF_TOKEN`` in the following task YAML: + + .. code-block:: yaml + + envs: + WANDB_API_KEY: + HF_TOKEN: null + MYVAR: val + The ``file_mounts``, ``setup``, and ``run`` sections of a task YAML can access the variables via the ``${MYVAR}`` syntax. Using in ``file_mounts`` diff --git a/docs/source/serving/auth.rst b/docs/source/serving/auth.rst new file mode 100644 index 00000000000..91e02a64b07 --- /dev/null +++ b/docs/source/serving/auth.rst @@ -0,0 +1,123 @@ +.. 
 _serve-auth: + +Authorization +============= + +SkyServe provides robust authorization capabilities at the replica level, allowing you to control access to service endpoints with API keys. + +Setup API Keys +-------------- + +SkyServe relies on the authorization of the service running on the underlying service replicas, e.g., the inference engine. We take the vLLM inference engine as an example; it supports static API key authorization via its :code:`--api-key` argument. + +We define a SkyServe service spec for serving a Llama-3 chatbot with vLLM and an API key. In the example YAML below, the authorization token is defined as an environment variable, :code:`AUTH_TOKEN`, and passed both to the service section, so the :code:`readiness_probe` can access the replicas, and to the vLLM entrypoint, which starts the service on the replicas with the API key. + +.. code-block:: yaml + :emphasize-lines: 5,10-11,28 + + # auth.yaml + envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass. + + service: + readiness_probe: + path: /v1/models + headers: + Authorization: Bearer $AUTH_TOKEN + replicas: 2 + + resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} + ports: 8000 + + setup: | + pip install vllm==0.4.0.post1 flash-attn==2.5.7 gradio openai + # python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')" + + run: | + export PATH=$PATH:/sbin + python -m vllm.entrypoints.openai.api_server \ + --model $MODEL_NAME --trust-remote-code \ + --gpu-memory-utilization 0.95 \ + --host 0.0.0.0 --port 8000 \ + --api-key $AUTH_TOKEN + +To deploy the service, run the following command: + +.. code-block:: bash + + HF_TOKEN=xxx AUTH_TOKEN=yyy sky serve up auth.yaml -n auth --env HF_TOKEN --env AUTH_TOKEN + +To send a request to the service endpoint, a service client needs to include the static API key in the request's header: + +.. code-block:: bash + :emphasize-lines: 5 + + $ ENDPOINT=$(sky serve status --endpoint auth) + $ AUTH_TOKEN=yyy + $ curl http://$ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $AUTH_TOKEN" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ], + "stop_token_ids": [128009, 128001] + }' | jq + +.. raw:: html + +
+ + Example output + + +.. code-block:: console + + { + "id": "cmpl-cad2c1a2a6ee44feabed0b28be294d6f", + "object": "chat.completion", + "created": 1716819147, + "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "I'm so glad you asked! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm here to help you with any questions, tasks, or topics you'd like to discuss. I can provide information on a wide range of subjects, from science and history to entertainment and culture. I can also assist with language-related tasks such as language translation, text summarization, and even writing and proofreading. My goal is to provide accurate and helpful responses to your inquiries, while also being friendly and engaging. So, what's on your mind? How can I assist you today?" + }, + "logprobs": null, + "finish_reason": "stop", + "stop_reason": 128009 + } + ], + "usage": { + "prompt_tokens": 26, + "total_tokens": 160, + "completion_tokens": 134 + } + } + +.. raw:: html + +
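 + +You can also call the authorized endpoint from Python. Below is a minimal sketch using the OpenAI Python SDK (``pip install openai``); the ``yyy`` token mirrors the curl example above and is a placeholder: + +.. code-block:: python + + import subprocess + + from openai import OpenAI + + # Fetch the service endpoint, same as `sky serve status --endpoint auth`. + endpoint = subprocess.check_output( + ['sky', 'serve', 'status', '--endpoint', 'auth'], text=True).strip() + + # The SDK sends the key as an `Authorization: Bearer` header. + client = OpenAI(base_url=f'http://{endpoint}/v1', api_key='yyy') + + response = client.chat.completions.create( + model='meta-llama/Meta-Llama-3-8B-Instruct', + messages=[ + {'role': 'system', 'content': 'You are a helpful assistant.'}, + {'role': 'user', 'content': 'Who are you?'}, + ], + ) + print(response.choices[0].message.content) 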
+ +A service client without an API key will not be able to access the service and get a :code:`401 Unauthorized` error: + +.. code-block:: bash + + $ curl http://$ENDPOINT/v1/models + {"error": "Unauthorized"} + + $ curl http://$ENDPOINT/v1/models -H "Authorization: Bearer random-string" + {"error": "Unauthorized"} diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst index eb2daa6ffb8..c00fa427bd6 100644 --- a/docs/source/serving/sky-serve.rst +++ b/docs/source/serving/sky-serve.rst @@ -22,7 +22,7 @@ Why SkyServe? How it works: -- Each service gets an endpoint that automatically redirects requests to its replicas. +- Each service gets an endpoint that automatically distributes requests to its replicas. - Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability. - SkyServe handles the load balancing, recovery, and autoscaling of the replicas. @@ -127,7 +127,7 @@ Run :code:`sky serve up service.yaml` to deploy the service with automatic price If you see the :code:`STATUS` column becomes :code:`READY`, then the service is ready to accept traffic! -Simply ``curl -L`` the service endpoint, which automatically load-balances across the two replicas: +Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas: .. tab-set:: @@ -136,7 +136,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros .. code-block:: console - $ curl -L 3.84.15.251:30001/v1/chat/completions \ + $ curl 3.84.15.251:30001/v1/chat/completions \ -X POST \ -d '{"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "messages": [{"role": "user", "content": "Who are you?"}]}' \ -H 'Content-Type: application/json' @@ -149,7 +149,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros .. code-block:: console - $ curl -L 44.211.131.51:30001/generate \ + $ curl 44.211.131.51:30001/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' @@ -240,7 +240,7 @@ Under the hood, :code:`sky serve up`: #. Launches a controller which handles autoscaling, monitoring and load balancing; #. Returns a Service Endpoint which will be used to accept traffic; #. Meanwhile, the controller provisions replica VMs which later run the services; -#. Once any replica is ready, the requests sent to the Service Endpoint will be **HTTP-redirect** to one of the endpoint replicas. +#. Once any replica is ready, the requests sent to the Service Endpoint will be distributed to one of the endpoint replicas. After the controller is provisioned, you'll see the following in :code:`sky serve status` output: @@ -264,7 +264,7 @@ sending requests to :code:`` (e.g., ``44.201.119.3:30001``): .. code-block:: console - $ curl -L + $ curl My First SkyServe Service @@ -274,12 +274,6 @@ sending requests to :code:`` (e.g., ``44.201.119.3:30001``): -.. note:: - - Since we are using HTTP-redirect, we need to use :code:`curl -L - `. The :code:`curl` command by default won't follow the - redirect. - Tutorial: Serve a Chatbot LLM! ------------------------------ @@ -308,11 +302,12 @@ Let's bring up a real LLM chat service with FastChat + Vicuna. We'll use the `Vi conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' 
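 + # Binding to 127.0.0.1 keeps the worker reachable only from within the replica. 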
 python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' @@ -367,7 +362,7 @@ Send a request using the following cURL command: .. code-block:: console - $ curl -L http:///v1/chat/completions \ + $ curl http:///v1/chat/completions \ -X POST \ -d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \ -H 'Content-Type: application/json' @@ -449,6 +444,11 @@ Autoscaling See :ref:`Autoscaling ` for more information. +Authorization +------------- + +See :ref:`Authorization ` for more information. + SkyServe Architecture --------------------- @@ -467,7 +467,7 @@ SkyServe has a centralized controller VM that manages the deployment of your ser It is composed of the following components: #. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec ` for more information). -#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas. +#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and distributes the requests to one of the replicas. All of these processes share a single controller VM. The controller VM will be launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources ` based on your needs. diff --git a/docs/source/serving/spot-policy.rst b/docs/source/serving/spot-policy.rst new file mode 100644 index 00000000000..1c03dbe7ba4 --- /dev/null +++ b/docs/source/serving/spot-policy.rst @@ -0,0 +1,160 @@ +.. _spot_policy: + +Using Spot Instances for Serving +================================ + +SkyServe supports serving models on a mixture of spot and on-demand replicas with two options: :code:`base_ondemand_fallback_replicas` and :code:`dynamic_ondemand_fallback`. + + +Base on-demand Fallback +----------------------- + +:code:`base_ondemand_fallback_replicas` sets the number of on-demand replicas to keep running at all times, ensuring that some capacity remains available even when no spot replicas can be provisioned. :code:`use_spot` should be set to :code:`true` to enable spot replicas. + +.. code-block:: yaml + + service: + readiness_probe: /health + replica_policy: + min_replicas: 2 + max_replicas: 3 + target_qps_per_replica: 1 + # Ensures that one of the replicas runs on on-demand instances + base_ondemand_fallback_replicas: 1 + + resources: + ports: 8081 + cpus: 2+ + use_spot: true + + workdir: examples/serve/http_server + + run: python3 server.py + + +.. tip:: + + Kubernetes instances are considered on-demand instances. You can use the :code:`base_ondemand_fallback_replicas` option to have some replicas run on Kubernetes, while others run on cloud spot instances. + + +Dynamic on-demand Fallback +-------------------------- + +SkyServe supports dynamically falling back to on-demand replicas when spot replicas are not available. +This is enabled by setting :code:`dynamic_ondemand_fallback` to :code:`true`. 
+This is useful for maintaining the required replica capacity in the case of spot instance interruptions. +When spot replicas are available, SkyServe will automatically switch back to using spot replicas to maximize cost savings. + +.. code-block:: yaml + + service: + readiness_probe: /health + replica_policy: + min_replicas: 2 + max_replicas: 3 + target_qps_per_replica: 1 + # Allows replicas to be run on on-demand instances if spot instances are not available + dynamic_ondemand_fallback: true + + resources: + ports: 8081 + cpus: 2+ + use_spot: true + + workdir: examples/serve/http_server + + run: python3 server.py + + +.. tip:: + + SkyServe supports specifying both :code:`base_ondemand_fallback_replicas` and :code:`dynamic_ondemand_fallback`. Specifying both will set a base number of on-demand replicas and dynamically fall back to on-demand replicas when spot replicas are not available. + +Example +------- + +The following example demonstrates how to use spot replicas with SkyServe, with dynamic fallback enabled. The example is a simple HTTP server that listens on port 8081 with :code:`dynamic_ondemand_fallback: true`. To run: + +.. code-block:: console + + $ sky serve up examples/serve/spot_policy/dynamic_on_demand_fallback.yaml -n http-server + +When the service is up, we can check the status of the service and the replicas using the following command. Initially, we will see: + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 1m 17s NO_REPLICA 0/4 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 1 1 - 1 min ago 1x GCP([Spot]vCPU=2) PROVISIONING us-east1 + http-server 2 1 - 1 min ago 1x GCP([Spot]vCPU=2) PROVISIONING us-central1 + http-server 3 1 - 1 min ago 1x GCP(vCPU=2) PROVISIONING us-east1 + http-server 4 1 - 1 min ago 1x GCP(vCPU=2) PROVISIONING us-central1 +When the required number of spot replicas is not available, SkyServe will provision the number of on-demand replicas needed to meet the target number of replicas. For example, when the target number is 2 and only 1 spot replica is ready, SkyServe will provision 1 on-demand replica to meet the target number of replicas. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 1m 17s READY 2/4 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 1 1 http://34.23.22.160:8081 3 mins ago 1x GCP([Spot]vCPU=2) READY us-east1 + http-server 2 1 http://34.68.226.193:8081 3 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 + http-server 3 1 - 3 mins ago 1x GCP(vCPU=2) SHUTTING_DOWN us-east1 + http-server 4 1 - 3 mins ago 1x GCP(vCPU=2) SHUTTING_DOWN us-central1 +When the spot replicas are ready, SkyServe will automatically scale down on-demand replicas to maximize cost savings. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 3m 59s READY 2/2 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 1 1 http://34.23.22.160:8081 4 mins ago 1x GCP([Spot]vCPU=2) READY us-east1 + http-server 2 1 http://34.68.226.193:8081 4 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 +In the event of spot instance interruptions (e.g. 
replica 1), SkyServe will automatically fall back to on-demand replicas (e.g. launch one on-demand replica) to meet the required capacity of replicas. SkyServe will keep trying to provision one spot replica, so that it can switch back once spot capacity is available again. Note that SkyServe will try different regions and clouds to maximize the chance of successfully provisioning spot instances. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 7m 2s READY 1/3 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 2 1 http://34.68.226.193:8081 7 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 + http-server 5 1 - 13 secs ago 1x GCP([Spot]vCPU=2) PROVISIONING us-central1 + http-server 6 1 - 13 secs ago 1x GCP(vCPU=2) PROVISIONING us-central1 + +Eventually, when spot capacity is available again, SkyServe will automatically scale down on-demand replicas. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 10m 5s READY 2/3 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 2 1 http://34.68.226.193:8081 10 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 + http-server 5 1 http://34.121.49.94:8081 1 min ago 1x GCP([Spot]vCPU=2) READY us-central1 \ No newline at end of file diff --git a/docs/source/serving/user-guides.rst b/docs/source/serving/user-guides.rst index e6e63fd40a5..8b9cba92b45 100644 --- a/docs/source/serving/user-guides.rst +++ b/docs/source/serving/user-guides.rst @@ -5,3 +5,5 @@ Serving User Guides autoscaling update + auth + spot-policy diff --git a/examples/cog/README.md b/examples/cog/README.md index 4fa4890420f..b2193e2e18f 100644 --- a/examples/cog/README.md +++ b/examples/cog/README.md @@ -28,7 +28,7 @@ After the service is launched, access the deployment with the following: ```console ENDPOINT=$(sky serve status --endpoint cog) -curl -L http://$ENDPOINT/predictions -X POST \ +curl http://$ENDPOINT/predictions -X POST \ -H 'Content-Type: application/json' \ -d '{"input": {"image": "https://blog.skypilot.co/introducing-sky-serve/images/sky-serve-thumbnail.png"}}' \ | jq -r '.output | split(",")[1]' | base64 --decode > output.png diff --git a/examples/job_queue/job_docker.yaml b/examples/job_queue/job_docker.yaml index da604125865..37f23f3eef1 100644 --- a/examples/job_queue/job_docker.yaml +++ b/examples/job_queue/job_docker.yaml @@ -8,6 +8,9 @@ name: job_docker +envs: + TIME_TO_SLEEP: 180 + resources: accelerators: T4:0.5 image_id: docker:ubuntu:20.04 @@ -18,7 +21,7 @@ setup: | run: | timestamp=$(date +%s) conda env list - for i in {1..180}; do + for i in $(seq 1 $TIME_TO_SLEEP); do echo "$timestamp $i" sleep 1 done diff --git a/examples/managed_job.yaml b/examples/managed_job.yaml new file mode 100644 index 00000000000..30ad2db287a --- /dev/null +++ b/examples/managed_job.yaml @@ -0,0 +1,17 @@ +name: minimal + +setup: | + echo "running setup" + pip install tqdm + +run: | + conda env list + echo "start counting" + python -u - << EOF + import time + import tqdm + + for i in tqdm.trange(240): + time.sleep(1) + + EOF diff --git a/examples/managed_spot_with_storage.yaml b/examples/managed_job_with_storage.yaml similarity index 83% rename from examples/managed_spot_with_storage.yaml rename to examples/managed_job_with_storage.yaml index 1b81e459bb4..ecefccd8b3d 100644 --- 
a/examples/managed_spot_with_storage.yaml +++ b/examples/managed_job_with_storage.yaml @@ -3,13 +3,13 @@ # Runs a task that uses cloud buckets for uploading and accessing files. # # Usage: -# sky spot launch -c spot-storage examples/managed_spot_with_storage.yaml +# sky spot launch -c spot-storage examples/managed_job_with_storage.yaml # sky down spot-storage resources: cloud: aws use_spot: true - spot_recovery: failover + job_recovery: failover workdir: ./examples @@ -41,8 +41,8 @@ file_mounts: run: | set -ex - ls ~/sky_workdir/managed_spot_with_storage.yaml - ls ~/bucket_workdir/managed_spot_with_storage.yaml + ls ~/sky_workdir/managed_job_with_storage.yaml + ls ~/bucket_workdir/managed_job_with_storage.yaml ls -l /imagenet-image/datasets diff --git a/examples/managed_spot.yaml b/examples/managed_spot.yaml index 4bfcb63f40a..712819eb9ca 100644 --- a/examples/managed_spot.yaml +++ b/examples/managed_spot.yaml @@ -1,5 +1,8 @@ name: minimal +resources: + use_spot: true + setup: | echo "running setup" pip install tqdm diff --git a/examples/serve/gorilla/gorilla.yaml b/examples/serve/gorilla/gorilla.yaml index ee46aa94568..e3072d816fb 100644 --- a/examples/serve/gorilla/gorilla.yaml +++ b/examples/serve/gorilla/gorilla.yaml @@ -35,11 +35,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path gorilla-llm/gorilla-falcon-7b-hf-v0 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/examples/serve/llama2/llama2.yaml b/examples/serve/llama2/llama2.yaml index 5eaaea449d0..42c82ea0cc9 100644 --- a/examples/serve/llama2/llama2.yaml +++ b/examples/serve/llama2/llama2.yaml @@ -25,7 +25,7 @@ resources: envs: MODEL_SIZE: 7 - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. setup: | conda activate chatbot diff --git a/examples/serve/misc/cancel/README.md b/examples/serve/misc/cancel/README.md index 65b88c2d540..61c24383909 100644 --- a/examples/serve/misc/cancel/README.md +++ b/examples/serve/misc/cancel/README.md @@ -1,6 +1,6 @@ # SkyServe cancel example -This example demonstrates the redirect support canceling a request. +This example demonstrates the SkyServe load balancer's support for canceling a request. ## Running the example @@ -33,7 +33,7 @@ Client disconnected, stopping computation. You can also run ```bash -curl -L http:/// +curl http:/// ``` and manually Ctrl + C to cancel the request and see logs. 
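+
+To script the disconnect instead of pressing Ctrl + C by hand, here is a minimal sketch (assuming `$ENDPOINT` holds your service endpoint and GNU coreutils' `timeout` is available):
+
+```bash
+# Disconnect the client after 2 seconds; the replica log should then show
+# "Client disconnected, stopping computation."
+timeout 2 curl http://$ENDPOINT/
+```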
diff --git a/examples/serve/stable_diffusion_service.yaml b/examples/serve/stable_diffusion_service.yaml index 86ef257e7ca..2adaf6e4ca6 100644 --- a/examples/serve/stable_diffusion_service.yaml +++ b/examples/serve/stable_diffusion_service.yaml @@ -18,7 +18,7 @@ file_mounts: /stable_diffusion: examples/stable_diffusion setup: | - sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose + sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose # -L kept: GitHub release downloads redirect sudo chmod +x /usr/local/bin/docker-compose cd stable-diffusion-webui-docker sudo rm -r stable-diffusion-webui-docker diff --git a/examples/serve/vicuna-v1.5.yaml b/examples/serve/vicuna-v1.5.yaml index c94115ea3d7..0f659e85697 100644 --- a/examples/serve/vicuna-v1.5.yaml +++ b/examples/serve/vicuna-v1.5.yaml @@ -34,11 +34,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.5 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/examples/spot_pipeline/bert_qa_train_eval.yaml b/examples/spot_pipeline/bert_qa_train_eval.yaml index 32fb526ca91..62bd34c3b76 100644 --- a/examples/spot_pipeline/bert_qa_train_eval.yaml +++ b/examples/spot_pipeline/bert_qa_train_eval.yaml @@ -42,7 +42,7 @@ run: | echo Model saved to /checkpoint/bert_qa/$SKYPILOT_TASK_ID envs: - WANDB_API_KEY: # NOTE: Fill in your wandb key + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. --- @@ -84,4 +84,4 @@ run: | --save_steps 1000 envs: - WANDB_API_KEY: # NOTE: Fill in your wandb key + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. diff --git a/examples/stable_diffusion/README.md b/examples/stable_diffusion/README.md index 9bec3354848..2a4383f1347 100644 --- a/examples/stable_diffusion/README.md +++ b/examples/stable_diffusion/README.md @@ -8,7 +8,7 @@ 4. Run `ssh -L 7860:localhost:7860 stable-diffusion` -5. Open [`http://localhost:7860/`](http://localhost:7860/) in browser. +5. Open [`http://localhost:7860/`](http://localhost:7860/) in browser. If the page doesn't load, try again in a few minutes to allow the container to start. 6. Type in text prompt and click "Generate". diff --git a/examples/stable_diffusion/stable_diffusion_docker.yaml b/examples/stable_diffusion/stable_diffusion_docker.yaml index 9c07790ba6b..1a830241f14 100644 --- a/examples/stable_diffusion/stable_diffusion_docker.yaml +++ b/examples/stable_diffusion/stable_diffusion_docker.yaml @@ -7,8 +7,6 @@ file_mounts: /stable_diffusion: . 
setup: | - sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose - sudo chmod +x /usr/local/bin/docker-compose cd stable-diffusion-webui-docker sudo rm -r stable-diffusion-webui-docker git clone https://github.com/AbdBarho/stable-diffusion-webui-docker.git @@ -17,9 +15,16 @@ setup: | wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt -P models mv models/sd-v1-4.ckpt models/model.ckpt docker pull berkeleyskypilot/stable-diffusion - rm docker-compose.yml - cp /stable_diffusion/docker-compose.yml . run: | cd stable-diffusion-webui-docker - docker-compose up + docker run -d \ + --name model \ + --restart on-failure \ + -p 7860:7860 \ + -v $(pwd)/cache:/cache \ + -v $(pwd)/output:/output \ + -v $(pwd)/models:/models \ + -e CLI_ARGS='--extra-models-cpu --optimized-turbo' \ + --gpus all \ + berkeleyskypilot/stable-diffusion diff --git a/examples/tpu/tpuvm_mnist.yaml b/examples/tpu/tpuvm_mnist.yaml index 5d40089329f..d4e119119e0 100644 --- a/examples/tpu/tpuvm_mnist.yaml +++ b/examples/tpu/tpuvm_mnist.yaml @@ -5,7 +5,7 @@ resources: # The setup command. Will be run under the working directory. setup: | - git clone https://github.com/google/flax.git --branch v0.6.11 + git clone https://github.com/google/flax.git --branch v0.8.2 conda activate flax if [ $? -eq 0 ]; then @@ -14,8 +14,8 @@ setup: | conda create -n flax python=3.10 -y conda activate flax # Make sure to install TPU related packages in a conda env to avoid package conflicts. - pip install "jax[tpu]==0.4.23" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html - pip install --upgrade clu + pip install "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html + pip install clu pip install -e flax pip install tensorflow tensorflow-datasets fi @@ -24,8 +24,9 @@ setup: | # The command to run. Will be run under the working directory. run: | conda activate flax + pip install clu cd flax/examples/mnist python3 main.py --workdir=/tmp/mnist \ - --config=configs/default.py \ - --config.learning_rate=0.05 \ - --config.num_epochs=10 + --config=configs/default.py \ + --config.learning_rate=0.05 \ + --config.num_epochs=10 diff --git a/llm/axolotl/axolotl-docker.yaml b/llm/axolotl/axolotl-docker.yaml new file mode 100644 index 00000000000..b883ebdde46 --- /dev/null +++ b/llm/axolotl/axolotl-docker.yaml @@ -0,0 +1,29 @@ +# Usage: +# HF_TOKEN=abc sky launch -c axolotl axolotl-docker.yaml --env HF_TOKEN -y -i30 --down + +name: axolotl + +resources: + accelerators: L4:1 + cloud: gcp # optional + +workdir: mistral + +setup: | + docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 + +run: | + docker run --gpus all \ + -v ~/sky_workdir:/sky_workdir \ + -v /root/.cache:/root/.cache \ + winglian/axolotl:main-py3.10-cu118-2.0.1 \ + huggingface-cli login --token ${HF_TOKEN} + + docker run --gpus all \ + -v ~/sky_workdir:/sky_workdir \ + -v /root/.cache:/root/.cache \ + winglian/axolotl:main-py3.10-cu118-2.0.1 \ + accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml + +envs: + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
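+
+# An expanded sketch of the usage line at the top (the token value is a
+# placeholder): `--env HF_TOKEN` forwards the token into `envs` above, `-y`
+# skips the confirmation prompt, and `-i 30 --down` tears the cluster down
+# after 30 idle minutes.
+#
+#   HF_TOKEN=hf_xxx sky launch -c axolotl llm/axolotl/axolotl-docker.yaml \
+#     --env HF_TOKEN -y -i 30 --down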
diff --git a/llm/axolotl/axolotl-spot.yaml b/llm/axolotl/axolotl-spot.yaml index b6c81b742c9..4832fa72c04 100644 --- a/llm/axolotl/axolotl-spot.yaml +++ b/llm/axolotl/axolotl-spot.yaml @@ -12,6 +12,7 @@ resources: accelerators: A100:1 cloud: gcp # optional use_spot: True + image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1 workdir: mistral @@ -20,29 +21,13 @@ file_mounts: name: ${BUCKET} mode: MOUNT -setup: | - docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 - run: | - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - huggingface-cli login --token ${HF_TOKEN} + huggingface-cli login --token ${HF_TOKEN} - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - -v /sky-notebook:/sky-notebook \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - accelerate launch -m axolotl.cli.train /sky_workdir/qlora-checkpoint.yaml + accelerate launch -m axolotl.cli.train qlora-checkpoint.yaml envs: - HF_TOKEN: # TODO: Replace with huggingface token - BUCKET: - - - - - + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + BUCKET: # TODO: Fill with your unique bucket name, or use --env to pass. diff --git a/llm/axolotl/axolotl.yaml b/llm/axolotl/axolotl.yaml index d9cfd91aa6d..b43127c9361 100644 --- a/llm/axolotl/axolotl.yaml +++ b/llm/axolotl/axolotl.yaml @@ -5,31 +5,14 @@ name: axolotl resources: accelerators: L4:1 - cloud: gcp # optional + image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1 workdir: mistral -setup: | - docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 - run: | - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - huggingface-cli login --token ${HF_TOKEN} + huggingface-cli login --token ${HF_TOKEN} - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml + accelerate launch -m axolotl.cli.train qlora.yaml envs: - HF_TOKEN: # TODO: Replace with huggingface token - - - - - - + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
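+
+# For the spot variant (axolotl-spot.yaml above), both envs need values at
+# launch time; one way to pass them (names/values here are placeholders, and
+# BUCKET must be a globally unique bucket name used for checkpoint recovery):
+#
+#   HF_TOKEN=hf_xxx BUCKET=my-axolotl-ckpts \
+#     sky spot launch llm/axolotl/axolotl-spot.yaml --env HF_TOKEN --env BUCKET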
diff --git a/llm/axolotl/mistral/qlora-checkpoint.yaml b/llm/axolotl/mistral/qlora-checkpoint.yaml index 278a5d72b9a..1f1cc67446c 100644 --- a/llm/axolotl/mistral/qlora-checkpoint.yaml +++ b/llm/axolotl/mistral/qlora-checkpoint.yaml @@ -71,6 +71,7 @@ warmup_steps: 10 eval_steps: 0.05 eval_table_size: eval_table_max_new_tokens: 128 +eval_sample_packing: false save_steps: 2 ## increase based on your dataset save_strategy: steps debug: @@ -81,4 +82,4 @@ fsdp_config: special_tokens: bos_token: "" eos_token: "" - unk_token: "" \ No newline at end of file + unk_token: "" diff --git a/llm/axolotl/mistral/qlora.yaml b/llm/axolotl/mistral/qlora.yaml index 42c3742b52d..39b2c55b1ce 100644 --- a/llm/axolotl/mistral/qlora.yaml +++ b/llm/axolotl/mistral/qlora.yaml @@ -69,6 +69,7 @@ warmup_steps: 10 eval_steps: 0.05 eval_table_size: eval_table_max_new_tokens: 128 +eval_sample_packing: false save_steps: debug: deepspeed: @@ -78,4 +79,4 @@ fsdp_config: special_tokens: bos_token: "" eos_token: "" - unk_token: "" \ No newline at end of file + unk_token: "" diff --git a/llm/codellama/README.md b/llm/codellama/README.md index 1ed02e301d1..8e5025d22b5 100644 --- a/llm/codellama/README.md +++ b/llm/codellama/README.md @@ -68,7 +68,7 @@ Launching a cluster 'code-llama'. Proceed? [Y/n]: ```bash IP=$(sky status --ip code-llama) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "codellama/CodeLlama-70b-Instruct-hf", @@ -131,7 +131,7 @@ availability of the service while minimizing the cost. ```bash ENDPOINT=$(sky serve status --endpoint code-llama) -curl -L http://$ENDPOINT/v1/completions \ +curl http://$ENDPOINT/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "codellama/CodeLlama-70b-Instruct-hf", @@ -146,7 +146,7 @@ We can also access the Code Llama service with the openAI Chat API. ```bash ENDPOINT=$(sky serve status --endpoint code-llama) -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "codellama/CodeLlama-70b-Instruct-hf", diff --git a/llm/codellama/gui.yaml b/llm/codellama/gui.yaml index 486e1409aa9..acd11e46871 100644 --- a/llm/codellama/gui.yaml +++ b/llm/codellama/gui.yaml @@ -38,7 +38,9 @@ run: | "model_name": "codellama/CodeLlama-70b-Instruct-hf", "api_base": "http://${ENDPOINT}/v1", "api_key": "empty", - "model_path": "codellama/CodeLlama-70b-Instruct-hf" + "model_path": "codellama/CodeLlama-70b-Instruct-hf", + "anony_only": false, + "api_type": "openai" } } EOF @@ -47,4 +49,4 @@ run: | echo 'Starting gradio server...' python -u -m fastchat.serve.gradio_web_server --share \ - --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log + --register ~/model_info.json | tee ~/gradio.log diff --git a/llm/dbrx/README.md b/llm/dbrx/README.md index b546a323f6f..3011af9d4e6 100644 --- a/llm/dbrx/README.md +++ b/llm/dbrx/README.md @@ -22,7 +22,7 @@ In this recipe, you will serve `databricks/dbrx-instruct` on your own infra -- ```yaml envs: MODEL_NAME: databricks/dbrx-instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. service: replicas: 2 @@ -171,7 +171,7 @@ Wait until the model is ready (this can take 10+ minutes), as indicated by these ... 
(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0% ``` -:tada: **Congratulations!** :tada: You have now launched the DBRX Instruct LLM on your infra. +🎉 **Congratulations!** 🎉 You have now launched the DBRX Instruct LLM on your infra. You can play with the model via - Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`) @@ -256,7 +256,7 @@ ENDPOINT=$(sky serve status --endpoint dbrx) To curl the endpoint: ```console -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "databricks/dbrx-instruct", diff --git a/llm/dbrx/dbrx.yaml b/llm/dbrx/dbrx.yaml index ffa777ab86d..0c9abd06d30 100644 --- a/llm/dbrx/dbrx.yaml +++ b/llm/dbrx/dbrx.yaml @@ -31,7 +31,7 @@ envs: MODEL_NAME: databricks/dbrx-instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. service: replicas: 2 diff --git a/llm/falcon/falcon.yaml b/llm/falcon/falcon.yaml index 256d936d61b..b752db5256b 100644 --- a/llm/falcon/falcon.yaml +++ b/llm/falcon/falcon.yaml @@ -7,7 +7,7 @@ workdir: . envs: MODEL_NAME: tiiuae/falcon-7b # [ybelkada/falcon-7b-sharded-bf16, tiiuae/falcon-7b, tiiuae/falcon-40b] - WANDB_API_KEY: $WANDB_KEY # Change to your own wandb key + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. OUTPUT_BUCKET_NAME: # Set a unique name for the bucket which will store model weights file_mounts: @@ -39,4 +39,4 @@ run: | --bnb_4bit_compute_dtype bfloat16 \ --max_steps 500 \ --dataset_name timdettmers/openassistant-guanaco \ - --output_dir /results \ No newline at end of file + --output_dir /results diff --git a/llm/gemma/README.md b/llm/gemma/README.md index 2801c4fd6f3..ef5027b2807 100644 --- a/llm/gemma/README.md +++ b/llm/gemma/README.md @@ -37,7 +37,7 @@ After the cluster is launched, we can access the model with the following comman ```bash IP=$(sky status --ip gemma) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", @@ -50,7 +50,7 @@ Chat API is also supported: ```bash IP=$(sky status --ip gemma) -curl -L http://$IP:8000/v1/chat/completions \ +curl http://$IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", @@ -78,7 +78,7 @@ After the cluster is launched, we can access the model with the following comman ```bash ENDPOINT=$(sky serve status --endpoint gemma) -curl -L http://$ENDPOINT/v1/completions \ +curl http://$ENDPOINT/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", @@ -89,7 +89,7 @@ curl -L http://$ENDPOINT/v1/completions \ Chat API is also supported: ```bash -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", diff --git a/llm/gemma/serve.yaml b/llm/gemma/serve.yaml index 73f5b9c2b5d..4c5a2c984c5 100644 --- a/llm/gemma/serve.yaml +++ b/llm/gemma/serve.yaml @@ -17,7 +17,7 @@ service: envs: MODEL_NAME: google/gemma-7b-it - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
resources: accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} diff --git a/llm/gpt-2/README.md b/llm/gpt-2/README.md new file mode 100644 index 00000000000..bc9893fec5b --- /dev/null +++ b/llm/gpt-2/README.md @@ -0,0 +1,119 @@ +# Run GPT-2 in llm.c on any cloud with SkyPilot + +This is a reproducible package of llm.c's GPT-2 (124M) training by @karpathy (https://github.com/karpathy/llm.c/discussions/481). +With SkyPilot, you can run GPT-2 (124M) training on any cloud. SkyPilot looks for the cheapest resources available on the clouds enabled for a user, then launches and manages the whole data processing and training pipeline, bringing the cost close to the ~\$20 target that @karpathy mentioned in the discussion. + +## Prerequisites + +1. Install [SkyPilot](https://github.com/skypilot-org/skypilot): +```bash +pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]" # Choose the clouds you want to enable +``` +2. Enable clouds for SkyPilot: +```bash +sky check +``` +Please check the instructions for enabling clouds in the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html). + +3. Download the YAML for starting the training: +```bash +wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2.yaml +``` + +## Run GPT-2 training + +Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 GPUs: + +```bash +sky launch -c gpt2 gpt2.yaml +``` + +![GPT-2 training with 8 A100 GPUs](https://imgur.com/v8SGpsF.png) + +Or, you can train the model with a single A100 by adding `--gpus A100`: +```bash +sky launch -c gpt2 gpt2.yaml --gpus A100 +``` + +![GPT-2 training with a single A100](https://imgur.com/hN65g4r.png) + + +It is also possible to speed up training by using 8 H100s (2.3x more tok/s than 8x A100s): +```bash +sky launch -c gpt2 gpt2.yaml --gpus H100:8 +``` + +![GPT-2 training with 8 H100](https://imgur.com/STbi80b.png) + +### Download logs and visualizations + +After the training is finished, you can download the logs and visualizations with the following command: +```bash +scp -r gpt2:~/llm.c/log124M . +``` +We can visualize the training progress with the notebook provided in [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/vislog.ipynb). (Note: we cut off training after 10K steps, which already achieves a validation loss similar to the OpenAI GPT-2 checkpoint.) + 
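+One possible way to run that notebook against the downloaded logs locally (a sketch, assuming Jupyter is installed and that the notebook picks up `log124M/` from the repo root):
+
+```bash
+git clone https://github.com/karpathy/llm.c.git
+cp -r log124M llm.c/
+jupyter notebook llm.c/dev/vislog.ipynb
+```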
+ +
+ +> Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot. + + + +## Advanced: Run GPT-2 training in two stages + +The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound. Having the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily +separate the data processing and training into two stages and execute them sequentially yourself, or let SkyPilot manage the dependencies between the two stages. + +With this split, data processing can run on cheaper CPU VMs (e.g., ~\$0.4/hour), while training runs on more expensive GPU VMs (e.g., ~\$1.3-\$3.6/hour for a single A100 GPU, or \$10.3-\$32.8/hour for 8 A100 GPUs). + +We can run the data processing on a CPU VM and store the processed data in a cloud bucket. Then, we can run the training on a GPU VM with the processed data. + +```bash +wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml +wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml +``` + +### Run two stages manually +#### Data processing + +Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace `your-bucket-name` with your bucket name): + +```bash +sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name +``` + + +#### Training + +After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name): + +```bash +sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name +``` + +Or, you can train the model with a single A100, by adding `--gpus A100`: +```bash +sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name +``` + + +### Run in a Pipeline + +We can also combine the two steps into a single SkyPilot job and let SkyPilot handle the dependencies between the two steps. Here is an example of how to do this (replace `your-bucket-name` with your bucket name): +```bash +sky jobs launch -n gpt2 gpt2-pipeline.yaml --env BUCKET_NAME=your-bucket-name +``` + +> Note: the pipeline yaml can be retrieved with the following command: +```bash +cat gpt2-data.yaml > gpt2-pipeline.yaml; echo "---" >> gpt2-pipeline.yaml; cat gpt2-train.yaml >> gpt2-pipeline.yaml +``` + +SkyPilot will first download and process the dataset on a CPU VM and store the +processed data in a cloud bucket. Then, it will launch a GPT-2 training job on a +GPU VM. The training job will train GPT-2 (124M) on the processed data. + + + diff --git a/llm/gpt-2/gpt2-data.yaml b/llm/gpt-2/gpt2-data.yaml new file mode 100644 index 00000000000..fc7bb02bf95 --- /dev/null +++ b/llm/gpt-2/gpt2-data.yaml @@ -0,0 +1,34 @@ +name: gpt2-data + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. 
+ +resources: + cpus: 8+ + +file_mounts: + /cache: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: MOUNT + +setup: | + pip install tqdm tiktoken requests datasets + git clone https://github.com/karpathy/llm.c.git || true + # Pin llm.c to the commit this recipe was written against. + (cd llm.c && git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7) + + # Adding revision to fix the dataset version, as the latest fineweb + # dataset removed the samples, causing error: + # Please pass `features` or at least one example when writing data + sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' llm.c/dev/data/fineweb.py + + +run: | + cd llm.c + # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?) + # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B + # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb + python dev/data/fineweb.py --version 10B + + rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/ + rsync -Pavz dev/data/fineweb10B /cache/ diff --git a/llm/gpt-2/gpt2-pipeline.yaml b/llm/gpt-2/gpt2-pipeline.yaml new file mode 100644 index 00000000000..e5ea05f7948 --- /dev/null +++ b/llm/gpt-2/gpt2-pipeline.yaml @@ -0,0 +1,129 @@ +name: gpt2-data + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. + +resources: + cpus: 8+ + +file_mounts: + /cache: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: MOUNT + +setup: | + pip install tqdm tiktoken requests datasets + git clone https://github.com/karpathy/llm.c.git || true + # Pin llm.c to the commit this recipe was written against. + (cd llm.c && git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7) + + # Adding revision to fix the dataset version, as the latest fineweb + # dataset removed the samples, causing error: + # Please pass `features` or at least one example when writing data + sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' llm.c/dev/data/fineweb.py + + +run: | + cd llm.c + # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?) + # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B + # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb + python dev/data/fineweb.py --version 10B + + rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/ + rsync -Pavz dev/data/fineweb10B /cache/ +--- +name: gpt2-train + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. + +resources: + accelerators: A100:8 + # Use a docker image with a recent g++ to enable compiling llm.c. + image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 + any_of: + # Avoid using the docker image for Lambda, because Docker is not supported + # on Lambda yet, but the base image works. 
- cloud: lambda + image_id: null + - cloud: aws + - cloud: gcp + - cloud: azure + - cloud: fluidstack + - cloud: kubernetes + +file_mounts: + ~/.cache/huggingface: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: COPY + +setup: | + cd ~ + + # install cudnn so we can use FlashAttention and run fast (optional) + # https://developer.nvidia.com/cudnn-downloads + # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04 + if [ -f ./CUDNN_INSTALLED ]; then + echo "cudnn already installed" + else + system=$(lsb_release -si | tr '[:upper:]' '[:lower:]') + # Get version and remove the dot + version=$(lsb_release -sr | tr -d .) + export system_version="${system}${version}" + wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb + sudo dpkg -i cudnn-installer.deb + sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/ + # Remove problematic kubernetes.list source + sudo apt-get update --allow-releaseinfo-change || true + + sudo apt-get -y install cudnn-cuda-12 + + touch ./CUDNN_INSTALLED + fi + + # "install" cudnn-frontend to ~/ + sudo apt -y install git + git clone https://github.com/NVIDIA/cudnn-frontend.git || true + + # install MPI (optional, if you intend to use multiple GPUs) + # SkyPilot does not install MPI, as that requires NCCL, which needs to be + # installed manually. + sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev + # install nccl + pip install nvidia-nccl-cu12 + export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib + export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include + + git clone https://github.com/karpathy/llm.c.git || true + cd llm.c + ln -s ~/.cache/huggingface/fineweb10B dev/data/ + # compile llm.c (mixed precision, with cuDNN flash-attention) + # first compilation is ~1 minute, mostly due to cuDNN + make train_gpt2cu USE_CUDNN=1 + + +run: | + cd ~/llm.c + # train on multiple GPUs + mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \ + -i "dev/data/fineweb10B/fineweb_train_*.bin" \ + -j "dev/data/fineweb10B/fineweb_val_*.bin" \ + -o log124M \ + -e "d12" \ + -b 64 -t 1024 \ + -d 524288 \ + -r 1 \ + -z 1 \ + -c 0.1 \ + -l 0.0006 \ + -q 0.0 \ + -u 700 \ + -n 5000 \ + -v 250 -s 20000 \ + -h 1 + + # Upload the log and model to the bucket + rsync -Pavz log124M ~/.cache/huggingface diff --git a/llm/gpt-2/gpt2-train.yaml b/llm/gpt-2/gpt2-train.yaml new file mode 100644 index 00000000000..3a4e8c28d14 --- /dev/null +++ b/llm/gpt-2/gpt2-train.yaml @@ -0,0 +1,94 @@ +name: gpt2-train + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. + +resources: + accelerators: A100:8 + # Use a docker image with a recent g++ to enable compiling llm.c. + image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 + any_of: + # Avoid using the docker image for Lambda, because Docker is not supported + # on Lambda yet, but the base image works. 
- cloud: lambda + image_id: null + - cloud: aws + - cloud: gcp + - cloud: azure + - cloud: fluidstack + - cloud: kubernetes + +file_mounts: + ~/.cache/huggingface: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: COPY + +setup: | + cd ~ + + # install cudnn so we can use FlashAttention and run fast (optional) + # https://developer.nvidia.com/cudnn-downloads + # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04 + if [ -f ./CUDNN_INSTALLED ]; then + echo "cudnn already installed" + else + system=$(lsb_release -si | tr '[:upper:]' '[:lower:]') + # Get version and remove the dot + version=$(lsb_release -sr | tr -d .) + export system_version="${system}${version}" + wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb + sudo dpkg -i cudnn-installer.deb + sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/ + # Remove problematic kubernetes.list source + sudo apt-get update --allow-releaseinfo-change || true + + sudo apt-get -y install cudnn-cuda-12 + + touch ./CUDNN_INSTALLED + fi + + # "install" cudnn-frontend to ~/ + sudo apt -y install git + git clone https://github.com/NVIDIA/cudnn-frontend.git || true + + # install MPI (optional, if you intend to use multiple GPUs) + # SkyPilot does not install MPI, as that requires NCCL, which needs to be + # installed manually. + sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev + # install nccl + pip install nvidia-nccl-cu12 + export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib + export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include + + git clone https://github.com/karpathy/llm.c.git || true + cd llm.c + ln -s ~/.cache/huggingface/fineweb10B dev/data/ + # compile llm.c (mixed precision, with cuDNN flash-attention) + # first compilation is ~1 minute, mostly due to cuDNN + make train_gpt2cu USE_CUDNN=1 + + +run: | + cd ~/llm.c + # train on multiple GPUs + mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \ + -i "dev/data/fineweb10B/fineweb_train_*.bin" \ + -j "dev/data/fineweb10B/fineweb_val_*.bin" \ + -o log124M \ + -e "d12" \ + -b 64 -t 1024 \ + -d 524288 \ + -r 1 \ + -z 1 \ + -c 0.1 \ + -l 0.0006 \ + -q 0.0 \ + -u 700 \ + -n 5000 \ + -v 250 -s 20000 \ + -h 1 + + # Upload the log and model to the bucket + rsync -Pavz log124M ~/.cache/huggingface diff --git a/llm/gpt-2/gpt2.yaml b/llm/gpt-2/gpt2.yaml new file mode 100644 index 00000000000..8e203772128 --- /dev/null +++ b/llm/gpt-2/gpt2.yaml @@ -0,0 +1,95 @@ +name: train + +resources: + accelerators: A100:8 + # Use a docker image with a recent g++ to enable compiling llm.c. + image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 + any_of: + # Avoid using the docker image for Lambda, because Docker is not supported + # on Lambda yet, but the base image works. 
- cloud: lambda + image_id: null + - cloud: aws + - cloud: gcp + - cloud: azure + - cloud: fluidstack + - cloud: kubernetes + + +setup: | + cd ~ + pip install tqdm tiktoken requests datasets + + # Training dependencies + # install cudnn so we can use FlashAttention and run fast (optional) + # https://developer.nvidia.com/cudnn-downloads + # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04 + if [ -f ./CUDNN_INSTALLED ]; then + echo "cudnn already installed" + else + system=$(lsb_release -si | tr '[:upper:]' '[:lower:]') + # Get version and remove the dot + version=$(lsb_release -sr | tr -d .) + export system_version="${system}${version}" + wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb + sudo dpkg -i cudnn-installer.deb + sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/ + # Remove problematic kubernetes.list source + sudo apt-get update --allow-releaseinfo-change || true + + sudo apt-get -y install cudnn-cuda-12 + + touch ./CUDNN_INSTALLED + fi + + # "install" cudnn-frontend to ~/ + sudo apt -y install git + git clone https://github.com/NVIDIA/cudnn-frontend.git || true + + # install MPI (optional, if you intend to use multiple GPUs) + # SkyPilot does not install MPI, as that requires NCCL, which needs to be + # installed manually. + sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev + # install nccl + pip install nvidia-nccl-cu12 + export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib + export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include + + git clone https://github.com/karpathy/llm.c.git || true + cd llm.c + + # add revision to fix the dataset version, as the latest fineweb + # dataset removed the samples, causing error: + # Please pass `features` or at least one example when writing data + sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py + + # compile llm.c (mixed precision, with cuDNN flash-attention) + # first compilation is ~1 minute, mostly due to cuDNN + make train_gpt2cu USE_CUDNN=1 + + +run: | + cd ~/llm.c + # Processing data + # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?) + # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B + # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb + python dev/data/fineweb.py --version 10B + + # Start training on multiple GPUs + mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \ + -i "dev/data/fineweb10B/fineweb_train_*.bin" \ + -j "dev/data/fineweb10B/fineweb_val_*.bin" \ + -o log124M \ + -e "d12" \ + -b 64 -t 1024 \ + -d 524288 \ + -r 1 \ + -z 1 \ + -c 0.1 \ + -l 0.0006 \ + -q 0.0 \ + -u 700 \ + -n 5000 \ + -v 250 -s 20000 \ + -h 1 diff --git a/llm/llama-2/README.md b/llm/llama-2/README.md index 7b20ea4aed7..d8f8151572e 100644 --- a/llm/llama-2/README.md +++ b/llm/llama-2/README.md @@ -33,7 +33,7 @@ Fill the access token in the [chatbot-hf.yaml](https://github.com/skypilot-org/s ```yaml envs: MODEL_SIZE: 7 - HF_TOKEN: + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
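+  # (A read-access token created at https://huggingface.co/settings/tokens works here.)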
``` diff --git a/llm/llama-2/chatbot-hf.yaml b/llm/llama-2/chatbot-hf.yaml index 4c0132e4dd4..ee9d0281296 100644 --- a/llm/llama-2/chatbot-hf.yaml +++ b/llm/llama-2/chatbot-hf.yaml @@ -6,7 +6,7 @@ resources: envs: MODEL_SIZE: 7 - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. setup: | conda activate chatbot @@ -24,12 +24,13 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path meta-llama/Llama-2-${MODEL_SIZE}b-chat-hf \ --num-gpus $SKYPILOT_NUM_GPUS_PER_NODE 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/llama-2/chatbot-meta.yaml b/llm/llama-2/chatbot-meta.yaml index a0481fe760f..733a2a867d2 100644 --- a/llm/llama-2/chatbot-meta.yaml +++ b/llm/llama-2/chatbot-meta.yaml @@ -6,7 +6,7 @@ resources: envs: MODEL_SIZE: 7 - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. setup: | set -ex diff --git a/llm/llama-3/README.md b/llm/llama-3/README.md new file mode 100644 index 00000000000..d0c28dc93c6 --- /dev/null +++ b/llm/llama-3/README.md @@ -0,0 +1,354 @@ + +# Scale Serving Llama-3 on Any Cloud or Kubernetes with SkyPilot + + + + +

+*(image: Llama-3 x SkyPilot)* +

+ +[Llama-3](https://github.com/meta-llama/llama3) is the latest top open-source LLM from Meta. It has been released with a license that authorizes commercial use. You can deploy a private Llama-3 chatbot with SkyPilot in your own cloud with just one simple command. + +* [Llama-3 release](https://github.com/meta-llama/llama3) +* [Llama-3 blog](https://ai.meta.com/blog/meta-llama-3/) + + + + + +## Why use SkyPilot vs. commercial hosted solutions? + +* No lock-in: run on any supported cloud - AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI +* Everything stays in your cloud account (your VMs & buckets) +* No one else sees your chat history +* Pay absolute minimum — no managed solution markups +* Freely choose your own model size, GPU type, number of GPUs, etc., based on scale and budget. + +…and you get all of this with 1 click — let SkyPilot automate the infra. + + +## Prerequisites + +- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-70B-Instruct`. +- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)). +- Check that `sky check` shows clouds or Kubernetes are enabled. + +## SkyPilot YAML + 
+Click to see the full recipe YAML + +```yaml + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct + # MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + +service: + replicas: 2 + # An actual request for readiness probe. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? + max_tokens: 1 + +resources: + accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8} + # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. + cpus: 32+ + use_spot: True + disk_size: 512 # Ensure model checkpoints can fit. + disk_tier: best + ports: 8081 # Expose to internet traffic. + +setup: | + conda activate vllm + if [ $? -ne 0 ]; then + conda create -n vllm python=3.10 -y + conda activate vllm + fi + + pip install vllm==0.4.2 + # Install Gradio for web UI. + pip install gradio openai + pip install flash-attn==2.5.9.post1 + + +run: | + conda activate vllm + echo 'Starting vllm api server...' + + # https://github.com/vllm-project/vllm/issues/3098 + export PATH=$PATH:/sbin + + # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes. + python -u -m vllm.entrypoints.openai.api_server \ + --port 8081 \ + --model $MODEL_NAME \ + --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --gpu-memory-utilization 0.95 \ + --max-num-seqs 64 \ + 2>&1 | tee api_server.log & + + while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do + echo 'Waiting for vllm api server to start...' + sleep 5 + done + + echo 'Starting gradio server...' + git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://localhost:8081/v1 \ + --stop-token-ids 128009,128001 + +``` + +
+ +You can also get the full YAML file [here](https://github.com/skypilot-org/skypilot/blob/master/llm/llama-3/llama3.yaml). + +## Serving Llama-3: single instance + +Launch a single spot instance to serve Llama-3 on your infra: +```console +HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN +``` + +
+Example outputs: + +```console +... +I 04-18 16:31:30 optimizer.py:693] == Optimizer == +I 04-18 16:31:30 optimizer.py:704] Target: minimizing cost +I 04-18 16:31:30 optimizer.py:716] Estimated cost: $1.2 / hour +I 04-18 16:31:30 optimizer.py:716] +I 04-18 16:31:30 optimizer.py:839] Considered resources (1 node): +I 04-18 16:31:30 optimizer.py:909] ----------------------------------------------------------------------------------------------------------------- +I 04-18 16:31:30 optimizer.py:909] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN +I 04-18 16:31:30 optimizer.py:909] ----------------------------------------------------------------------------------------------------------------- +I 04-18 16:31:30 optimizer.py:909] Azure Standard_NC48ads_A100_v4[Spot] 48 440 A100-80GB:2 eastus 1.22 ✔ +I 04-18 16:31:30 optimizer.py:909] AWS g6.48xlarge[Spot] 192 768 L4:8 us-east-1b 1.43 +I 04-18 16:31:30 optimizer.py:909] Azure Standard_NC96ads_A100_v4[Spot] 96 880 A100-80GB:4 eastus 2.44 +I 04-18 16:31:30 optimizer.py:909] AWS g5.48xlarge[Spot] 192 768 A10G:8 us-east-2b 2.45 +I 04-18 16:31:30 optimizer.py:909] GCP g2-standard-96[Spot] 96 384 L4:8 asia-east1-a 2.49 +I 04-18 16:31:30 optimizer.py:909] Azure Standard_ND96asr_v4[Spot] 96 900 A100:8 eastus 4.82 +I 04-18 16:31:30 optimizer.py:909] GCP a2-highgpu-4g[Spot] 48 340 A100:4 europe-west4-a 4.82 +I 04-18 16:31:30 optimizer.py:909] AWS p4d.24xlarge[Spot] 96 1152 A100:8 us-east-2b 4.90 +I 04-18 16:31:30 optimizer.py:909] Azure Standard_ND96amsr_A100_v4[Spot] 96 1924 A100-80GB:8 southcentralus 5.17 +I 04-18 16:31:30 optimizer.py:909] GCP a2-ultragpu-4g[Spot] 48 680 A100-80GB:4 us-east4-c 7.39 +I 04-18 16:31:30 optimizer.py:909] GCP a2-highgpu-8g[Spot] 96 680 A100:8 europe-west4-a 9.65 +I 04-18 16:31:30 optimizer.py:909] GCP a2-ultragpu-8g[Spot] 96 1360 A100-80GB:8 us-east4-c 14.79 +I 04-18 16:31:30 optimizer.py:909] ----------------------------------------------------------------------------------------------------------------- +I 04-18 16:31:30 optimizer.py:909] +... +``` + +
+ +To run on Kubernetes or use an on-demand instance, pass `--no-use-spot` to the above command. + +
+Example outputs with Kubernetes / on-demand instances: + +```console +$ HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN --no-use-spot +... +I 04-18 16:34:13 optimizer.py:693] == Optimizer == +I 04-18 16:34:13 optimizer.py:704] Target: minimizing cost +I 04-18 16:34:13 optimizer.py:716] Estimated cost: $5.0 / hour +I 04-18 16:34:13 optimizer.py:716] +I 04-18 16:34:13 optimizer.py:839] Considered resources (1 node): +I 04-18 16:34:13 optimizer.py:909] ------------------------------------------------------------------------------------------------------------------ +I 04-18 16:34:13 optimizer.py:909] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN +I 04-18 16:34:13 optimizer.py:909] ------------------------------------------------------------------------------------------------------------------ +I 04-18 16:34:13 optimizer.py:909] Kubernetes 32CPU--512GB--8A100 32 512 A100:8 kubernetes 0.00 ✔ +I 04-18 16:34:13 optimizer.py:909] Fluidstack recE2ZDQmqR9HBKYs5xSnjtPw 64 240 A100-80GB:2 generic_1_canada 4.96 +I 04-18 16:34:13 optimizer.py:909] Fluidstack recUiB2e6s3XDxwE9 60 440 A100:4 calgary_1_canada 5.88 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_NC48ads_A100_v4 48 440 A100-80GB:2 eastus 7.35 +I 04-18 16:34:13 optimizer.py:909] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 +I 04-18 16:34:13 optimizer.py:909] Fluidstack recWGm4oJ9AB3XVPxzRaujgbx 126 480 A100-80GB:4 generic_1_canada 9.89 +I 04-18 16:34:13 optimizer.py:909] Paperspace A100-80Gx4 46 320 A100-80GB:4 East Coast (NY2) 12.72 +I 04-18 16:34:13 optimizer.py:909] AWS g6.48xlarge 192 768 L4:8 us-east-1 13.35 +I 04-18 16:34:13 optimizer.py:909] GCP a2-highgpu-4g 48 340 A100:4 us-central1-a 14.69 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69 +I 04-18 16:34:13 optimizer.py:909] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 +I 04-18 16:34:13 optimizer.py:909] Fluidstack recUYj6oGJCvAvCXC7KQo5Fc7 252 960 A100-80GB:8 generic_1_canada 19.79 +I 04-18 16:34:13 optimizer.py:909] GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11 +I 04-18 16:34:13 optimizer.py:909] Paperspace A100-80Gx8 96 640 A100-80GB:8 East Coast (NY2) 25.44 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20 +I 04-18 16:34:13 optimizer.py:909] GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77 +I 04-18 16:34:13 optimizer.py:909] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 +I 04-18 16:34:13 optimizer.py:909] GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22 +I 04-18 16:34:13 optimizer.py:909] AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97 +I 04-18 16:34:13 optimizer.py:909] ------------------------------------------------------------------------------------------------------------------ +... +``` + +
+ +Wait until the model is ready (this can take 10+ minutes), as indicated by these lines: +```console +... +(task, pid=17433) Waiting for vllm api server to start... +... +(task, pid=17433) INFO: Started server process [20621] +(task, pid=17433) INFO: Waiting for application startup. +(task, pid=17433) INFO: Application startup complete. +(task, pid=17433) INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit) +... +(task, pid=17433) Running on local URL: http://127.0.0.1:8811 +(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live +... +(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0% +``` +🎉 **Congratulations!** 🎉 You have now launched the Llama-3 Instruct LLM on your infra. + +You can play with the model via +- Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`) +- Gradio UI (automatically launched) + +To curl `/v1/chat/completions`: +```console +ENDPOINT=$(sky status --endpoint 8081 llama3) + +# We need to manually specify the stop_token_ids to make sure the model finishes +# on <|eot_id|>. +curl http://$ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-70B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ], + "stop_token_ids": [128009, 128001] + }' +``` + +To use the Gradio UI, open the URL shown in the logs: +```console +(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live +``` + 
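+To extract just the assistant's reply from the JSON response, one option is to pipe through `jq` (assuming it is installed; this reuses `$ENDPOINT` from above):
+
+```console
+curl http://$ENDPOINT/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+        "messages": [{"role": "user", "content": "Who are you?"}],
+        "stop_token_ids": [128009, 128001]
+      }' | jq -r '.choices[0].message.content'
+```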

+*(image: Gradio UI serving Llama-3)* +

+ +To stop the instance: +```console +sky stop llama3 +``` + +To shut down all resources: +```console +sky down llama3 +``` + +**Note**: If you would like to try the 8B model, you can use the following accelerators: +```yaml +resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} +``` + +## Serving Llama-3: scaling up with SkyServe + +After playing with the model, you can deploy the model with autoscaling and load-balancing using SkyServe. + +With no change to the YAML, launch a fully managed service on your infra: +```console +HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN +``` + +Wait until the service is ready: +```console +watch -n10 sky serve status llama3 +``` + +
+Example outputs: + +```console +Services +NAME VERSION UPTIME STATUS REPLICAS ENDPOINT +llama3 1 35s READY 2/2 xx.yy.zz.100:30001 + +Service Replicas +SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION +llama3 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4 +llama3 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4 +``` +
+ + +Get a single endpoint that load-balances across replicas: +```console +ENDPOINT=$(sky serve status --endpoint llama3) +``` + +> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs. + +To curl the endpoint: +```console +curl $ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-70B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ] + }' +``` + +To shut down all resources: +```console +sky serve down llama3 +``` + +See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html). + + +### **Optional**: Connect a GUI to your Llama-3 endpoint + + + +It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas. + +1. Start the chat web UI: +```bash +sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint llama3) +``` + +2. Then, we can access the GUI at the returned gradio link: +``` +| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live +``` + + + +## Finetuning Llama-3 + +You can finetune Llama-3 on your own data. We have a tutorial for finetuning Llama-2 into Vicuna on SkyPilot, which can be adapted for Llama-3. You can find the tutorial [here](https://skypilot.readthedocs.io/en/latest/gallery/tutorials/finetuning.html) and a detailed blog post [here](https://blog.skypilot.co/finetuning-llama2-operational-guide/). diff --git a/llm/llama-3/gui.yaml b/llm/llama-3/gui.yaml new file mode 100644 index 00000000000..a97d9f26a62 --- /dev/null +++ b/llm/llama-3/gui.yaml @@ -0,0 +1,46 @@ +# Starts a GUI server that connects to the Llama-3 OpenAI API server. +# +# This works with the llama3.yaml, please refer to llm/llama-3/README.md +# for more details. +# +# Usage: +# +# 1. If you have an endpoint started on a cluster (sky launch): +# `sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky status --endpoint 8081 llama3)` +# 2. If you have a SkyPilot Service started (sky serve up) called llama3: +# `sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint llama3)` +# +# After the GUI server is started, you will see a gradio link in the output and +# you can click on it to open the GUI. + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct + ENDPOINT: x.x.x.x:3031 # Address of the API server running llama3. + +resources: + cpus: 2 + +setup: | + conda activate llama3 + if [ $? -ne 0 ]; then + conda create -n llama3 python=3.10 -y + conda activate llama3 + fi + + # Install Gradio for web UI. + pip install gradio openai + +run: | + conda activate llama3 + export PATH=$PATH:/sbin + WORKER_IP=$(hostname -I | cut -d' ' -f1) + CONTROLLER_PORT=21001 + WORKER_PORT=21002 + + echo 'Starting gradio server...' 
+  git clone https://github.com/vllm-project/vllm.git || true
+  python vllm/examples/gradio_openai_chatbot_webserver.py \
+    -m $MODEL_NAME \
+    --port 8811 \
+    --model-url http://$ENDPOINT/v1 \
+    --stop-token-ids 128009,128001 | tee ~/gradio.log
diff --git a/llm/llama-3/llama3.yaml b/llm/llama-3/llama3.yaml
new file mode 100644
index 00000000000..ff9dc4967ac
--- /dev/null
+++ b/llm/llama-3/llama3.yaml
@@ -0,0 +1,126 @@
+# Serving Meta Llama-3 on your own infra.
+#
+# Usage:
+#
+#  HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#  ENDPOINT=$(sky status --endpoint 8081 llama3)
+#
+#  # We need to manually specify the stop_token_ids to make sure the model
+#  # finishes on <|eot_id|>.
+#  curl http://$ENDPOINT/v1/chat/completions \
+#    -H "Content-Type: application/json" \
+#    -d '{
+#      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+#      "messages": [
+#        {
+#          "role": "system",
+#          "content": "You are a helpful assistant."
+#        },
+#        {
+#          "role": "user",
+#          "content": "Who are you?"
+#        }
+#      ],
+#      "stop_token_ids": [128009, 128001]
+#    }'
+#
+# Chat with the model via the Gradio UI:
+#
+#   Running on local URL:  http://127.0.0.1:8811
+#   Running on public URL: https://.gradio.live
+#
+# Scale up with SkyServe:
+#  HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#  ENDPOINT=$(sky serve status --endpoint llama3)
+#  curl -L $ENDPOINT/v1/models
+#  curl -L http://$ENDPOINT/v1/chat/completions \
+#    -H "Content-Type: application/json" \
+#    -d '{
+#      "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+#      "messages": [
+#        {
+#          "role": "system",
+#          "content": "You are a helpful assistant."
+#        },
+#        {
+#          "role": "user",
+#          "content": "Who are you?"
+#        }
+#      ]
+#    }'
+
+
+envs:
+  MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
+  # MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+
+service:
+  replicas: 2
+  # An actual request for readiness probe.
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: $MODEL_NAME
+      messages:
+        - role: user
+          content: Hello! What is your name?
+      max_tokens: 1
+
+resources:
+  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+  cpus: 32+
+  use_spot: True
+  disk_size: 512  # Ensure model checkpoints can fit.
+  disk_tier: best
+  ports: 8081  # Expose to internet traffic.
+
+setup: |
+  conda activate vllm
+  if [ $? -ne 0 ]; then
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+  fi
+
+  pip install vllm==0.4.2
+  # Install Gradio for web UI.
+  pip install gradio openai
+  pip install flash-attn==2.5.9.post1
+
+
+run: |
+  conda activate vllm
+  echo 'Starting vllm api server...'
+
+  # https://github.com/vllm-project/vllm/issues/3098
+  export PATH=$PATH:/sbin
+
+  # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
+  python -u -m vllm.entrypoints.openai.api_server \
+    --port 8081 \
+    --model $MODEL_NAME \
+    --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --gpu-memory-utilization 0.95 \
+    --max-num-seqs 64 \
+    2>&1 | tee api_server.log &
+
+  while ! grep -q 'Uvicorn running on' api_server.log; do
+    echo 'Waiting for vllm api server to start...'
+    sleep 5
+  done
+
+  echo 'Starting gradio server...'
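+  # vLLM is cloned below only for its example Gradio chat client. The stop
+  # token ids 128009 (<|eot_id|>) and 128001 (<|end_of_text|>) are Llama-3's
+  # terminators, matching the stop_token_ids in the curl example above.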
+ git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://localhost:8081/v1 \ + --stop-token-ids 128009,128001 + diff --git a/llm/mixtral/README.md b/llm/mixtral/README.md index 208b40ca14b..0bddb77c665 100644 --- a/llm/mixtral/README.md +++ b/llm/mixtral/README.md @@ -53,7 +53,7 @@ We can now access the model through the OpenAI API with the IP and port: ```bash IP=$(sky status --ip mixtral) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -66,7 +66,7 @@ Chat API is also supported: ```bash IP=$(sky status --ip mixtral) -curl -L http://$IP:8000/v1/chat/completions \ +curl http://$IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -119,7 +119,7 @@ After the `sky serve up` command, there will be a single endpoint for the servic ```bash ENDPOINT=$(sky serve status --endpoint mixtral) -curl -L http://$ENDPOINT/v1/completions \ +curl http://$ENDPOINT/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -132,7 +132,7 @@ Chat API is also supported: ```bash ENDPOINT=$(sky serve status --endpoint mixtral) -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", diff --git a/llm/qwen/README.md b/llm/qwen/README.md index dac1852ff79..6a76af71287 100644 --- a/llm/qwen/README.md +++ b/llm/qwen/README.md @@ -1,8 +1,15 @@ -# Serving Qwen1.5 on Your Own Cloud +# Serving Qwen2 on Your Own Cloud -[Qwen1.5](https://github.com/QwenLM/Qwen1.5) is one of the top open LLMs. -As of Feb 2024, Qwen1.5-72B-Chat is ranked higher than Mixtral-8x7b-Instruct-v0.1 on the LMSYS Chatbot Arena Leaderboard. +[Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs. +As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard). +📰 **Update (26 April 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) to serve the 110B model. + +📰 **Update (6 Jun 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5. + +

+ qwen +

## References * [Qwen docs](https://qwen.readthedocs.io/en/latest/) @@ -20,19 +27,19 @@ As of Feb 2024, Qwen1.5-72B-Chat is ranked higher than Mixtral-8x7b-Instruct-v0. After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Qwen model on vLLM with SkyPilot in 1-click: -1. Start serving Qwen 72B on a single instance with any available GPU in the list specified in [serve-72b.yaml](serve-72b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [serve-7b.yaml](serve-7b.yaml) for a smaller model): +1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [serve-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml) or [serve-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-7b.yaml) for a smaller model): ```console -sky launch -c qwen serve-72b.yaml +sky launch -c qwen serve-110b.yaml ``` 2. Send a request to the endpoint for completion: ```bash IP=$(sky status --ip qwen) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen1.5-72B-Chat", + "model": "Qwen/Qwen1.5-110B-Chat", "prompt": "My favorite food is", "max_tokens": 512 }' | jq -r '.choices[0].text' @@ -40,10 +47,10 @@ curl -L http://$IP:8000/v1/completions \ 3. Send a request for chat completion: ```bash -curl -L http://$IP:8000/v1/chat/completions \ +curl http://$IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen1.5-72B-Chat", + "model": "Qwen/Qwen1.5-110B-Chat", "messages": [ { "role": "system", @@ -87,14 +94,14 @@ As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the availability of the service while minimizing the cost. -3. To access the model, we use a `curl -L` command (`-L` to follow redirect) to send the request to the endpoint: +3. To access the model, we use a `curl` command to send the request to the endpoint: ```bash ENDPOINT=$(sky serve status --endpoint qwen) -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen1.5-72B-Chat", + "model": "Qwen/Qwen2-72B-Instruct", "messages": [ { "role": "system", @@ -110,13 +117,13 @@ curl -L http://$ENDPOINT/v1/chat/completions \ ``` -## **Optional:** Accessing Code Llama with Chat GUI +## **Optional:** Accessing Qwen with Chat GUI -It is also possible to access the Code Llama service with a GUI using [FastChat](https://github.com/lm-sys/FastChat). +It is also possible to access the Qwen service with a GUI using [vLLM](https://github.com/vllm-project/vllm). -1. Start the chat web UI: +1. Start the chat web UI (change the `--env` flag to the model you are running): ```bash -sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen) +sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen) ``` 2. 
Then, we can access the GUI at the returned gradio link: @@ -124,5 +131,3 @@ sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint q | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live ``` -Note that you may get better results to use a higher temperature and top_p value. - diff --git a/llm/qwen/gui.yaml b/llm/qwen/gui.yaml index 86e2fe27220..bf3c01abf76 100644 --- a/llm/qwen/gui.yaml +++ b/llm/qwen/gui.yaml @@ -13,7 +13,8 @@ # you can click on it to open the GUI. envs: - ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen. + ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen. + MODEL_NAME: Qwen/Qwen1.5-72B-Chat resources: cpus: 2 @@ -25,8 +26,8 @@ setup: | conda activate qwen fi - pip install "fschat[model_worker,webui]" - pip install "openai<1" + # Install Gradio for web UI. + pip install gradio openai run: | conda activate qwen @@ -35,19 +36,9 @@ run: | CONTROLLER_PORT=21001 WORKER_PORT=21002 - cat < ~/model_info.json - { - "Qwen/Qwen1.5-72B-Chat": { - "model_name": "Qwen/Qwen1.5-72B-Chat", - "api_base": "http://${ENDPOINT}/v1", - "api_key": "empty", - "model_path": "Qwen/Qwen1.5-72B-Chat" - } - } - EOF - - python3 -m fastchat.serve.controller --host 0.0.0.0 --port ${CONTROLLER_PORT} > ~/controller.log 2>&1 & - echo 'Starting gradio server...' - python -u -m fastchat.serve.gradio_web_server --share \ - --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log + git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://$ENDPOINT/v1 | tee ~/gradio.log diff --git a/llm/qwen/serve-110b.yaml b/llm/qwen/serve-110b.yaml new file mode 100644 index 00000000000..1e98bd254e9 --- /dev/null +++ b/llm/qwen/serve-110b.yaml @@ -0,0 +1,43 @@ +envs: + MODEL_NAME: Qwen/Qwen1.5-110B-Chat + +service: + # Specifying the path to the endpoint to check the readiness of the replicas. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? + max_tokens: 1 + initial_delay_seconds: 1200 + # How many replicas to manage. + replicas: 2 + + +resources: + accelerators: {A100:8, A100-80GB:4, A100-80GB:8} + disk_size: 1024 + disk_tier: best + memory: 32+ + ports: 8000 + +setup: | + conda activate qwen + if [ $? -ne 0 ]; then + conda create -n qwen python=3.10 -y + conda activate qwen + fi + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 + +run: | + conda activate qwen + export PATH=$PATH:/sbin + python -u -m vllm.entrypoints.openai.api_server \ + --host 0.0.0.0 \ + --model $MODEL_NAME \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-num-seqs 16 | tee ~/openai_api_server.log + diff --git a/llm/qwen/serve-72b.yaml b/llm/qwen/serve-72b.yaml index 86248011bbf..34e3e348f2f 100644 --- a/llm/qwen/serve-72b.yaml +++ b/llm/qwen/serve-72b.yaml @@ -1,5 +1,5 @@ envs: - MODEL_NAME: Qwen/Qwen1.5-72B-Chat + MODEL_NAME: Qwen/Qwen2-72B-Instruct service: # Specifying the path to the endpoint to check the readiness of the replicas. 
@@ -29,8 +29,8 @@ setup: | conda create -n qwen python=3.10 -y conda activate qwen fi - pip install -U vllm==0.3.2 - pip install -U transformers==4.38.0 + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 run: | conda activate qwen diff --git a/llm/qwen/serve-7b.yaml b/llm/qwen/serve-7b.yaml index a1ec7ee3f2b..f33adcdd2cd 100644 --- a/llm/qwen/serve-7b.yaml +++ b/llm/qwen/serve-7b.yaml @@ -1,5 +1,5 @@ envs: - MODEL_NAME: Qwen/Qwen1.5-7B-Chat + MODEL_NAME: Qwen/Qwen2-7B-Instruct service: # Specifying the path to the endpoint to check the readiness of the replicas. @@ -27,8 +27,8 @@ setup: | conda create -n qwen python=3.10 -y conda activate qwen fi - pip install -U vllm==0.3.2 - pip install -U transformers==4.38.0 + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 run: | conda activate qwen diff --git a/llm/sglang/README.md b/llm/sglang/README.md index 3ffcc2f484b..fc79529148a 100644 --- a/llm/sglang/README.md +++ b/llm/sglang/README.md @@ -68,7 +68,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llava) ```bash -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "liuhaotian/llava-v1.6-vicuna-7b", @@ -149,7 +149,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llama2) 4. Once it status is `READY`, you can use the endpoint to interact with the model: ```bash -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-2-7b-chat-hf", diff --git a/llm/sglang/llama2.yaml b/llm/sglang/llama2.yaml index 08427ab2001..8b58c4365d6 100644 --- a/llm/sglang/llama2.yaml +++ b/llm/sglang/llama2.yaml @@ -6,7 +6,7 @@ service: envs: MODEL_NAME: meta-llama/Llama-2-7b-chat-hf - HF_TOKEN: # Change to your own huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. resources: accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1} diff --git a/llm/tgi/README.md b/llm/tgi/README.md index 8fb68222d68..8c8360d0465 100644 --- a/llm/tgi/README.md +++ b/llm/tgi/README.md @@ -17,7 +17,7 @@ A user can access the model with the following command: ```bash ENDPOINT=$(sky status --endpoint 8080 tgi) -curl -L $(sky serve status tgi --endpoint)/generate \ +curl $(sky serve status tgi --endpoint)/generate \ -H 'Content-Type: application/json' \ -d '{ "inputs": "What is Deep Learning?", @@ -51,7 +51,7 @@ After the service is launched, we can access the model with the following comman ```bash ENDPOINT=$(sky serve status --endpoint tgi) -curl -L $ENDPOINT/generate \ +curl $ENDPOINT/generate \ -H 'Content-Type: application/json' \ -d '{ "inputs": "What is Deep Learning?", diff --git a/llm/vicuna-llama-2/README.md b/llm/vicuna-llama-2/README.md index 0fc5da6c4ba..899792c299d 100644 --- a/llm/vicuna-llama-2/README.md +++ b/llm/vicuna-llama-2/README.md @@ -31,7 +31,7 @@ cd skypilot/llm/vicuna-llama-2 Paste the access token into [train.yaml](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna-llama-2/train.yaml): ```yaml envs: - HF_TOKEN: # Change to your own huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. ``` ## Train your own Vicuna on Llama-2 diff --git a/llm/vicuna-llama-2/serve.yaml b/llm/vicuna-llama-2/serve.yaml index 0a98dab5d26..69f89f2fc28 100644 --- a/llm/vicuna-llama-2/serve.yaml +++ b/llm/vicuna-llama-2/serve.yaml @@ -27,11 +27,12 @@ run: | conda activate chatbot echo 'Starting controller...' 
- python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path /skypilot-vicuna 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/vicuna-llama-2/train.yaml b/llm/vicuna-llama-2/train.yaml index e23d5797e76..8d35c2dff85 100644 --- a/llm/vicuna-llama-2/train.yaml +++ b/llm/vicuna-llama-2/train.yaml @@ -1,7 +1,7 @@ envs: - HF_TOKEN: # Change to your own huggingface token - ARTIFACT_BUCKET_NAME: YOUR_OWN_BUCKET_NAME # Change to your own bucket name - WANDB_API_KEY: "" # Change to your own wandb api key + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + ARTIFACT_BUCKET_NAME: # TODO: Fill with your unique bucket name, or use --env to pass. + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. MODEL_SIZE: 7 USE_XFORMERS: 1 diff --git a/llm/vicuna/serve-openai-api-endpoint.yaml b/llm/vicuna/serve-openai-api-endpoint.yaml index 247043ee3c2..639dfadc6d6 100644 --- a/llm/vicuna/serve-openai-api-endpoint.yaml +++ b/llm/vicuna/serve-openai-api-endpoint.yaml @@ -19,11 +19,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/vicuna/serve.yaml b/llm/vicuna/serve.yaml index d458112a42f..49185fcea20 100644 --- a/llm/vicuna/serve.yaml +++ b/llm/vicuna/serve.yaml @@ -19,11 +19,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/vicuna/train.yaml b/llm/vicuna/train.yaml index c577561e858..a2121aaf8fd 100644 --- a/llm/vicuna/train.yaml +++ b/llm/vicuna/train.yaml @@ -1,3 +1,10 @@ +envs: + MODEL_SIZE: 7 + SEQ_LEN: 2048 + GC_SCALE: 4 + USE_FLASH_ATTN: 0 + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. + resources: accelerators: A100-80GB:8 disk_size: 1000 @@ -109,10 +116,3 @@ run: | gsutil -m rsync -r -x 'checkpoint-*' $LOCAL_CKPT_PATH/ $CKPT_PATH/ exit $returncode - -envs: - MODEL_SIZE: 7 - SEQ_LEN: 2048 - GC_SCALE: 4 - USE_FLASH_ATTN: 0 - WANDB_API_KEY: "" diff --git a/llm/vllm/README.md b/llm/vllm/README.md index 568b8ff70bd..61932cd8571 100644 --- a/llm/vllm/README.md +++ b/llm/vllm/README.md @@ -154,7 +154,7 @@ ENDPOINT=$(sky serve status --endpoint vllm-llama2) 4. Once it status is `READY`, you can use the endpoint to interact with the model: ```bash -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-2-7b-chat-hf", @@ -171,7 +171,7 @@ curl -L $ENDPOINT/v1/chat/completions \ }' ``` -Notice that it is the same with previously curl command, except for thr `-L` argument. 
You should get a similar response as the following:
+Notice that it is the same as the previous curl command. You should get a similar response as the following:

```console
{
diff --git a/llm/vllm/serve-openai-api.yaml b/llm/vllm/serve-openai-api.yaml
index 9ddf7b280ba..a68f476edc7 100644
--- a/llm/vllm/serve-openai-api.yaml
+++ b/llm/vllm/serve-openai-api.yaml
@@ -1,6 +1,6 @@
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
-  HF_TOKEN: # Change to your own huggingface token
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
diff --git a/llm/vllm/service-with-auth.yaml b/llm/vllm/service-with-auth.yaml
new file mode 100644
index 00000000000..0a40df29293
--- /dev/null
+++ b/llm/vllm/service-with-auth.yaml
@@ -0,0 +1,42 @@
+# service.yaml
+# The newly-added `service` section to the `serve-openai-api.yaml` file.
+service:
+  # Specifying the path to the endpoint to check the readiness of the service.
+  readiness_probe:
+    path: /v1/models
+    # Set authorization headers here if needed.
+    headers:
+      Authorization: Bearer $AUTH_TOKEN
+  # How many replicas to manage.
+  replicas: 1
+
+# Fields below are the same as in `serve-openai-api.yaml`.
+envs:
+  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+  AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass.
+
+resources:
+  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
+  ports: 8000
+
+setup: |
+  conda activate vllm
+  if [ $? -ne 0 ]; then
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+  fi
+
+  pip install transformers==4.38.0
+  pip install vllm==0.3.2
+
+  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"
+
+
+run: |
+  conda activate vllm
+  echo 'Starting vllm openai api server...'
+  python -m vllm.entrypoints.openai.api_server \
+    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
+    --host 0.0.0.0 --port 8000 --api-key $AUTH_TOKEN
+
diff --git a/llm/vllm/service.yaml b/llm/vllm/service.yaml
index 335f8a50650..1e5d92a60e5 100644
--- a/llm/vllm/service.yaml
+++ b/llm/vllm/service.yaml
@@ -9,7 +9,7 @@ service:
# Fields below are the same with `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
-  HF_TOKEN: # Change to your own huggingface token
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
diff --git a/sky/__init__.py b/sky/__init__.py
index d25c8297ea5..a077fb8966a 100644
--- a/sky/__init__.py
+++ b/sky/__init__.py
@@ -102,16 +102,16 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]):
from sky.data import StoreType
from sky.execution import exec  # pylint: disable=redefined-builtin
from sky.execution import launch
+# TODO (zhwu): These imports are for backward compatibility, and spot APIs
+# should be called with `sky.jobs.xxx` instead. Remove in release 0.8.0
+from sky.jobs.core import spot_cancel
+from sky.jobs.core import spot_launch
+from sky.jobs.core import spot_queue
+from sky.jobs.core import spot_tail_logs
from sky.optimizer import Optimizer
from sky.optimizer import OptimizeTarget
from sky.resources import Resources
from sky.skylet.job_lib import JobStatus
-# TODO (zhwu): These imports are for backward compatibility, and spot APIs
-# should be called with `sky.spot.xxx` instead.
Remove in release 0.7.0 -from sky.spot.core import spot_cancel -from sky.spot.core import spot_launch -from sky.spot.core import spot_queue -from sky.spot.core import spot_tail_logs from sky.status_lib import ClusterStatus from sky.task import Task diff --git a/sky/adaptors/cloudflare.py b/sky/adaptors/cloudflare.py index 2a49dc6fff0..864248614f3 100644 --- a/sky/adaptors/cloudflare.py +++ b/sky/adaptors/cloudflare.py @@ -23,6 +23,7 @@ R2_PROFILE_NAME = 'r2' _INDENT_PREFIX = ' ' NAME = 'Cloudflare' +SKY_CHECK_NAME = 'Cloudflare (for R2 object store)' @contextlib.contextmanager diff --git a/sky/adaptors/gcp.py b/sky/adaptors/gcp.py index 6465709d42c..9f63bec87ee 100644 --- a/sky/adaptors/gcp.py +++ b/sky/adaptors/gcp.py @@ -21,8 +21,9 @@ def build(service_name: str, version: str, *args, **kwargs): service_name: GCP service name (e.g., 'compute', 'storagetransfer'). version: Service version (e.g., 'v1'). """ - from googleapiclient import discovery - return discovery.build(service_name, version, *args, **kwargs) + + return googleapiclient.discovery.build(service_name, version, *args, + **kwargs) @common.load_lazy_modules(_LAZY_MODULES) diff --git a/sky/adaptors/kubernetes.py b/sky/adaptors/kubernetes.py index c6b4e8bce0b..7cdb3ff3059 100644 --- a/sky/adaptors/kubernetes.py +++ b/sky/adaptors/kubernetes.py @@ -21,6 +21,8 @@ _networking_api = None _custom_objects_api = None _node_api = None +_apps_api = None +_api_client = None # Timeout to use for API calls API_TIMEOUT = 5 @@ -54,9 +56,9 @@ def _load_config(): ' If you were running a local Kubernetes ' 'cluster, run `sky local up` to start the cluster.') else: - err_str = ( - 'Failed to load Kubernetes configuration. ' - f'Please check if your kubeconfig file is valid.{suffix}') + err_str = ('Failed to load Kubernetes configuration. ' + 'Please check if your kubeconfig file exists at ' + f'~/.kube/config and is valid.{suffix}') err_str += '\nTo disable Kubernetes for SkyPilot: run `sky check`.' with ux_utils.print_exception_no_traceback(): raise ValueError(err_str) from None @@ -108,6 +110,24 @@ def node_api(): return _node_api +def apps_api(): + global _apps_api + if _apps_api is None: + _load_config() + _apps_api = kubernetes.client.AppsV1Api() + + return _apps_api + + +def api_client(): + global _api_client + if _api_client is None: + _load_config() + _api_client = kubernetes.client.ApiClient() + + return _api_client + + def api_exception(): return kubernetes.client.rest.ApiException diff --git a/sky/authentication.py b/sky/authentication.py index 7b5699d3337..966dad670c5 100644 --- a/sky/authentication.py +++ b/sky/authentication.py @@ -15,7 +15,7 @@ The local machine's public key should not be uploaded to the `~/.ssh/sky-key.pub` on the remote VM, because it will cause private/public key pair mismatch when the user tries to launch new VM from that remote VM -using SkyPilot, e.g., the node is used as a spot controller. (Lambda cloud +using SkyPilot, e.g., the node is used as a jobs controller. (Lambda cloud is an exception, due to the limitation of the cloud provider. See the comments in setup_lambda_authentication) """ @@ -408,7 +408,7 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]: # Add the user's public key to the SkyPilot cluster. 
public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH) secret_name = clouds.Kubernetes.SKY_SSH_KEY_SECRET_NAME - secret_field_name = clouds.Kubernetes.SKY_SSH_KEY_SECRET_FIELD_NAME + secret_field_name = clouds.Kubernetes().ssh_key_secret_field_name namespace = kubernetes_utils.get_current_kube_config_context_namespace() k8s = kubernetes.kubernetes with open(public_key_path, 'r', encoding='utf-8') as f: diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py index 0da2bd9ef0b..03f644930f4 100644 --- a/sky/backends/backend_utils.py +++ b/sky/backends/backend_utils.py @@ -1,6 +1,7 @@ """Util constants/functions for the backends.""" from datetime import datetime import enum +import fnmatch import functools import os import pathlib @@ -47,7 +48,9 @@ from sky.utils import common_utils from sky.utils import controller_utils from sky.utils import env_options +from sky.utils import resources_utils from sky.utils import rich_utils +from sky.utils import schemas from sky.utils import subprocess_utils from sky.utils import timeline from sky.utils import ux_utils @@ -108,6 +111,9 @@ # Remote dir that holds our runtime files. _REMOTE_RUNTIME_FILES_DIR = '~/.sky/.runtime_files' +_ENDPOINTS_RETRY_MESSAGE = ('If the cluster was recently started, ' + 'please retry after a while.') + # Include the fields that will be used for generating tags that distinguishes # the cluster in ray, to avoid the stopped cluster being discarded due to # updates in the yaml template. @@ -327,7 +333,7 @@ def wrap_file_mount(cls, path: str) -> str: def make_safe_symlink_command(cls, *, source: str, target: str) -> str: """Returns a command that safely symlinks 'source' to 'target'. - All intermediate directories of 'source' will be owned by $USER, + All intermediate directories of 'source' will be owned by $(whoami), excluding the root directory (/). 'source' must be an absolute path; both 'source' and 'target' must not @@ -354,17 +360,17 @@ def make_safe_symlink_command(cls, *, source: str, target: str) -> str: target) # Below, use sudo in case the symlink needs sudo access to create. # Prepare to create the symlink: - # 1. make sure its dir(s) exist & are owned by $USER. + # 1. make sure its dir(s) exist & are owned by $(whoami). dir_of_symlink = os.path.dirname(source) commands = [ # mkdir, then loop over '/a/b/c' as /a, /a/b, /a/b/c. For each, - # chown $USER on it so user can use these intermediate dirs + # chown $(whoami) on it so user can use these intermediate dirs # (excluding /). f'sudo mkdir -p {dir_of_symlink}', # p: path so far ('(p=""; ' f'for w in $(echo {dir_of_symlink} | tr "/" " "); do ' - 'p=${p}/${w}; sudo chown $USER $p; done)') + 'p=${p}/${w}; sudo chown $(whoami) $p; done)') ] # 2. remove any existing symlink (ln -f may throw 'cannot # overwrite directory', if the link exists and points to a @@ -380,7 +386,7 @@ def make_safe_symlink_command(cls, *, source: str, target: str) -> str: # Link. f'sudo ln -s {target} {source}', # chown. -h to affect symlinks only. 
- f'sudo chown -h $USER {source}', + f'sudo chown -h $(whoami) {source}', ] return ' && '.join(commands) @@ -797,8 +803,14 @@ def write_cluster_config( assert cluster_name is not None excluded_clouds = [] remote_identity = skypilot_config.get_nested( - (str(cloud).lower(), 'remote_identity'), 'LOCAL_CREDENTIALS') - if remote_identity == 'SERVICE_ACCOUNT': + (str(cloud).lower(), 'remote_identity'), + schemas.get_default_remote_identity(str(cloud).lower())) + if remote_identity is not None and not isinstance(remote_identity, str): + for profile in remote_identity: + if fnmatch.fnmatchcase(cluster_name, list(profile.keys())[0]): + remote_identity = list(profile.values())[0] + break + if remote_identity != schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value: if not cloud.supports_service_account_on_remote(): raise exceptions.InvalidCloudConfigs( 'remote_identity: SERVICE_ACCOUNT is specified in ' @@ -840,13 +852,20 @@ def write_cluster_config( ssh_proxy_command = ssh_proxy_command_config[region_name] logger.debug(f'Using ssh_proxy_command: {ssh_proxy_command!r}') - # User-supplied instance tags. - instance_tags = {} - instance_tags = skypilot_config.get_nested( - (str(cloud).lower(), 'instance_tags'), {}) - # instance_tags is a dict, which is guaranteed by the type check in + # User-supplied global instance tags from ~/.sky/config.yaml. + labels = skypilot_config.get_nested((str(cloud).lower(), 'labels'), {}) + # Deprecated: instance_tags have been replaced by labels. For backward + # compatibility, we support them and the schema allows them only if + # `labels` are not specified. This should be removed after 0.7.0. + labels = skypilot_config.get_nested((str(cloud).lower(), 'instance_tags'), + labels) + # labels is a dict, which is guaranteed by the type check in # schemas.py - assert isinstance(instance_tags, dict), instance_tags + assert isinstance(labels, dict), labels + + # Get labels from resources and override from the labels to_provision. + if to_provision.labels: + labels.update(to_provision.labels) # Dump the Ray ports to a file for Ray job submission dump_port_command = ( @@ -879,8 +898,10 @@ def write_cluster_config( 'vpc_name': skypilot_config.get_nested( (str(cloud).lower(), 'vpc_name'), None), - # User-supplied instance tags. - 'instance_tags': instance_tags, + # User-supplied labels. + 'labels': labels, + # User-supplied remote_identity + 'remote_identity': remote_identity, # The reservation pools that specified by the user. This is # currently only used by GCP. 'specific_reservations': specific_reservations, @@ -904,7 +925,14 @@ def write_cluster_config( 'dump_port_command': dump_port_command, # Sky-internal constants. 'sky_ray_cmd': constants.SKY_RAY_CMD, - 'sky_pip_cmd': constants.SKY_PIP_CMD, + # pip install needs to have python env activated to make sure + # installed packages are within the env path. + 'sky_pip_cmd': f'{constants.SKY_PIP_CMD}', + # Activate the SkyPilot runtime environment when starting ray + # cluster, so that ray autoscaler can access cloud SDK and CLIs + # on remote + 'sky_activate_python_env': + constants.ACTIVATE_SKY_REMOTE_PYTHON_ENV, 'ray_version': constants.SKY_REMOTE_RAY_VERSION, # Command for waiting ray cluster to be ready on head. 
'ray_head_wait_initialized_command': @@ -951,7 +979,11 @@ def write_cluster_config( with open(tmp_yaml_path, 'w', encoding='utf-8') as f: f.write(restored_yaml_content) - config_dict['cluster_name_on_cloud'] = cluster_name_on_cloud + # Read the cluster name from the tmp yaml file, to take the backward + # compatbility restortion above into account. + # TODO: remove this after 2 minor releases, 0.8.0. + yaml_config = common_utils.read_yaml(tmp_yaml_path) + config_dict['cluster_name_on_cloud'] = yaml_config['cluster_name'] # Optimization: copy the contents of source files in file_mounts to a # special dir, and upload that as the only file_mount instead. Delay @@ -1059,7 +1091,7 @@ def get_ready_nodes_counts(pattern, output): def get_docker_user(ip: str, cluster_config_file: str) -> str: """Find docker container username.""" ssh_credentials = ssh_credential_from_yaml(cluster_config_file) - runner = command_runner.SSHCommandRunner(ip, port=22, **ssh_credentials) + runner = command_runner.SSHCommandRunner(node=(ip, 22), **ssh_credentials) container_name = constants.DEFAULT_DOCKER_CONTAINER_NAME whoami_returncode, whoami_stdout, whoami_stderr = runner.run( f'sudo docker exec {container_name} whoami', @@ -1092,7 +1124,7 @@ def wait_until_ray_cluster_ready( try: head_ip = _query_head_ip_with_retries( cluster_config_file, max_attempts=WAIT_HEAD_NODE_IP_MAX_ATTEMPTS) - except exceptions.FetchIPError as e: + except exceptions.FetchClusterInfoError as e: logger.error(common_utils.format_exception(e)) return False, None # failed @@ -1108,8 +1140,7 @@ def wait_until_ray_cluster_ready( ssh_credentials = ssh_credential_from_yaml(cluster_config_file, docker_user) last_nodes_so_far = 0 start = time.time() - runner = command_runner.SSHCommandRunner(head_ip, - port=22, + runner = command_runner.SSHCommandRunner(node=(head_ip, 22), **ssh_credentials) with rich_utils.safe_status( '[bold cyan]Waiting for workers...') as worker_status: @@ -1215,7 +1246,7 @@ def ssh_credential_from_yaml( def parallel_data_transfer_to_nodes( - runners: List[command_runner.SSHCommandRunner], + runners: List[command_runner.CommandRunner], source: Optional[str], target: str, cmd: Optional[str], @@ -1225,32 +1256,36 @@ def parallel_data_transfer_to_nodes( # Advanced options. log_path: str = os.devnull, stream_logs: bool = False, + source_bashrc: bool = False, ): """Runs a command on all nodes and optionally runs rsync from src->dst. Args: - runners: A list of SSHCommandRunner objects that represent multiple nodes. + runners: A list of CommandRunner objects that represent multiple nodes. source: Optional[str]; Source for rsync on local node target: str; Destination on remote node for rsync cmd: str; Command to be executed on all nodes action_message: str; Message to be printed while the command runs log_path: str; Path to the log file stream_logs: bool; Whether to stream logs to stdout + source_bashrc: bool; Source bashrc before running the command. """ fore = colorama.Fore style = colorama.Style origin_source = source - def _sync_node(runner: 'command_runner.SSHCommandRunner') -> None: + def _sync_node(runner: 'command_runner.CommandRunner') -> None: if cmd is not None: rc, stdout, stderr = runner.run(cmd, log_path=log_path, stream_logs=stream_logs, - require_outputs=True) + require_outputs=True, + source_bashrc=source_bashrc) err_msg = ('Failed to run command before rsync ' f'{origin_source} -> {target}. ' - 'Ensure that the network is stable, then retry.') + 'Ensure that the network is stable, then retry. 
' + f'{cmd}') if log_path != os.devnull: err_msg += f' See logs in {log_path}' subprocess_utils.handle_returncode(rc, @@ -1315,7 +1350,7 @@ def _query_head_ip_with_retries(cluster_yaml: str, """Returns the IP of the head node by querying the cloud. Raises: - exceptions.FetchIPError: if we failed to get the head IP. + exceptions.FetchClusterInfoError: if we failed to get the head IP. """ backoff = common_utils.Backoff(initial_backoff=5, max_backoff_factor=5) for i in range(max_attempts): @@ -1344,8 +1379,8 @@ def _query_head_ip_with_retries(cluster_yaml: str, break except subprocess.CalledProcessError as e: if i == max_attempts - 1: - raise exceptions.FetchIPError( - reason=exceptions.FetchIPError.Reason.HEAD) from e + raise exceptions.FetchClusterInfoError( + reason=exceptions.FetchClusterInfoError.Reason.HEAD) from e # Retry if the cluster is not up yet. logger.debug('Retrying to get head ip.') time.sleep(backoff.current_backoff()) @@ -1370,7 +1405,7 @@ def get_node_ips(cluster_yaml: str, IPs. Raises: - exceptions.FetchIPError: if we failed to get the IPs. e.reason is + exceptions.FetchClusterInfoError: if we failed to get the IPs. e.reason is HEAD or WORKER. """ ray_config = common_utils.read_yaml(cluster_yaml) @@ -1391,11 +1426,12 @@ def get_node_ips(cluster_yaml: str, 'Failed to get cluster info for ' f'{ray_config["cluster_name"]} from the new provisioner ' f'with {common_utils.format_exception(e)}.') - raise exceptions.FetchIPError( - exceptions.FetchIPError.Reason.HEAD) from e + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.HEAD) from e if len(metadata.instances) < expected_num_nodes: # Simulate the exception when Ray head node is not up. - raise exceptions.FetchIPError(exceptions.FetchIPError.Reason.HEAD) + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.HEAD) return metadata.get_feasible_ips(get_internal_ips) if get_internal_ips: @@ -1425,8 +1461,8 @@ def get_node_ips(cluster_yaml: str, break except subprocess.CalledProcessError as e: if retry_cnt == worker_ip_max_attempts - 1: - raise exceptions.FetchIPError( - exceptions.FetchIPError.Reason.WORKER) from e + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.WORKER) from e # Retry if the ssh is not ready for the workers yet. backoff_time = backoff.current_backoff() logger.debug('Retrying to get worker ip ' @@ -1451,8 +1487,8 @@ def get_node_ips(cluster_yaml: str, f'detected IP(s): {worker_ips[-n:]}.') worker_ips = worker_ips[-n:] else: - raise exceptions.FetchIPError( - exceptions.FetchIPError.Reason.WORKER) + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.WORKER) else: worker_ips = [] return head_ip_list + worker_ips @@ -1516,7 +1552,7 @@ def check_owner_identity(cluster_name: str) -> None: for i, (owner, current) in enumerate(zip(owner_identity, current_user_identity)): - # Clean up the owner identiy for the backslash and newlines, caused + # Clean up the owner identity for the backslash and newlines, caused # by the cloud CLI output, e.g. gcloud. owner = owner.replace('\n', '').replace('\\', '') if owner == current: @@ -1739,14 +1775,11 @@ def _update_cluster_status_no_lock( def run_ray_status_to_check_ray_cluster_healthy() -> bool: try: - # TODO(zhwu): This function cannot distinguish transient network - # error in ray's get IPs vs. ray runtime failing. - # NOTE: fetching the IPs is very slow as it calls into # `ray get head-ip/worker-ips`. 
Using cached IPs is safe because # in the worst case we time out in the `ray status` SSH command # below. - external_ips = handle.cached_external_ips + runners = handle.get_command_runners(force_cached=True) # This happens when user interrupt the `sky launch` process before # the first time resources handle is written back to local database. # This is helpful when user interrupt after the provision is done @@ -1754,27 +1787,13 @@ def run_ray_status_to_check_ray_cluster_healthy() -> bool: # helps keep the cluster status to INIT after `sky status -r`, so # user will be notified that any auto stop/down might not be # triggered. - if external_ips is None or len(external_ips) == 0: + if not runners: logger.debug(f'Refreshing status ({cluster_name!r}): No cached ' f'IPs found. Handle: {handle}') - raise exceptions.FetchIPError( - reason=exceptions.FetchIPError.Reason.HEAD) - - # Potentially refresh the external SSH ports, in case the existing - # cluster before #2491 was launched without external SSH ports - # cached. - external_ssh_ports = handle.external_ssh_ports() - head_ssh_port = external_ssh_ports[0] - - # Check if ray cluster status is healthy. - ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml, - handle.docker_user, - handle.ssh_user) - - runner = command_runner.SSHCommandRunner(external_ips[0], - **ssh_credentials, - port=head_ssh_port) - rc, output, stderr = runner.run( + raise exceptions.FetchClusterInfoError( + reason=exceptions.FetchClusterInfoError.Reason.HEAD) + head_runner = runners[0] + rc, output, stderr = head_runner.run( instance_setup.RAY_STATUS_WITH_SKY_RAY_PORT_COMMAND, stream_logs=False, require_outputs=True, @@ -1794,7 +1813,7 @@ def run_ray_status_to_check_ray_cluster_healthy() -> bool: f'Refreshing status ({cluster_name!r}): ray status not showing ' f'all nodes ({ready_head + ready_workers}/' f'{total_nodes}); output: {output}; stderr: {stderr}') - except exceptions.FetchIPError: + except exceptions.FetchClusterInfoError: logger.debug( f'Refreshing status ({cluster_name!r}) failed to get IPs.') except RuntimeError as e: @@ -2232,18 +2251,18 @@ def check_cluster_available( # TODO(tian): Refactor to controller_utils. Current blocker: circular import. def is_controller_accessible( - controller_type: controller_utils.Controllers, + controller: controller_utils.Controllers, stopped_message: str, non_existent_message: Optional[str] = None, exit_if_not_accessible: bool = False, ) -> 'backends.CloudVmRayResourceHandle': - """Check if the spot/serve controller is up. + """Check if the jobs/serve controller is up. The controller is accessible when it is in UP or INIT state, and the ssh connection is successful. It can be used to check if the controller is accessible (since the autostop - is set for the controller) before the spot/serve commands interact with the + is set for the controller) before the jobs/serve commands interact with the controller. ClusterNotUpError will be raised whenever the controller cannot be accessed. @@ -2267,10 +2286,8 @@ def is_controller_accessible( failed to be connected. 
""" if non_existent_message is None: - non_existent_message = ( - controller_type.value.default_hint_if_non_existent) - cluster_name = controller_type.value.cluster_name - controller_name = controller_type.value.name.replace(' controller', '') + non_existent_message = controller.value.default_hint_if_non_existent + cluster_name = controller.value.cluster_name need_connection_check = False controller_status, handle = None, None try: @@ -2292,7 +2309,7 @@ def is_controller_accessible( # will not start the controller manually from the cloud console. # # The acquire_lock_timeout is set to 0 to avoid hanging the command when - # multiple spot.launch commands are running at the same time. Our later + # multiple jobs.launch commands are running at the same time. Our later # code will check if the controller is accessible by directly checking # the ssh connection to the controller, if it fails to get accurate # status of the controller. @@ -2304,6 +2321,7 @@ def is_controller_accessible( # We do not catch the exceptions related to the cluster owner identity # mismatch, please refer to the comment in # `backend_utils.check_cluster_available`. + controller_name = controller.value.name.replace(' controller', '') logger.warning( 'Failed to get the status of the controller. It is not ' f'fatal, but {controller_name} commands/calls may hang or return ' @@ -2329,18 +2347,18 @@ def is_controller_accessible( elif (controller_status == status_lib.ClusterStatus.INIT or need_connection_check): # Check ssh connection if (1) controller is in INIT state, or (2) we failed to fetch the - # status, both of which can happen when controller's status lock is held by another `sky spot launch` or + # status, both of which can happen when controller's status lock is held by another `sky jobs launch` or # `sky serve up`. If we have controller's head_ip available and it is ssh-reachable, # we can allow access to the controller. ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml, handle.docker_user, handle.ssh_user) - runner = command_runner.SSHCommandRunner(handle.head_ip, - **ssh_credentials, - port=handle.head_ssh_port) + runner = command_runner.SSHCommandRunner(node=(handle.head_ip, + handle.head_ssh_port), + **ssh_credentials) if not runner.check_connection(): - error_msg = controller_type.value.connection_error_hint + error_msg = controller.value.connection_error_hint else: assert controller_status == status_lib.ClusterStatus.UP, handle @@ -2379,7 +2397,7 @@ def get_clusters( of the clusters. Args: - include_controller: Whether to include controllers, e.g. spot controller + include_controller: Whether to include controllers, e.g. jobs controller or sky serve controller. refresh: Whether to refresh the status of the clusters. (Refreshing will set the status to STOPPED if the cluster cannot be pinged.) @@ -2539,8 +2557,8 @@ def get_task_demands_dict(task: 'task_lib.Task') -> Dict[str, float]: optionally accelerator demands. """ # TODO: Custom CPU and other memory resources are not supported yet. - # For sky spot/serve controller task, we set the CPU resource to a smaller - # value to support a larger number of spot jobs and services. + # For sky jobs/serve controller task, we set the CPU resource to a smaller + # value to support a larger number of managed jobs and services. 
resources_dict = { 'CPU': (constants.CONTROLLER_PROCESS_CPU_DEMAND if task.is_controller_task() else DEFAULT_TASK_CPU_DEMAND) @@ -2557,41 +2575,58 @@ def get_task_demands_dict(task: 'task_lib.Task') -> Dict[str, float]: return resources_dict -def get_task_resources_str(task: 'task_lib.Task') -> str: +def get_task_resources_str(task: 'task_lib.Task', + is_managed_job: bool = False) -> str: """Returns the resources string of the task. The resources string is only used as a display purpose, so we only show the accelerator demands (if any). Otherwise, the CPU demand is shown. """ - task_cpu_demand = (constants.CONTROLLER_PROCESS_CPU_DEMAND if - task.is_controller_task() else DEFAULT_TASK_CPU_DEMAND) + spot_str = '' + task_cpu_demand = (str(constants.CONTROLLER_PROCESS_CPU_DEMAND) + if task.is_controller_task() else + str(DEFAULT_TASK_CPU_DEMAND)) if task.best_resources is not None: accelerator_dict = task.best_resources.accelerators + if is_managed_job: + if task.best_resources.use_spot: + spot_str = '[Spot]' + task_cpu_demand = task.best_resources.cpus if accelerator_dict is None: resources_str = f'CPU:{task_cpu_demand}' else: resources_str = ', '.join( f'{k}:{v}' for k, v in accelerator_dict.items()) - elif len(task.resources) == 1: - resources_dict = list(task.resources)[0].accelerators - if resources_dict is None: - resources_str = f'CPU:{task_cpu_demand}' - else: - resources_str = ', '.join( - f'{k}:{v}' for k, v in resources_dict.items()) else: resource_accelerators = [] + min_cpus = float('inf') + spot_type: Set[str] = set() for resource in task.resources: + task_cpu_demand = '1+' + if resource.cpus is not None: + task_cpu_demand = resource.cpus + min_cpus = min(min_cpus, float(task_cpu_demand.strip('+ '))) + if resource.use_spot: + spot_type.add('Spot') + else: + spot_type.add('On-demand') + if resource.accelerators is None: continue for k, v in resource.accelerators.items(): resource_accelerators.append(f'{k}:{v}') + if is_managed_job: + if len(task.resources) > 1: + task_cpu_demand = f'{min_cpus}+' + if 'Spot' in spot_type: + spot_str = '|'.join(sorted(spot_type)) + spot_str = f'[{spot_str}]' if resource_accelerators: resources_str = ', '.join(set(resource_accelerators)) else: resources_str = f'CPU:{task_cpu_demand}' - resources_str = f'{task.num_nodes}x [{resources_str}]' + resources_str = f'{task.num_nodes}x[{resources_str}]{spot_str}' return resources_str @@ -2619,27 +2654,6 @@ def stop_handler(signum, frame): raise KeyboardInterrupt(exceptions.SIGTSTP_CODE) -def run_command_and_handle_ssh_failure(runner: command_runner.SSHCommandRunner, - command: str, - failure_message: str) -> str: - """Runs command remotely and returns output with proper error handling.""" - rc, stdout, stderr = runner.run(command, - require_outputs=True, - stream_logs=False) - if rc == 255: - # SSH failed - raise RuntimeError( - f'SSH with user {runner.ssh_user} and key {runner.ssh_private_key} ' - f'to {runner.ip} failed. This is most likely due to incorrect ' - 'credentials or incorrect permissions for the key file. Check ' - 'your credentials and try again.') - subprocess_utils.handle_returncode(rc, - command, - failure_message, - stderr=stderr) - return stdout - - def check_rsync_installed() -> None: """Checks if rsync is installed. 
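A minimal usage sketch of the `get_endpoints` helper that the next hunk introduces; the cluster name and port here are hypothetical, and the return semantics follow the function's docstring:
```python
from sky.backends import backend_utils

# All exposed ports -> endpoint URLs for a cluster named 'llama3'.
endpoints = backend_utils.get_endpoints('llama3')

# A single port: per the docstring, an empty dict is returned until the
# cloud provider actually exposes the port, so callers should retry.
url = backend_utils.get_endpoints('llama3', port=8081).get(8081)
```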
@@ -2681,3 +2695,115 @@ def check_stale_runtime_on_remote(returncode: int, stderr: str, f'not interrupted): {colorama.Style.BRIGHT}sky start -f -y ' f'{cluster_name}{colorama.Style.RESET_ALL}' f'\n--- Details ---\n{stderr.strip()}\n') + + +def get_endpoints(cluster: str, + port: Optional[Union[int, str]] = None, + skip_status_check: bool = False) -> Dict[int, str]: + """Gets the endpoint for a given cluster and port number (endpoint). + + Args: + cluster: The name of the cluster. + port: The port number to get the endpoint for. If None, endpoints + for all ports are returned. + skip_status_check: Whether to skip the status check for the cluster. + This is useful when the cluster is known to be in a INIT state + and the caller wants to query the endpoints. Used by serve + controller to query endpoints during cluster launch when multiple + services may be getting launched in parallel (and as a result, + the controller may be in INIT status due to a concurrent launch). + + Returns: A dictionary of port numbers to endpoints. If endpoint is None, + the dictionary will contain all ports:endpoints exposed on the cluster. + If the endpoint is not exposed yet (e.g., during cluster launch or + waiting for cloud provider to expose the endpoint), an empty dictionary + is returned. + + Raises: + ValueError: if the port is invalid or the cloud provider does not + support querying endpoints. + exceptions.ClusterNotUpError: if the cluster is not in UP status. + """ + # Cast endpoint to int if it is not None + if port is not None: + try: + port = int(port) + except ValueError: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Invalid endpoint {port!r}.') from None + cluster_records = get_clusters(include_controller=True, + refresh=False, + cluster_names=[cluster]) + cluster_record = cluster_records[0] + if (not skip_status_check and + cluster_record['status'] != status_lib.ClusterStatus.UP): + with ux_utils.print_exception_no_traceback(): + raise exceptions.ClusterNotUpError( + f'Cluster {cluster_record["name"]!r} ' + 'is not in UP status.', cluster_record['status']) + handle = cluster_record['handle'] + if not isinstance(handle, backends.CloudVmRayResourceHandle): + with ux_utils.print_exception_no_traceback(): + raise ValueError('Querying IP address is not supported ' + f'for cluster {cluster!r} with backend ' + f'{get_backend_from_handle(handle).NAME}.') + + launched_resources = handle.launched_resources + cloud = launched_resources.cloud + try: + cloud.check_features_are_supported( + launched_resources, {clouds.CloudImplementationFeatures.OPEN_PORTS}) + except exceptions.NotSupportedError: + with ux_utils.print_exception_no_traceback(): + raise ValueError('Querying endpoints is not supported ' + f'for cluster {cluster!r} on {cloud}.') from None + + config = common_utils.read_yaml(handle.cluster_yaml) + port_details = provision_lib.query_ports(repr(cloud), + handle.cluster_name_on_cloud, + handle.launched_resources.ports, + head_ip=handle.head_ip, + provider_config=config['provider']) + + # Validation before returning the endpoints + if port is not None: + # If the requested endpoint was not to be exposed + port_set = resources_utils.port_ranges_to_set( + handle.launched_resources.ports) + if port not in port_set: + logger.warning(f'Port {port} is not exposed on ' + f'cluster {cluster!r}.') + return {} + # If the user requested a specific port endpoint, check if it is exposed + if port not in port_details: + error_msg = (f'Port {port} not exposed yet. 
' + f'{_ENDPOINTS_RETRY_MESSAGE} ') + if handle.launched_resources.cloud.is_same_cloud( + clouds.Kubernetes()): + # Add Kubernetes specific debugging info + error_msg += (kubernetes_utils.get_endpoint_debug_message()) + logger.warning(error_msg) + return {} + return {port: port_details[port][0].url()} + else: + if not port_details: + # If cluster had no ports to be exposed + if handle.launched_resources.ports is None: + logger.warning(f'Cluster {cluster!r} does not have any ' + 'ports to be exposed.') + return {} + # Else ports have not been exposed even though they exist. + # In this case, ask the user to retry. + else: + error_msg = (f'No endpoints exposed yet. ' + f'{_ENDPOINTS_RETRY_MESSAGE} ') + if handle.launched_resources.cloud.is_same_cloud( + clouds.Kubernetes()): + # Add Kubernetes specific debugging info + error_msg += \ + kubernetes_utils.get_endpoint_debug_message() + logger.warning(error_msg) + return {} + return { + port_num: urls[0].url() for port_num, urls in port_details.items() + } diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 44ade8c9c5e..f0b5db6e2ba 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -1,7 +1,7 @@ """Backend: runs on cloud virtual machines, managed by Ray.""" -import base64 import copy import enum +import functools import getpass import inspect import json @@ -9,6 +9,7 @@ import os import pathlib import re +import shlex import signal import subprocess import sys @@ -28,12 +29,12 @@ from sky import clouds from sky import exceptions from sky import global_user_state +from sky import jobs as managed_jobs from sky import optimizer from sky import provision as provision_lib from sky import resources as resources_lib from sky import serve as serve_lib from sky import sky_logging -from sky import spot as spot_lib from sky import status_lib from sky import task as task_lib from sky.backends import backend_utils @@ -135,6 +136,19 @@ pathlib.Path(sky.__file__).resolve().parent / 'backends' / 'monkey_patches' / 'monkey_patch_ray_up.py') +# The maximum size of a command line arguments is 128 KB, i.e. the command +# executed with /bin/sh should be less than 128KB. +# https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h +# +# If a user have very long run or setup commands, the generated command may +# exceed the limit, as we directly include scripts in job submission commands. +# If the command is too long, we instead write it to a file, rsync and execute +# it. +# +# We use 120KB as a threshold to be safe for other arguments that +# might be added during ssh. +_MAX_INLINE_SCRIPT_LENGTH = 120 * 1024 + def _get_cluster_config_template(cloud): cloud_to_template = { @@ -268,7 +282,15 @@ def get_or_fail(futures, pg) -> List[int]: \"\"\"Wait for tasks, if any fails, cancel all unready.\"\"\" returncodes = [1] * len(futures) # Wait for 1 task to be ready. - ready, unready = ray.wait(futures) + ready = [] + # Keep invoking ray.wait if ready is empty. This is because + # ray.wait with timeout=None will only wait for 10**6 seconds, + # which will cause tasks running for more than 12 days to return + # before becoming ready. + # (Such tasks are common in serving jobs.) 
+ # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846 + while not ready: + ready, unready = ray.wait(futures) idx = futures.index(ready[0]) returncodes[idx] = ray.get(ready[0]) while unready: @@ -1470,10 +1492,12 @@ def _retry_zones( if zones and len(zones) == 1: launched_resources = launched_resources.copy(zone=zones[0].name) - prev_cluster_ips, prev_ssh_ports = None, None + prev_cluster_ips, prev_ssh_ports, prev_cluster_info = (None, None, + None) if prev_handle is not None: prev_cluster_ips = prev_handle.stable_internal_external_ips prev_ssh_ports = prev_handle.stable_ssh_ports + prev_cluster_info = prev_handle.cached_cluster_info # Record early, so if anything goes wrong, 'sky status' will show # the cluster name and users can appropriately 'sky down'. It also # means a second 'sky launch -c ' will attempt to reuse. @@ -1492,7 +1516,9 @@ def _retry_zones( # optimize the case where the cluster is restarted, i.e., no # need to query IPs and ports from the cloud provider. stable_internal_external_ips=prev_cluster_ips, - stable_ssh_ports=prev_ssh_ports) + stable_ssh_ports=prev_ssh_ports, + cluster_info=prev_cluster_info, + ) usage_lib.messages.usage.update_final_cluster_status( status_lib.ClusterStatus.INIT) @@ -1573,14 +1599,14 @@ def _retry_zones( # manually or by the cloud provider. # Optimize the case where the cluster's head IPs can be parsed # from the output of 'ray up'. - kwargs = {} if handle.launched_nodes == 1: - kwargs = { - 'internal_ips': [head_internal_ip], - 'external_ips': [head_external_ip] - } - handle.update_cluster_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS, - **kwargs) + handle.update_cluster_ips( + max_attempts=_FETCH_IP_MAX_ATTEMPTS, + internal_ips=[head_internal_ip], + external_ips=[head_external_ip]) + else: + handle.update_cluster_ips( + max_attempts=_FETCH_IP_MAX_ATTEMPTS) handle.update_ssh_ports(max_attempts=_FETCH_IP_MAX_ATTEMPTS) if cluster_exists: # Guard against the case where there's an existing cluster @@ -1983,9 +2009,20 @@ def provision_with_retries( cloud_user = None else: cloud_user = to_provision.cloud.get_current_user_identity() + + requested_features = self._requested_features.copy() + # Skip stop feature for Kubernetes controllers. + if (isinstance(to_provision.cloud, clouds.Kubernetes) and + controller_utils.Controllers.from_name(cluster_name) + is not None): + assert (clouds.CloudImplementationFeatures.STOP + in requested_features), requested_features + requested_features.remove( + clouds.CloudImplementationFeatures.STOP) + # Skip if to_provision.cloud does not support requested features to_provision.cloud.check_features_are_supported( - to_provision, self._requested_features) + to_provision, requested_features) config_dict = self._retry_zones( to_provision, @@ -2106,7 +2143,7 @@ class CloudVmRayResourceHandle(backends.backend.ResourceHandle): """ # Bump if any fields get added/removed/changed, and add backward # compaitibility logic in __setstate__. - _VERSION = 7 + _VERSION = 8 def __init__( self, @@ -2119,6 +2156,7 @@ def __init__( stable_internal_external_ips: Optional[List[Tuple[str, str]]] = None, stable_ssh_ports: Optional[List[int]] = None, + cluster_info: Optional[provision_common.ClusterInfo] = None, # The following 2 fields are deprecated. SkyPilot new provisioner # API handles the TPU node creation/deletion. # Backward compatibility for TPU nodes created before #2943. @@ -2135,10 +2173,10 @@ def __init__( # internal or external ips, depending on the use_internal_ips flag. 
         self.stable_internal_external_ips = stable_internal_external_ips
         self.stable_ssh_ports = stable_ssh_ports
+        self.cached_cluster_info = cluster_info
         self.launched_nodes = launched_nodes
         self.launched_resources = launched_resources
         self.docker_user: Optional[str] = None
-        self.ssh_user: Optional[str] = None
         # Deprecated. SkyPilot new provisioner API handles the TPU node
         # creation/deletion.
         # Backward compatibility for TPU nodes created before #2943.
@@ -2202,13 +2240,8 @@ def update_ssh_ports(self, max_attempts: int = 1) -> None:
         Use this method to use any cloud-specific port fetching logic.
         """
         del max_attempts  # Unused.
-        if isinstance(self.launched_resources.cloud, clouds.RunPod):
-            cluster_info = provision_lib.get_cluster_info(
-                str(self.launched_resources.cloud).lower(),
-                region=self.launched_resources.region,
-                cluster_name_on_cloud=self.cluster_name_on_cloud,
-                provider_config=None)
-            self.stable_ssh_ports = cluster_info.get_ssh_ports()
+        if self.cached_cluster_info is not None:
+            self.stable_ssh_ports = self.cached_cluster_info.get_ssh_ports()
             return

         head_ssh_port = 22
@@ -2216,11 +2249,49 @@ def update_ssh_ports(self, max_attempts: int = 1) -> None:
             [head_ssh_port] + [22] *
             (self.num_ips_per_node * self.launched_nodes - 1))

+    def _update_cluster_info(self):
+        # When a cluster is on a cloud that does not support the new
+        # provisioner, we should skip updating cluster_info.
+        if (self.launched_resources.cloud.PROVISIONER_VERSION >=
+                clouds.ProvisionerVersion.SKYPILOT):
+            provider_name = str(self.launched_resources.cloud).lower()
+            config = {}
+            if os.path.exists(self.cluster_yaml):
+                # It is possible that the cluster yaml is not available when
+                # the handle is unpickled for service replicas from the
+                # controller with an older version.
+                config = common_utils.read_yaml(self.cluster_yaml)
+            try:
+                cluster_info = provision_lib.get_cluster_info(
+                    provider_name,
+                    region=self.launched_resources.region,
+                    cluster_name_on_cloud=self.cluster_name_on_cloud,
+                    provider_config=config.get('provider', None))
+            except Exception as e:  # pylint: disable=broad-except
+                # This could happen when the VM is not fully launched, and a
+                # user is trying to terminate it with `sky down`.
+                logger.debug('Failed to get cluster info for '
+                             f'{self.cluster_name} from the new provisioner '
+                             f'with {common_utils.format_exception(e)}.')
+                raise exceptions.FetchClusterInfoError(
+                    exceptions.FetchClusterInfoError.Reason.HEAD) from e
+            if cluster_info.num_instances != self.launched_nodes:
+                logger.debug(
+                    f'Available nodes in the cluster {self.cluster_name} '
+                    'do not match the number of nodes requested ('
+                    f'{cluster_info.num_instances} != '
+                    f'{self.launched_nodes}).')
+                raise exceptions.FetchClusterInfoError(
+                    exceptions.FetchClusterInfoError.Reason.HEAD)
+            self.cached_cluster_info = cluster_info
+
     def update_cluster_ips(
             self,
             max_attempts: int = 1,
             internal_ips: Optional[List[Optional[str]]] = None,
-            external_ips: Optional[List[Optional[str]]] = None) -> None:
+            external_ips: Optional[List[Optional[str]]] = None,
+            cluster_info: Optional[provision_common.ClusterInfo] = None
+    ) -> None:
         """Updates the cluster IPs cached in the handle.

         We cache the cluster IPs in the handle to avoid having to retrieve
@@ -2246,61 +2317,74 @@ def update_cluster_ips(
             external IPs from the cloud provider.

         Raises:
-            exceptions.FetchIPError: if we failed to get the IPs. e.reason is
-                HEAD or WORKER.
+            exceptions.FetchClusterInfoError: if we failed to get the cluster
+                info. e.reason is HEAD or WORKER.
""" - - def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool: - return (ips is not None and - len(ips) == self.num_ips_per_node * self.launched_nodes and - all(ip is not None for ip in ips)) - - use_internal_ips = self._use_internal_ips() - - # cluster_feasible_ips is the list of IPs of the nodes in the cluster - # which can be used to connect to the cluster. It is a list of external - # IPs if the cluster is assigned public IPs, otherwise it is a list of - # internal IPs. - cluster_feasible_ips: List[str] - if is_provided_ips_valid(external_ips): - logger.debug(f'Using provided external IPs: {external_ips}') - cluster_feasible_ips = typing.cast(List[str], external_ips) + if cluster_info is not None: + self.cached_cluster_info = cluster_info + use_internal_ips = self._use_internal_ips() + cluster_feasible_ips = self.cached_cluster_info.get_feasible_ips( + use_internal_ips) + cluster_internal_ips = self.cached_cluster_info.get_feasible_ips( + force_internal_ips=True) else: - cluster_feasible_ips = backend_utils.get_node_ips( - self.cluster_yaml, - self.launched_nodes, - head_ip_max_attempts=max_attempts, - worker_ip_max_attempts=max_attempts, - get_internal_ips=use_internal_ips) - - if self.cached_external_ips == cluster_feasible_ips: - logger.debug('Skipping the fetching of internal IPs as the cached ' - 'external IPs matches the newly fetched ones.') - # Optimization: If the cached external IPs are the same as the - # retrieved feasible IPs, then we can skip retrieving internal - # IPs since the cached IPs are up-to-date. - return - logger.debug( - 'Cached external IPs do not match with the newly fetched ones: ' - f'cached ({self.cached_external_ips}), new ({cluster_feasible_ips})' - ) + # For clouds that do not support the SkyPilot Provisioner API. + # TODO(zhwu): once all the clouds are migrated to SkyPilot + # Provisioner API, we should remove this else block + def is_provided_ips_valid( + ips: Optional[List[Optional[str]]]) -> bool: + return (ips is not None and len(ips) + == self.num_ips_per_node * self.launched_nodes and + all(ip is not None for ip in ips)) + + use_internal_ips = self._use_internal_ips() + + # cluster_feasible_ips is the list of IPs of the nodes in the + # cluster which can be used to connect to the cluster. It is a list + # of external IPs if the cluster is assigned public IPs, otherwise + # it is a list of internal IPs. + if is_provided_ips_valid(external_ips): + logger.debug(f'Using provided external IPs: {external_ips}') + cluster_feasible_ips = typing.cast(List[str], external_ips) + else: + cluster_feasible_ips = backend_utils.get_node_ips( + self.cluster_yaml, + self.launched_nodes, + head_ip_max_attempts=max_attempts, + worker_ip_max_attempts=max_attempts, + get_internal_ips=use_internal_ips) + + if self.cached_external_ips == cluster_feasible_ips: + logger.debug( + 'Skipping the fetching of internal IPs as the cached ' + 'external IPs matches the newly fetched ones.') + # Optimization: If the cached external IPs are the same as the + # retrieved feasible IPs, then we can skip retrieving internal + # IPs since the cached IPs are up-to-date. + return - if use_internal_ips: - # Optimization: if we know use_internal_ips is True (currently - # only exposed for AWS and GCP), then our provisioner is guaranteed - # to not assign public IPs, thus the first list of IPs returned - # above are already private IPs. So skip the second query. 
- cluster_internal_ips = list(cluster_feasible_ips) - elif is_provided_ips_valid(internal_ips): - logger.debug(f'Using provided internal IPs: {internal_ips}') - cluster_internal_ips = typing.cast(List[str], internal_ips) - else: - cluster_internal_ips = backend_utils.get_node_ips( - self.cluster_yaml, - self.launched_nodes, - head_ip_max_attempts=max_attempts, - worker_ip_max_attempts=max_attempts, - get_internal_ips=True) + logger.debug( + 'Cached external IPs do not match with the newly fetched ones: ' + f'cached ({self.cached_external_ips}), new ' + f'({cluster_feasible_ips})') + + if use_internal_ips: + # Optimization: if we know use_internal_ips is True (currently + # only exposed for AWS and GCP), then our provisioner is + # guaranteed to not assign public IPs, thus the first list of + # IPs returned above are already private IPs. So skip the second + # query. + cluster_internal_ips = list(cluster_feasible_ips) + elif is_provided_ips_valid(internal_ips): + logger.debug(f'Using provided internal IPs: {internal_ips}') + cluster_internal_ips = typing.cast(List[str], internal_ips) + else: + cluster_internal_ips = backend_utils.get_node_ips( + self.cluster_yaml, + self.launched_nodes, + head_ip_max_attempts=max_attempts, + worker_ip_max_attempts=max_attempts, + get_internal_ips=True) assert len(cluster_feasible_ips) == len(cluster_internal_ips), ( f'Cluster {self.cluster_name!r}:' @@ -2319,6 +2403,39 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool: internal_external_ips[1:], key=lambda x: x[1]) self.stable_internal_external_ips = stable_internal_external_ips + @functools.lru_cache() + @timeline.event + def get_command_runners(self, + force_cached: bool = False, + avoid_ssh_control: bool = False + ) -> List[command_runner.CommandRunner]: + """Returns a list of command runners for the cluster.""" + ssh_credentials = backend_utils.ssh_credential_from_yaml( + self.cluster_yaml, self.docker_user, self.ssh_user) + if avoid_ssh_control: + ssh_credentials.pop('ssh_control_name', None) + if (clouds.ProvisionerVersion.RAY_PROVISIONER_SKYPILOT_TERMINATOR >= + self.launched_resources.cloud.PROVISIONER_VERSION): + ip_list = (self.cached_external_ips + if force_cached else self.external_ips()) + if ip_list is None: + return [] + # Potentially refresh the external SSH ports, in case the existing + # cluster before #2491 was launched without external SSH ports + # cached. + port_list = self.external_ssh_ports() + runners = command_runner.SSHCommandRunner.make_runner_list( + zip(ip_list, port_list), **ssh_credentials) + return runners + if self.cached_cluster_info is None: + assert not force_cached, 'cached_cluster_info is None.' + self._update_cluster_info() + assert self.cached_cluster_info is not None, self + runners = provision_lib.get_command_runners( + self.cached_cluster_info.provider_name, self.cached_cluster_info, + **ssh_credentials) + return runners + @property def cached_internal_ips(self) -> Optional[List[str]]: if self.stable_internal_external_ips is not None: @@ -2384,6 +2501,16 @@ def setup_docker_user(self, cluster_config_file: str): def cluster_yaml(self): return os.path.expanduser(self._cluster_yaml) + @property + def ssh_user(self): + if self.cached_cluster_info is not None: + # Overload ssh_user with the user stored in cluster_info, which is + # useful for kubernetes case, where the ssh_user can depend on the + # container image used. For those clusters launched with ray + # autoscaler, we directly use the ssh_user in yaml config. 
+            return self.cached_cluster_info.ssh_user
+        return None
+
     @property
     def head_ip(self):
         external_ips = self.cached_external_ips
@@ -2429,8 +2556,8 @@ def __setstate__(self, state):
         if version < 6:
             state['cluster_name_on_cloud'] = state['cluster_name']

-        if version < 7:
-            self.ssh_user = None
+        if version < 8:
+            self.cached_cluster_info = None

         self.__dict__.update(state)

@@ -2440,7 +2567,7 @@ def __setstate__(self, state):
         if version < 3 and head_ip is not None:
             try:
                 self.update_cluster_ips()
-            except exceptions.FetchIPError:
+            except exceptions.FetchClusterInfoError:
                 # This occurs when an old cluster from was autostopped,
                 # so the head IP in the database is not updated.
                 pass

         self._update_cluster_region()

+        if version < 8:
+            try:
+                self._update_cluster_info()
+            except exceptions.FetchClusterInfoError:
+                # This occurs when an old cluster was autostopped,
+                # so the head IP in the database is not updated.
+                pass
+

 class CloudVmRayBackend(backends.Backend['CloudVmRayResourceHandle']):
     """Backend: runs on cloud virtual machines, managed by Ray.
@@ -2561,6 +2696,17 @@ def check_resources_fit_cluster(
                             f'{example_resource.zone!r},'
                             'but the existing cluster '
                             f'{zone_str}')
+            if (example_resource.requires_fuse and
+                    not launched_resources.requires_fuse):
+                # Will not be reached for non-k8s case since the
+                # less_demanding_than only fails fuse requirement when
+                # the cloud is Kubernetes AND the cluster doesn't have fuse.
+                with ux_utils.print_exception_no_traceback():
+                    raise exceptions.ResourcesMismatchError(
+                        'Task requires FUSE support for mounting object '
+                        'stores, but the existing cluster with '
+                        f'{launched_resources!r} does not support FUSE '
+                        f'mounting. Launch a new cluster to run this task.')
         requested_resource_str = ', '.join(requested_resource_list)
         if isinstance(task.resources, list):
             requested_resource_str = f'[{requested_resource_str}]'
@@ -2729,22 +2875,17 @@ def _provision(
                     provision_record=provision_record,
                     custom_resource=resources_vars.get('custom_resources'),
                     log_dir=self.log_dir)
-                # We must query the IPs from the cloud provider, when the
-                # provisioning is done, to make sure the cluster IPs are
-                # up-to-date.
+                # We use the IPs from cluster_info to update the cluster IPs
+                # when the provisioning is done, to make sure the cluster IPs
+                # are up-to-date.
                 # The staled IPs may be caused by the node being restarted
                 # manually or by the cloud provider.
                 # Optimize the case where the cluster's IPs can be retrieved
                 # from cluster_info.
-                internal_ips, external_ips = zip(*cluster_info.ip_tuples())
-                if not cluster_info.has_external_ips():
-                    external_ips = internal_ips
+                handle.docker_user = cluster_info.docker_user
                 handle.update_cluster_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS,
-                                          internal_ips=list(internal_ips),
-                                          external_ips=list(external_ips))
+                                          cluster_info=cluster_info)
                 handle.update_ssh_ports(max_attempts=_FETCH_IP_MAX_ATTEMPTS)
-                handle.docker_user = cluster_info.docker_user
-                handle.ssh_user = cluster_info.ssh_user

                 # Update launched resources.
handle.launched_resources = handle.launched_resources.copy( @@ -2776,11 +2917,7 @@ def _provision( handle.launched_resources.cloud.get_zone_shell_cmd()) # zone is None for Azure if get_zone_cmd is not None: - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, - handle.ssh_user) - runners = command_runner.SSHCommandRunner.make_runner_list( - ip_list, port_list=ssh_port_list, **ssh_credentials) + runners = handle.get_command_runners() def _get_zone(runner): retry_count = 0 @@ -2821,8 +2958,11 @@ def _get_zone(runner): logger.debug('Checking if skylet is running on the head node.') with rich_utils.safe_status( '[bold cyan]Preparing SkyPilot runtime'): + # We need to source bashrc for skylet to make sure the autostop + # event can access the path to the cloud CLIs. self.run_on_head(handle, - instance_setup.MAYBE_SKYLET_RESTART_CMD) + instance_setup.MAYBE_SKYLET_RESTART_CMD, + source_bashrc=True) self._update_after_cluster_provisioned( handle, to_provision_config.prev_handle, task, @@ -2921,7 +3061,6 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle, fore = colorama.Fore style = colorama.Style ip_list = handle.external_ips() - port_list = handle.external_ssh_ports() assert ip_list is not None, 'external_ips is not cached in handle' full_workdir = os.path.abspath(os.path.expanduser(workdir)) @@ -2946,14 +3085,10 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle, log_path = os.path.join(self.log_dir, 'workdir_sync.log') - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, handle.ssh_user) - # TODO(zhwu): refactor this with backend_utils.parallel_cmd_with_rsync - runners = command_runner.SSHCommandRunner.make_runner_list( - ip_list, port_list=port_list, **ssh_credentials) + runners = handle.get_command_runners() - def _sync_workdir_node(runner: command_runner.SSHCommandRunner) -> None: + def _sync_workdir_node(runner: command_runner.CommandRunner) -> None: runner.rsync( source=workdir, target=SKY_REMOTE_WORKDIR, @@ -3000,47 +3135,51 @@ def _setup(self, handle: CloudVmRayResourceHandle, task: task_lib.Task, return setup = task.setup # Sync the setup script up and run it. - ip_list = handle.external_ips() internal_ips = handle.internal_ips() - port_list = handle.external_ssh_ports() - assert ip_list is not None, 'external_ips is not cached in handle' - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, handle.ssh_user) - # Disable connection sharing for setup script to avoid old - # connections being reused, which may cause stale ssh agent - # forwarding. 
- ssh_credentials.pop('ssh_control_name', None) - remote_setup_file_name = f'/tmp/sky_setup_{self.run_timestamp}' # Need this `-i` option to make sure `source ~/.bashrc` work setup_cmd = f'/bin/bash -i {remote_setup_file_name} 2>&1' + runners = handle.get_command_runners(avoid_ssh_control=True) def _setup_node(node_id: int) -> None: setup_envs = task.envs.copy() setup_envs.update(self._skypilot_predefined_env_vars(handle)) setup_envs['SKYPILOT_SETUP_NODE_IPS'] = '\n'.join(internal_ips) setup_envs['SKYPILOT_SETUP_NODE_RANK'] = str(node_id) - runner = command_runner.SSHCommandRunner(ip_list[node_id], - port=port_list[node_id], - **ssh_credentials) + runner = runners[node_id] setup_script = log_lib.make_task_bash_script(setup, env_vars=setup_envs) - with tempfile.NamedTemporaryFile('w', prefix='sky_setup_') as f: - f.write(setup_script) - f.flush() - setup_sh_path = f.name - runner.rsync(source=setup_sh_path, - target=remote_setup_file_name, - up=True, - stream_logs=False) + encoded_script = shlex.quote(setup_script) + if (detach_setup or + len(encoded_script) > _MAX_INLINE_SCRIPT_LENGTH): + with tempfile.NamedTemporaryFile('w', prefix='sky_setup_') as f: + f.write(setup_script) + f.flush() + setup_sh_path = f.name + runner.rsync(source=setup_sh_path, + target=remote_setup_file_name, + up=True, + stream_logs=False) + create_script_code = 'true' + else: + create_script_code = (f'{{ echo {encoded_script} > ' + f'{remote_setup_file_name}; }}') + if detach_setup: return setup_log_path = os.path.join(self.log_dir, - f'setup-{runner.ip}.log') + f'setup-{runner.node_id}.log') returncode = runner.run( - setup_cmd, + f'{create_script_code} && {setup_cmd}', log_path=setup_log_path, process_stream=False, + # We do not source bashrc for setup, since bashrc is sourced + # in the script already. + # Skip an empty line and two lines due to the /bin/bash -i and + # source ~/.bashrc in the setup_cmd. + # bash: cannot set terminal process group (7398): Inappropriate ioctl for device # pylint: disable=line-too-long + # bash: no job control in this shell + skip_lines=3, ) def error_message() -> str: @@ -3070,7 +3209,7 @@ def error_message() -> str: command=setup_cmd, error_msg=error_message) - num_nodes = len(ip_list) + num_nodes = len(runners) plural = 's' if num_nodes > 1 else '' if not detach_setup: logger.info(f'{fore.CYAN}Running setup on {num_nodes} node{plural}.' 
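
The `_MAX_INLINE_SCRIPT_LENGTH` policy used for the setup script above, and again for job submission in the next hunk, can be summarized in isolation as follows. This is a simplified sketch (the helper function is illustrative; the real code also handles detached setup and rsyncs via the command runner):

import shlex

_MAX_INLINE_SCRIPT_LENGTH = 120 * 1024  # Stay below the ~128 KB argv limit.


def build_create_script_code(script: str, remote_path: str) -> str:
    """Return shell code that materializes `script` at `remote_path`."""
    encoded = shlex.quote(script)
    if len(encoded) > _MAX_INLINE_SCRIPT_LENGTH:
        # Too long to inline on a command line; the caller should rsync a
        # temp file to `remote_path` instead, so emit a no-op here.
        return 'true'
    # Short enough: write the script remotely via a safely quoted echo.
    return f'{{ echo {encoded} > {remote_path}; }}'


print(build_create_script_code('echo hello', '/tmp/sky_setup_demo'))
# { echo 'echo hello' > /tmp/sky_setup_demo; }
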
@@ -3096,7 +3235,7 @@ def _exec_code_on_head( codegen: str, job_id: int, detach_run: bool = False, - spot_dag: Optional['dag.Dag'] = None, + managed_job_dag: Optional['dag.Dag'] = None, ) -> None: """Executes generated code on the head node.""" style = colorama.Style @@ -3110,10 +3249,8 @@ def _exec_code_on_head( mkdir_code = (f'{cd} && mkdir -p {remote_log_dir} && ' f'touch {remote_log_path}') - encoded_script = base64.b64encode( - codegen.encode('utf-8')).decode('utf-8') - create_script_code = (f'{{ echo "{encoded_script}" | base64 --decode > ' - f'{script_path}; }}') + encoded_script = shlex.quote(codegen) + create_script_code = (f'{{ echo {encoded_script} > {script_path}; }}') job_submit_cmd = ( f'RAY_DASHBOARD_PORT=$({constants.SKY_PYTHON_CMD} -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265);' # pylint: disable=line-too-long f'{cd} && {constants.SKY_RAY_CMD} job submit ' @@ -3125,23 +3262,41 @@ def _exec_code_on_head( code = job_lib.JobLibCodeGen.queue_job(job_id, job_submit_cmd) job_submit_cmd = ' && '.join([mkdir_code, create_script_code, code]) - - if spot_dag is not None: - # Add the spot job to spot queue table. - spot_codegen = spot_lib.SpotCodeGen() - spot_code = spot_codegen.set_pending(job_id, spot_dag) - # Set the spot job to PENDING state to make sure that this spot - # job appears in the `sky spot queue`, when there are already 16 - # controller process jobs running on the controller VM with 8 - # CPU cores. - # The spot job should be set to PENDING state *after* the + if len(job_submit_cmd) > _MAX_INLINE_SCRIPT_LENGTH: + runners = handle.get_command_runners() + head_runner = runners[0] + with tempfile.NamedTemporaryFile('w', prefix='sky_app_') as fp: + fp.write(codegen) + fp.flush() + script_path = os.path.join(SKY_REMOTE_APP_DIR, + f'sky_job_{job_id}') + # We choose to sync code + exec, because the alternative of 'ray + # submit' may not work as it may use system python (python2) to + # execute the script. Happens for AWS. + head_runner.rsync(source=fp.name, + target=script_path, + up=True, + stream_logs=False) + job_submit_cmd = f'{mkdir_code} && {code}' + + if managed_job_dag is not None: + # Add the managed job to job queue database. + managed_job_codegen = managed_jobs.ManagedJobCodeGen() + managed_job_code = managed_job_codegen.set_pending( + job_id, managed_job_dag) + # Set the managed job to PENDING state to make sure that this + # managed job appears in the `sky jobs queue`, when there are + # already 2x vCPU controller processes running on the controller VM, + # e.g., 16 controller processes running on a controller with 8 + # vCPUs. + # The managed job should be set to PENDING state *after* the # controller process job has been queued, as our skylet on spot - # controller will set the spot job in FAILED state if the + # controller will set the managed job in FAILED state if the # controller process job does not exist. - # We cannot set the spot job to PENDING state in the codegen for + # We cannot set the managed job to PENDING state in the codegen for # the controller process job, as it will stay in the job pending # table and not be executed until there is an empty slot. 
-            job_submit_cmd = job_submit_cmd + ' && ' + spot_code
+            job_submit_cmd = job_submit_cmd + ' && ' + managed_job_code

         returncode, stdout, stderr = self.run_on_head(handle,
                                                       job_submit_cmd,
@@ -3162,8 +3317,9 @@
         try:
             if not detach_run:
-                if handle.cluster_name == spot_lib.SPOT_CONTROLLER_NAME:
-                    self.tail_spot_logs(handle, job_id)
+                if (handle.cluster_name in controller_utils.Controllers.
+                        JOBS_CONTROLLER.value.candidate_cluster_names):
+                    self.tail_managed_job_logs(handle, job_id)
                 else:
                     # Sky logs. Not using subprocess.run since it will make the
                     # ssh keep connected after ctrl-c.
@@ -3171,24 +3327,24 @@
         finally:
             name = handle.cluster_name
             controller = controller_utils.Controllers.from_name(name)
-            if controller == controller_utils.Controllers.SPOT_CONTROLLER:
+            if controller == controller_utils.Controllers.JOBS_CONTROLLER:
                 logger.info(
-                    f'{fore.CYAN}Spot Job ID: '
+                    f'{fore.CYAN}Managed Job ID: '
                     f'{style.BRIGHT}{job_id}{style.RESET_ALL}'
                     '\nTo cancel the job:\t\t'
-                    f'{backend_utils.BOLD}sky spot cancel {job_id}'
+                    f'{backend_utils.BOLD}sky jobs cancel {job_id}'
                     f'{backend_utils.RESET_BOLD}'
                     '\nTo stream job logs:\t\t'
-                    f'{backend_utils.BOLD}sky spot logs {job_id}'
+                    f'{backend_utils.BOLD}sky jobs logs {job_id}'
                     f'{backend_utils.RESET_BOLD}'
                     f'\nTo stream controller logs:\t'
-                    f'{backend_utils.BOLD}sky spot logs --controller {job_id}'
+                    f'{backend_utils.BOLD}sky jobs logs --controller {job_id}'
                     f'{backend_utils.RESET_BOLD}'
-                    '\nTo view all spot jobs:\t\t'
-                    f'{backend_utils.BOLD}sky spot queue'
+                    '\nTo view all managed jobs:\t'
+                    f'{backend_utils.BOLD}sky jobs queue'
                     f'{backend_utils.RESET_BOLD}'
-                    '\nTo view the spot job dashboard:\t'
-                    f'{backend_utils.BOLD}sky spot dashboard'
+                    '\nTo view managed job dashboard:\t'
+                    f'{backend_utils.BOLD}sky jobs dashboard'
                     f'{backend_utils.RESET_BOLD}')
             elif controller is None:
                 logger.info(f'{fore.CYAN}Job ID: '
@@ -3476,15 +3632,7 @@ sync_down_logs(
             logger.info(f'{fore.CYAN}Job {job_id} logs: {log_dir}'
                         f'{style.RESET_ALL}')

-        ip_list = handle.external_ips()
-        assert ip_list is not None, 'external_ips is not cached in handle'
-        ssh_port_list = handle.external_ssh_ports()
-        assert ssh_port_list is not None, 'external_ssh_ports is not cached ' \
-            'in handle'
-        ssh_credentials = backend_utils.ssh_credential_from_yaml(
-            handle.cluster_yaml, handle.docker_user, handle.ssh_user)
-        runners = command_runner.SSHCommandRunner.make_runner_list(
-            ip_list, port_list=ssh_port_list, **ssh_credentials)
+        runners = handle.get_command_runners()

         def _rsync_down(args) -> None:
             """Rsync down logs from remote nodes.
@@ -3496,7 +3644,10 @@ def _rsync_down(args) -> None:
             try:
                 os.makedirs(local_log_dir, exist_ok=True)
                 runner.rsync(
-                    source=f'{remote_log_dir}/*',
+                    # Require a `/` at the end to make sure the parent dir
+                    # is not created locally. We do not add an additional '*'
+                    # as Kubernetes's rsync does not work with an ending '*'.
+                    source=f'{remote_log_dir}/',
                     target=local_log_dir,
                     up=False,
                     stream_logs=False)
@@ -3505,7 +3656,7 @@ def _rsync_down(args) -> None:
                 if e.returncode == exceptions.RSYNC_FILE_NOT_FOUND_CODE:
                     # Raised by rsync_down. Remote log dir may not exist, since
                     # the job can be run on some part of the nodes.
- logger.debug(f'{runner.ip} does not have the tasks/*.') + logger.debug(f'{runner.node_id} does not have the tasks/*.') else: raise @@ -3518,12 +3669,20 @@ def _rsync_down(args) -> None: def tail_logs(self, handle: CloudVmRayResourceHandle, job_id: Optional[int], - spot_job_id: Optional[int] = None, + managed_job_id: Optional[int] = None, follow: bool = True) -> int: + """Tail the logs of a job. + + Args: + handle: The handle to the cluster. + job_id: The job ID to tail the logs of. + managed_job_id: The managed job ID for display purpose only. + follow: Whether to follow the logs. + """ code = job_lib.JobLibCodeGen.tail_logs(job_id, - spot_job_id=spot_job_id, + managed_job_id=managed_job_id, follow=follow) - if job_id is None and spot_job_id is None: + if job_id is None and managed_job_id is None: logger.info( 'Job ID not provided. Streaming the logs of the latest job.') @@ -3550,17 +3709,16 @@ def tail_logs(self, returncode = e.code return returncode - def tail_spot_logs(self, - handle: CloudVmRayResourceHandle, - job_id: Optional[int] = None, - job_name: Optional[str] = None, - follow: bool = True) -> None: + def tail_managed_job_logs(self, + handle: CloudVmRayResourceHandle, + job_id: Optional[int] = None, + job_name: Optional[str] = None, + controller: bool = False, + follow: bool = True) -> None: # if job_name is not None, job_id should be None assert job_name is None or job_id is None, (job_name, job_id) - if job_name is not None: - code = spot_lib.SpotCodeGen.stream_logs_by_name(job_name, follow) - else: - code = spot_lib.SpotCodeGen.stream_logs_by_id(job_id, follow) + code = managed_jobs.ManagedJobCodeGen.stream_logs( + job_name, job_id, follow, controller) # With the stdin=subprocess.DEVNULL, the ctrl-c will not directly # kill the process, so we need to handle it manually here. @@ -3676,7 +3834,7 @@ def teardown_no_lock(self, # even when the command was executed successfully. self.run_on_head(handle, f'{constants.SKY_RAY_CMD} stop --force') - except exceptions.FetchIPError: + except exceptions.FetchClusterInfoError: # This error is expected if the previous cluster IP is # failed to be found, # i.e., the cluster is already stopped/terminated. @@ -3953,6 +4111,14 @@ def post_teardown_cleanup(self, pass except exceptions.PortDoesNotExistError: logger.debug('Ports do not exist. Skipping cleanup.') + except Exception as e: # pylint: disable=broad-except + if purge: + logger.warning( + f'Failed to cleanup ports. Skipping since purge is ' + f'set. Details: ' + f'{common_utils.format_exception(e, use_bracket=True)}') + else: + raise # The cluster file must exist because the cluster_yaml will only # be removed after the cluster entry in the database is removed. @@ -3991,6 +4157,16 @@ def set_autostop(self, # The core.autostop() function should have already checked that the # cloud and resources support requested autostop. if idle_minutes_to_autostop is not None: + # Skip auto-stop for Kubernetes clusters. + if (isinstance(handle.launched_resources.cloud, clouds.Kubernetes) + and not down and idle_minutes_to_autostop >= 0): + # We should hit this code path only for the controllers on + # Kubernetes clusters. + assert (controller_utils.Controllers.from_name( + handle.cluster_name) is not None), handle.cluster_name + logger.info('Auto-stop is not supported for Kubernetes ' + 'clusters. 
Skipping.')
+                return

             # Check if we're stopping spot
             assert (handle.launched_resources is not None and
@@ -4053,6 +4229,7 @@ def run_on_head(
             require_outputs: bool = False,
             separate_stderr: bool = False,
             process_stream: bool = True,
+            source_bashrc: bool = False,
             **kwargs,
     ) -> Union[int, Tuple[int, str, str]]:
         """Runs 'cmd' on the cluster's head node.
@@ -4078,6 +4255,9 @@ def run_on_head(
             process_stream: Whether to post-process the stdout/stderr of the
                 command, such as replacing or skipping lines on the fly. If
                 enabled, lines are printed only when '\r' or '\n' is found.
+            source_bashrc: Whether to source bashrc when running the command
+                on the VM. For user-related commands, it is always good to
+                source bashrc to make sure the env vars are set.

         Returns:
             returncode
@@ -4085,24 +4265,17 @@ def run_on_head(
             A tuple of (returncode, stdout, stderr).

         Raises:
-            exceptions.FetchIPError: If the head node IP cannot be fetched.
+            exceptions.FetchClusterInfoError: If the cluster info cannot be
+                fetched.
         """
         # This will try to fetch the head node IP if it is not cached.
-        external_ips = handle.external_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS)
-        head_ip = external_ips[0]
-        external_ssh_ports = handle.external_ssh_ports(
-            max_attempts=_FETCH_IP_MAX_ATTEMPTS)
-        head_ssh_port = external_ssh_ports[0]
-        ssh_credentials = backend_utils.ssh_credential_from_yaml(
-            handle.cluster_yaml, handle.docker_user, handle.ssh_user)
-        runner = command_runner.SSHCommandRunner(head_ip,
-                                                 port=head_ssh_port,
-                                                 **ssh_credentials)
+        runners = handle.get_command_runners()
+        head_runner = runners[0]
         if under_remote_workdir:
             cmd = f'cd {SKY_REMOTE_WORKDIR} && {cmd}'

-        return runner.run(
+        return head_runner.run(
             cmd,
             port_forward=port_forward,
             log_path=log_path,
@@ -4111,6 +4284,7 @@ def run_on_head(
             ssh_mode=ssh_mode,
             require_outputs=require_outputs,
             separate_stderr=separate_stderr,
+            source_bashrc=source_bashrc,
             **kwargs,
         )

@@ -4254,13 +4428,7 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle,
         style = colorama.Style
         logger.info(f'{fore.CYAN}Processing file mounts.{style.RESET_ALL}')
         start = time.time()
-        ip_list = handle.external_ips()
-        port_list = handle.external_ssh_ports()
-        assert ip_list is not None, 'external_ips is not cached in handle'
-        ssh_credentials = backend_utils.ssh_credential_from_yaml(
-            handle.cluster_yaml, handle.docker_user, handle.ssh_user)
-        runners = command_runner.SSHCommandRunner.make_runner_list(
-            ip_list, port_list=port_list, **ssh_credentials)
+        runners = handle.get_command_runners()
         log_path = os.path.join(self.log_dir, 'file_mounts.log')

         # Check the files and warn
@@ -4356,12 +4524,21 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle,
                 action_message='Syncing',
                 log_path=log_path,
                 stream_logs=False,
+                # Need to source bashrc, as the cloud specific CLI or SDK may
+                # require PATH in bashrc.
+                source_bashrc=True,
             )
         # (2) Run the commands to create symlinks on all the nodes.
         symlink_command = ' && '.join(symlink_commands)
         if symlink_command:
-
-            def _symlink_node(runner: command_runner.SSHCommandRunner):
+            # ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD sets sudo to an empty string
+            # for root. We need this as we do not source bashrc for the
+            # command (for better performance), and our sudo handling is only
+            # in bashrc.
+ symlink_command = ( + f'{command_runner.ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD} && ' + f'{symlink_command}') + + def _symlink_node(runner: command_runner.CommandRunner): returncode = runner.run(symlink_command, log_path=log_path) subprocess_utils.handle_returncode( returncode, symlink_command, @@ -4401,13 +4578,7 @@ def _execute_storage_mounts( logger.info(f'{fore.CYAN}Processing {len(storage_mounts)} ' f'storage mount{plural}.{style.RESET_ALL}') start = time.time() - ip_list = handle.external_ips() - port_list = handle.external_ssh_ports() - assert ip_list is not None, 'external_ips is not cached in handle' - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, handle.ssh_user) - runners = command_runner.SSHCommandRunner.make_runner_list( - ip_list, port_list=port_list, **ssh_credentials) + runners = handle.get_command_runners() log_path = os.path.join(self.log_dir, 'storage_mounts.log') for dst, storage_obj in storage_mounts.items(): @@ -4438,6 +4609,9 @@ def _execute_storage_mounts( run_rsync=False, action_message='Mounting', log_path=log_path, + # Need to source bashrc, as the cloud specific CLI or SDK + # may require PATH in bashrc. + source_bashrc=True, ) except exceptions.CommandError as e: if e.returncode == exceptions.MOUNT_PATH_NON_EMPTY_CODE: @@ -4546,8 +4720,8 @@ def _get_task_env_vars(self, task: task_lib.Task, job_id: int, handle: CloudVmRayResourceHandle) -> Dict[str, str]: """Returns the environment variables for the task.""" env_vars = task.envs.copy() - # If it is a managed spot job, the TASK_ID_ENV_VAR will have been - # already set by the controller. + # If it is a managed job, the TASK_ID_ENV_VAR will have been already set + # by the controller. if constants.TASK_ID_ENV_VAR not in env_vars: env_vars[ constants.TASK_ID_ENV_VAR] = common_utils.get_global_job_id( @@ -4599,7 +4773,7 @@ def _execute_task_one_node(self, handle: CloudVmRayResourceHandle, codegen.build(), job_id, detach_run=detach_run, - spot_dag=task.spot_dag) + managed_job_dag=task.managed_job_dag) def _execute_task_n_nodes(self, handle: CloudVmRayResourceHandle, task: task_lib.Task, job_id: int, @@ -4654,4 +4828,4 @@ def _execute_task_n_nodes(self, handle: CloudVmRayResourceHandle, codegen.build(), job_id, detach_run=detach_run, - spot_dag=task.spot_dag) + managed_job_dag=task.managed_job_dag) diff --git a/sky/benchmark/benchmark_utils.py b/sky/benchmark/benchmark_utils.py index 9fd0c529453..e1323bb714a 100644 --- a/sky/benchmark/benchmark_utils.py +++ b/sky/benchmark/benchmark_utils.py @@ -172,7 +172,7 @@ def _create_benchmark_bucket() -> Tuple[str, str]: raise_if_no_cloud_access=True) # Already checked by raise_if_no_cloud_access=True. assert enabled_clouds - bucket_type = data.StoreType.from_cloud(enabled_clouds[0]) + bucket_type = data.StoreType.from_cloud(enabled_clouds[0]).value # Create a benchmark bucket. 
logger.info(f'Creating a bucket {bucket_name} to save the benchmark logs.') diff --git a/sky/check.py b/sky/check.py index 9c28c43ec53..e8a61317d63 100644 --- a/sky/check.py +++ b/sky/check.py @@ -1,26 +1,33 @@ """Credential checks: check cloud credentials and enable clouds.""" import traceback -from typing import Dict, Iterable, List, Optional, Tuple +from types import ModuleType +from typing import Dict, Iterable, List, Optional, Tuple, Union import click import colorama import rich -from sky import clouds +from sky import clouds as sky_clouds from sky import exceptions from sky import global_user_state +from sky import skypilot_config from sky.adaptors import cloudflare from sky.utils import ux_utils -# TODO(zhwu): add check for a single cloud to improve performance -def check(quiet: bool = False, verbose: bool = False) -> None: +def check( + quiet: bool = False, + verbose: bool = False, + clouds: Optional[Iterable[str]] = None, +) -> None: echo = (lambda *_args, **_kwargs: None) if quiet else click.echo echo('Checking credentials to enable clouds for SkyPilot.') - enabled_clouds = [] + disabled_clouds = [] - def check_one_cloud(cloud_tuple: Tuple[str, clouds.Cloud]) -> None: + def check_one_cloud( + cloud_tuple: Tuple[str, Union[sky_clouds.Cloud, + ModuleType]]) -> None: cloud_repr, cloud = cloud_tuple echo(f' Checking {cloud_repr}...', nl=False) try: @@ -43,52 +50,117 @@ def check_one_cloud(cloud_tuple: Tuple[str, clouds.Cloud]) -> None: if reason is not None: echo(f' Hint: {reason}') else: + disabled_clouds.append(cloud_repr) echo(f' Reason: {reason}') + def get_cloud_tuple( + cloud_name: str) -> Tuple[str, Union[sky_clouds.Cloud, ModuleType]]: + # Validates cloud_name and returns a tuple of the cloud's name and + # the cloud object. Includes special handling for Cloudflare. + if cloud_name.lower().startswith('cloudflare'): + return cloudflare.SKY_CHECK_NAME, cloudflare + else: + cloud_obj = sky_clouds.CLOUD_REGISTRY.from_str(cloud_name) + assert cloud_obj is not None, f'Cloud {cloud_name!r} not found' + return repr(cloud_obj), cloud_obj + + def get_all_clouds(): + return tuple([repr(c) for c in sky_clouds.CLOUD_REGISTRY.values()] + + [cloudflare.SKY_CHECK_NAME]) + + if clouds is not None: + cloud_list = clouds + else: + cloud_list = get_all_clouds() + clouds_to_check = [get_cloud_tuple(c) for c in cloud_list] + + # Use allowed_clouds from config if it exists, otherwise check all clouds. + # Also validate names with get_cloud_tuple. + config_allowed_cloud_names = [ + get_cloud_tuple(c)[0] for c in skypilot_config.get_nested( + ['allowed_clouds'], get_all_clouds()) + ] + # Use disallowed_cloud_names for logging the clouds that will be disabled + # because they are not included in allowed_clouds in config.yaml. + disallowed_cloud_names = [ + c for c in get_all_clouds() if c not in config_allowed_cloud_names + ] + # Check only the clouds which are allowed in the config. clouds_to_check = [ - (repr(cloud), cloud) for cloud in clouds.CLOUD_REGISTRY.values() + c for c in clouds_to_check if c[0] in config_allowed_cloud_names ] - clouds_to_check.append(('Cloudflare, for R2 object store', cloudflare)) for cloud_tuple in sorted(clouds_to_check): check_one_cloud(cloud_tuple) - # Cloudflare is not a real cloud in clouds.CLOUD_REGISTRY, and should not be - # inserted into the DB (otherwise `sky launch` and other code would error - # out when it's trying to look it up in the registry). 
- enabled_clouds = [ + # Cloudflare is not a real cloud in sky_clouds.CLOUD_REGISTRY, and should + # not be inserted into the DB (otherwise `sky launch` and other code would + # error out when it's trying to look it up in the registry). + enabled_clouds_set = { cloud for cloud in enabled_clouds if not cloud.startswith('Cloudflare') - ] - global_user_state.set_enabled_clouds(enabled_clouds) - - if len(enabled_clouds) == 0: + } + disabled_clouds_set = { + cloud for cloud in disabled_clouds if not cloud.startswith('Cloudflare') + } + config_allowed_clouds_set = { + cloud for cloud in config_allowed_cloud_names + if not cloud.startswith('Cloudflare') + } + previously_enabled_clouds_set = { + repr(cloud) for cloud in global_user_state.get_cached_enabled_clouds() + } + + # Determine the set of enabled clouds: (previously enabled clouds + newly + # enabled clouds - newly disabled clouds) intersected with + # config_allowed_clouds, if specified in config.yaml. + # This means that if a cloud is already enabled and is not included in + # allowed_clouds in config.yaml, it will be disabled. + all_enabled_clouds = (config_allowed_clouds_set & ( + (previously_enabled_clouds_set | enabled_clouds_set) - + disabled_clouds_set)) + global_user_state.set_enabled_clouds(list(all_enabled_clouds)) + + disallowed_clouds_hint = None + if disallowed_cloud_names: + disallowed_clouds_hint = ( + '\nNote: The following clouds were disabled because they were not ' + 'included in allowed_clouds in ~/.sky/config.yaml: ' + f'{", ".join([c for c in disallowed_cloud_names])}') + if len(all_enabled_clouds) == 0: echo( click.style( 'No cloud is enabled. SkyPilot will not be able to run any ' 'task. Run `sky check` for more info.', fg='red', bold=True)) + if disallowed_clouds_hint: + echo(click.style(disallowed_clouds_hint, dim=True)) raise SystemExit() else: + clouds_arg = (' ' + + ' '.join(disabled_clouds) if clouds is not None else '') echo( click.style( '\nTo enable a cloud, follow the hints above and rerun: ', - dim=True) + click.style('sky check', bold=True) + '\n' + - click.style( + dim=True) + click.style(f'sky check{clouds_arg}', bold=True) + + '\n' + click.style( 'If any problems remain, refer to detailed docs at: ' 'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html', # pylint: disable=line-too-long dim=True)) + if disallowed_clouds_hint: + echo(click.style(disallowed_clouds_hint, dim=True)) + # Pretty print for UX. if not quiet: enabled_clouds_str = '\n :heavy_check_mark: '.join( - [''] + sorted(enabled_clouds)) + [''] + sorted(all_enabled_clouds)) rich.print('\n[green]:tada: Enabled clouds :tada:' f'{enabled_clouds_str}[/green]') def get_cached_enabled_clouds_or_refresh( - raise_if_no_cloud_access: bool = False) -> List[clouds.Cloud]: + raise_if_no_cloud_access: bool = False) -> List[sky_clouds.Cloud]: """Returns cached enabled clouds and if no cloud is enabled, refresh. This function will perform a refresh if no public cloud is enabled. @@ -120,7 +192,8 @@ def get_cached_enabled_clouds_or_refresh( def get_cloud_credential_file_mounts( - excluded_clouds: Optional[Iterable[clouds.Cloud]]) -> Dict[str, str]: + excluded_clouds: Optional[Iterable[sky_clouds.Cloud]] +) -> Dict[str, str]: """Returns the files necessary to access all enabled clouds. 
Returns a dictionary that will be added to a task's file mounts @@ -130,7 +203,7 @@ def get_cloud_credential_file_mounts( file_mounts = {} for cloud in enabled_clouds: if (excluded_clouds is not None and - clouds.cloud_in_list(cloud, excluded_clouds)): + sky_clouds.cloud_in_iterable(cloud, excluded_clouds)): continue cloud_file_mounts = cloud.get_credential_file_mounts() file_mounts.update(cloud_file_mounts) diff --git a/sky/cli.py b/sky/cli.py index 72667cffc97..db5291d949c 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -47,14 +47,13 @@ import sky from sky import backends from sky import check as sky_check -from sky import clouds +from sky import clouds as sky_clouds from sky import core from sky import exceptions from sky import global_user_state -from sky import provision as provision_lib +from sky import jobs as managed_jobs from sky import serve as serve_lib from sky import sky_logging -from sky import spot as spot_lib from sky import status_lib from sky.adaptors import common as adaptors_common from sky.backends import backend_utils @@ -91,9 +90,9 @@ provision a new cluster with that name. Otherwise provision a new cluster with an autogenerated name.""" -# The maximum number of in-progress spot jobs to show in the status +# The maximum number of in-progress managed jobs to show in the status # command. -_NUM_SPOT_JOBS_TO_SHOW_IN_STATUS = 5 +_NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS = 5 _STATUS_PROPERTY_CLUSTER_NUM_ERROR_MESSAGE = ( '{cluster_num} cluster{plural} {verb}. Please specify {cause} ' @@ -103,7 +102,7 @@ 'please retry after a while.') _DAG_NOT_SUPPORTED_MESSAGE = ('YAML specifies a DAG which is only supported by ' - '`sky spot launch`. `{command}` supports a ' + '`sky jobs launch`. `{command}` supports a ' 'single task only.') @@ -480,7 +479,7 @@ def _parse_override_params( if cloud.lower() == 'none': override_params['cloud'] = None else: - override_params['cloud'] = clouds.CLOUD_REGISTRY.from_str(cloud) + override_params['cloud'] = sky_clouds.CLOUD_REGISTRY.from_str(cloud) if region is not None: if region.lower() == 'none': override_params['region'] = None @@ -566,9 +565,10 @@ def _launch_with_confirm( raise_if_no_cloud_access=True) except exceptions.NoCloudAccessError as e: # Catch the exception where the public cloud is not enabled, and - # only print the error message without the error type. - click.secho(e, fg='yellow') - sys.exit(1) + # make it yellow for better visibility. + with ux_utils.print_exception_no_traceback(): + raise RuntimeError(f'{colorama.Fore.YELLOW}{e}' + f'{colorama.Style.RESET_ALL}') from e dag = sky.optimize(dag) task = dag.tasks[0] @@ -688,7 +688,7 @@ def _pop_and_ignore_fields_in_override_params( def _make_task_or_dag_from_entrypoint_with_overrides( - entrypoint: List[str], + entrypoint: Tuple[str, ...], *, entrypoint_name: str = 'Task', name: Optional[str] = None, @@ -708,8 +708,8 @@ def _make_task_or_dag_from_entrypoint_with_overrides( ports: Optional[Tuple[str]] = None, env: Optional[List[Tuple[str, str]]] = None, field_to_ignore: Optional[List[str]] = None, - # spot launch specific - spot_recovery: Optional[str] = None, + # job launch specific + job_recovery: Optional[str] = None, ) -> Union[sky.Task, sky.Dag]: """Creates a task or a dag from an entrypoint with overrides. @@ -772,14 +772,16 @@ def _make_task_or_dag_from_entrypoint_with_overrides( else: task = sky.Task(name='sky-cmd', run=entrypoint) task.set_resources({sky.Resources()}) + # env update has been done for DAG in load_chain_dag_from_yaml for YAML. + task.update_envs(env) # Override. 
if workdir is not None: task.workdir = workdir - # Spot launch specific. - if spot_recovery is not None: - override_params['spot_recovery'] = spot_recovery + # job launch specific. + if job_recovery is not None: + override_params['job_recovery'] = job_recovery task.set_resources_override(override_params) @@ -787,7 +789,6 @@ def _make_task_or_dag_from_entrypoint_with_overrides( task.num_nodes = num_nodes if name is not None: task.name = name - task.update_envs(env) return task @@ -816,13 +817,30 @@ def get_help(self, ctx): return super().get_help(ctx) -def _with_deprecation_warning(f, original_name, alias_name): +def _with_deprecation_warning( + f, + original_name: str, + alias_name: str, + override_command_argument: Optional[Dict[str, Any]] = None): @functools.wraps(f) def wrapper(self, *args, **kwargs): + override_str = '' + if override_command_argument is not None: + overrides = [] + for k, v in override_command_argument.items(): + if isinstance(v, bool): + if v: + overrides.append(f'--{k}') + else: + overrides.append(f'--no-{k}') + else: + overrides.append(f'--{k.replace("_", "-")}={v}') + override_str = ' with additional arguments ' + ' '.join(overrides) click.secho( - f'WARNING: `{alias_name}` is deprecated and will be removed in a ' - f'future release. Please use `{original_name}` instead.\n', + f'WARNING: `{alias_name}` has been renamed to `{original_name}` ' + f'and will be removed in a future release. Please use the ' + f'latter{override_str} instead.\n', err=True, fg='yellow') return f(self, *args, **kwargs) @@ -830,17 +848,49 @@ def wrapper(self, *args, **kwargs): return wrapper -def _add_command_alias_to_group(group, command, name, hidden): +def _override_arguments(callback, override_command_argument: Dict[str, Any]): + + def wrapper(*args, **kwargs): + logger.info(f'Overriding arguments: {override_command_argument}') + kwargs.update(override_command_argument) + return callback(*args, **kwargs) + + return wrapper + + +def _add_command_alias( + group: click.Group, + command: click.Command, + hidden: bool = False, + new_group: Optional[click.Group] = None, + new_command_name: Optional[str] = None, + override_command_argument: Optional[Dict[str, Any]] = None, + with_warning: bool = True, +) -> None: """Add a alias of a command to a group.""" + if new_group is None: + new_group = group + if new_command_name is None: + new_command_name = command.name + if new_group == group and new_command_name == command.name: + raise ValueError('Cannot add an alias to the same command.') new_command = copy.deepcopy(command) new_command.hidden = hidden - new_command.name = name + new_command.name = new_command_name + + if override_command_argument: + new_command.callback = _override_arguments(new_command.callback, + override_command_argument) orig = f'sky {group.name} {command.name}' - alias = f'sky {group.name} {name}' - new_command.invoke = _with_deprecation_warning(new_command.invoke, orig, - alias) - group.add_command(new_command, name=name) + alias = f'sky {new_group.name} {new_command_name}' + if with_warning: + new_command.invoke = _with_deprecation_warning( + new_command.invoke, + orig, + alias, + override_command_argument=override_command_argument) + new_group.add_command(new_command, name=new_command_name) def _deprecate_and_hide_command(group, command_to_deprecate, @@ -980,7 +1030,7 @@ def cli(): 'the same data on the boot disk as an existing cluster.')) @usage_lib.entrypoint def launch( - entrypoint: List[str], + entrypoint: Tuple[str, ...], cluster: Optional[str], dryrun: bool, 
detach_setup: bool, @@ -1082,11 +1132,19 @@ def launch( @cli.command(cls=_DocumentedCodeCommand) @click.argument('cluster', - required=True, + required=False, type=str, **_get_shell_complete_args(_complete_cluster_name)) +@click.option( + '--cluster', + '-c', + 'cluster_option', + hidden=True, + type=str, + help='This is the same as the positional argument, just for consistency.', + **_get_shell_complete_args(_complete_cluster_name)) @click.argument('entrypoint', - required=True, + required=False, type=str, nargs=-1, **_get_shell_complete_args(_complete_file_name)) @@ -1101,8 +1159,9 @@ def launch( @usage_lib.entrypoint # pylint: disable=redefined-builtin def exec( - cluster: str, - entrypoint: List[str], + cluster: Optional[str], + cluster_option: Optional[str], + entrypoint: Tuple[str, ...], detach_run: bool, name: Optional[str], cloud: Optional[str], @@ -1180,6 +1239,17 @@ def exec( sky exec mycluster --env WANDB_API_KEY python train_gpu.py """ + if cluster_option is None and cluster is None: + raise click.UsageError('Missing argument \'[CLUSTER]\' and ' + '\'[ENTRYPOINT]...\'') + if cluster_option is not None: + if cluster is not None: + entrypoint = (cluster,) + entrypoint + cluster = cluster_option + if not entrypoint: + raise click.UsageError('Missing argument \'[ENTRYPOINT]...\'') + assert cluster is not None, (cluster, cluster_option, entrypoint) + env = _merge_env_vars(env_file, env) controller_utils.check_cluster_name_not_controller( cluster, operation_str='Executing task on it') @@ -1219,30 +1289,30 @@ def exec( sky.exec(task, backend=backend, cluster_name=cluster, detach_run=detach_run) -def _get_spot_jobs( +def _get_managed_jobs( refresh: bool, skip_finished: bool, show_all: bool, limit_num_jobs_to_show: bool = False, is_called_by_user: bool = False) -> Tuple[Optional[int], str]: - """Get the in-progress spot jobs. + """Get the in-progress managed jobs. Args: - refresh: Query the latest statuses, restarting the spot controller if + refresh: Query the latest statuses, restarting the jobs controller if stopped. skip_finished: Show only in-progress jobs. - show_all: Show all information of each spot job (e.g., region, price). + show_all: Show all information of each job (e.g., region, price). limit_num_jobs_to_show: If True, limit the number of jobs to show to - _NUM_SPOT_JOBS_TO_SHOW_IN_STATUS, which is mainly used by + _NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS, which is mainly used by `sky status`. is_called_by_user: If this function is called by user directly, or an internal call. Returns: A tuple of (num_in_progress_jobs, msg). If num_in_progress_jobs is None, - it means there is an error when querying the spot jobs. In this case, + it means there is an error when querying the managed jobs. In this case, msg contains the error message. Otherwise, msg contains the formatted - spot job table. + managed job table. 
""" num_in_progress_jobs = None try: @@ -1250,32 +1320,51 @@ def _get_spot_jobs( usage_lib.messages.usage.set_internal() with sky_logging.silent(): # Make the call silent - spot_jobs = spot_lib.queue(refresh=refresh, - skip_finished=skip_finished) - num_in_progress_jobs = len(spot_jobs) + managed_jobs_ = managed_jobs.queue(refresh=refresh, + skip_finished=skip_finished) + num_in_progress_jobs = len(set(job['job_id'] for job in managed_jobs_)) except exceptions.ClusterNotUpError as e: controller_status = e.cluster_status msg = str(e) if controller_status is None: - msg += (f' (See: {colorama.Style.BRIGHT}sky spot -h' + msg += (f' (See: {colorama.Style.BRIGHT}sky jobs -h' f'{colorama.Style.RESET_ALL})') elif (controller_status == status_lib.ClusterStatus.STOPPED and is_called_by_user): - msg += (f' (See finished jobs: {colorama.Style.BRIGHT}' - f'sky spot queue --refresh{colorama.Style.RESET_ALL})') + msg += (f' (See finished managed jobs: {colorama.Style.BRIGHT}' + f'sky jobs queue --refresh{colorama.Style.RESET_ALL})') except RuntimeError as e: - msg = ('Failed to query spot jobs due to connection ' - 'issues. Try again later. ' - f'Details: {common_utils.format_exception(e, use_bracket=True)}') + msg = '' + try: + # Check the controller status again, as the RuntimeError is likely + # due to the controller being autostopped when querying the jobs. + controller_type = controller_utils.Controllers.JOBS_CONTROLLER + record = backend_utils.refresh_cluster_record( + controller_type.value.cluster_name, + cluster_status_lock_timeout=0) + if (record is None or + record['status'] == status_lib.ClusterStatus.STOPPED): + msg = controller_type.value.default_hint_if_non_existent + except Exception: # pylint: disable=broad-except + # This is to an best effort to find the latest controller status to + # print more helpful message, so we can ignore any exception to + # print the original error. + pass + if not msg: + msg = ( + 'Failed to query managed jobs due to connection ' + 'issues. Try again later. ' + f'Details: {common_utils.format_exception(e, use_bracket=True)}' + ) except Exception as e: # pylint: disable=broad-except - msg = ('Failed to query spot jobs: ' + msg = ('Failed to query managed jobs: ' f'{common_utils.format_exception(e, use_bracket=True)}') else: - max_jobs_to_show = (_NUM_SPOT_JOBS_TO_SHOW_IN_STATUS + max_jobs_to_show = (_NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS if limit_num_jobs_to_show else None) - msg = spot_lib.format_job_table(spot_jobs, - show_all=show_all, - max_jobs=max_jobs_to_show) + msg = managed_jobs.format_job_table(managed_jobs_, + show_all=show_all, + max_jobs=max_jobs_to_show) return num_in_progress_jobs, msg @@ -1314,9 +1403,27 @@ def _get_services(service_names: Optional[List[str]], msg += (f' (See: {colorama.Style.BRIGHT}sky serve -h' f'{colorama.Style.RESET_ALL})') except RuntimeError as e: - msg = ('Failed to fetch service statuses due to connection issues. ' - 'Please try again later. Details: ' - f'{common_utils.format_exception(e, use_bracket=True)}') + msg = '' + try: + # Check the controller status again, as the RuntimeError is likely + # due to the controller being autostopped when querying the + # services. 
+            controller_type = controller_utils.Controllers.SKY_SERVE_CONTROLLER
+            record = backend_utils.refresh_cluster_record(
+                controller_type.value.cluster_name,
+                cluster_status_lock_timeout=0)
+            if (record is None or
+                    record['status'] == status_lib.ClusterStatus.STOPPED):
+                msg = controller_type.value.default_hint_if_non_existent
+        except Exception:  # pylint: disable=broad-except
+            # This is a best-effort attempt to find the latest controller
+            # status to print a more helpful message; we ignore any exception
+            # and print the original error.
+            pass
+        if not msg:
+            msg = ('Failed to fetch service statuses due to connection issues. '
+                   'Please try again later. Details: '
+                   f'{common_utils.format_exception(e, use_bracket=True)}')
    except Exception as e:  # pylint: disable=broad-except
        msg = ('Failed to fetch service statuses: '
               f'{common_utils.format_exception(e, use_bracket=True)}')
@@ -1380,11 +1487,11 @@ def _get_services(service_names: Optional[List[str]],
              type=int,
              help=('Get the endpoint URL for the specified port number on the '
                    'cluster. This option will override all other options.'))
-@click.option('--show-spot-jobs/--no-show-spot-jobs',
+@click.option('--show-managed-jobs/--no-show-managed-jobs',
              default=True,
              is_flag=True,
              required=False,
-              help='Also show recent in-progress spot jobs, if any.')
+              help='Also show recent in-progress managed jobs, if any.')
@click.option('--show-services/--no-show-services',
              default=True,
              is_flag=True,
@@ -1398,8 +1505,8 @@ def _get_services(service_names: Optional[List[str]],
@usage_lib.entrypoint
# pylint: disable=redefined-builtin
def status(all: bool, refresh: bool, ip: bool, endpoints: bool,
-           endpoint: Optional[int], show_spot_jobs: bool, show_services: bool,
-           clusters: List[str]):
+           endpoint: Optional[int], show_managed_jobs: bool,
+           show_services: bool, clusters: List[str]):
    # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
    """Show clusters.
@@ -1458,19 +1565,20 @@ def status(all: bool, refresh: bool, ip: bool, endpoints: bool,
     or for autostop-enabled clusters, use ``--refresh`` to query the latest
     cluster statuses from the cloud providers.
     """
-    # Using a pool with 2 worker to run the spot job query and sky serve service
-    # query in parallel to speed up. The pool provides a AsyncResult object that
-    # can be used as a future.
+    # Using a pool with 2 workers to run the managed job query and sky serve
+    # service query in parallel to speed up. The pool provides an AsyncResult
+    # object that can be used as a future.
     with multiprocessing.Pool(2) as pool:
-        # Do not show spot queue if user specifies clusters, and if user
+        # Do not show job queue if user specifies clusters, and if user
         # specifies --ip or --endpoint(s).
-        show_spot_jobs = show_spot_jobs and not any([clusters, ip, endpoints])
+        show_managed_jobs = show_managed_jobs and not any(
+            [clusters, ip, endpoints])
         show_endpoints = endpoints or endpoint is not None
         show_single_endpoint = endpoint is not None
-        if show_spot_jobs:
-            # Run the spot job query in parallel to speed up the status query.
-            spot_jobs_future = pool.apply_async(
-                _get_spot_jobs,
+        if show_managed_jobs:
+            # Run the managed job query in parallel to speed up the status
+            # query.
+ managed_jobs_future = pool.apply_async( + _get_managed_jobs, kwds=dict(refresh=False, skip_finished=True, show_all=False, @@ -1563,71 +1671,28 @@ def status(all: bool, refresh: bool, ip: bool, endpoints: bool, head_ip = handle.external_ips()[0] if show_endpoints: - launched_resources = handle.launched_resources - cloud = launched_resources.cloud - try: - cloud.check_features_are_supported( - launched_resources, - {clouds.CloudImplementationFeatures.OPEN_PORTS}) - except exceptions.NotSupportedError: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Querying endpoints is not supported ' - f'for {cloud}.') from None - - config = common_utils.read_yaml(handle.cluster_yaml) - port_details = provision_lib.query_ports( - repr(cloud), handle.cluster_name_on_cloud, - handle.launched_resources.ports, config['provider']) - - if endpoint is not None: - # If cluster had no ports to be exposed - ports_set = resources_utils.port_ranges_to_set( - handle.launched_resources.ports) - if endpoint not in ports_set: - with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Port {endpoint} is not exposed ' - 'on cluster ' - f'{cluster_record["name"]!r}.') - # If the user requested a specific port endpoint - if endpoint not in port_details: - error_msg = (f'Port {endpoint} not exposed yet. ' - f'{_ENDPOINTS_RETRY_MESSAGE} ') - if handle.launched_resources.cloud.is_same_cloud( - clouds.Kubernetes()): - # Add Kubernetes specific debugging info - error_msg += ( - kubernetes_utils.get_endpoint_debug_message()) - with ux_utils.print_exception_no_traceback(): - raise RuntimeError(error_msg) - click.echo(port_details[endpoint][0].url(ip=head_ip)) - return - - if not port_details: - # If cluster had no ports to be exposed - if handle.launched_resources.ports is None: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Cluster does not have any ports ' - 'to be exposed.') - # Else wait for the ports to be exposed - else: - error_msg = (f'No endpoints exposed yet. 
' - f'{_ENDPOINTS_RETRY_MESSAGE} ') - if handle.launched_resources.cloud.is_same_cloud( - clouds.Kubernetes()): - # Add Kubernetes specific debugging info - error_msg += \ - kubernetes_utils.get_endpoint_debug_message() - with ux_utils.print_exception_no_traceback(): - raise RuntimeError(error_msg) - - for port, urls in port_details.items(): - click.echo( - f'{colorama.Fore.BLUE}{colorama.Style.BRIGHT}{port}' - f'{colorama.Style.RESET_ALL}: ' - f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'{urls[0].url(ip=head_ip)}{colorama.Style.RESET_ALL}') + if endpoint: + cluster_endpoint = core.endpoints(cluster_record['name'], + endpoint).get( + endpoint, None) + if not cluster_endpoint: + raise click.Abort( + f'Endpoint {endpoint} not found for cluster ' + f'{cluster_record["name"]!r}.') + click.echo(cluster_endpoint) + else: + cluster_endpoints = core.endpoints(cluster_record['name']) + assert isinstance(cluster_endpoints, dict) + if not cluster_endpoints: + raise click.Abort(f'No endpoint found for cluster ' + f'{cluster_record["name"]!r}.') + for port, port_endpoint in cluster_endpoints.items(): + click.echo( + f'{colorama.Fore.BLUE}{colorama.Style.BRIGHT}{port}' + f'{colorama.Style.RESET_ALL}: ' + f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'{port_endpoint}{colorama.Style.RESET_ALL}') return - click.echo(head_ip) return hints = [] @@ -1655,16 +1720,16 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: interrupted = True return interrupted, result - spot_jobs_query_interrupted = False - if show_spot_jobs: + managed_jobs_query_interrupted = False + if show_managed_jobs: click.echo(f'\n{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'Managed spot jobs{colorama.Style.RESET_ALL}') - with rich_utils.safe_status('[cyan]Checking spot jobs[/]'): - spot_jobs_query_interrupted, result = _try_get_future_result( - spot_jobs_future) - if spot_jobs_query_interrupted: + f'Managed jobs{colorama.Style.RESET_ALL}') + with rich_utils.safe_status('[cyan]Checking managed jobs[/]'): + managed_jobs_query_interrupted, result = _try_get_future_result( + managed_jobs_future) + if managed_jobs_query_interrupted: # Set to -1, so that the controller is not considered - # down, and the hint for showing sky spot queue + # down, and the hint for showing sky jobs queue # will still be shown. num_in_progress_jobs = -1 msg = 'KeyboardInterrupt' @@ -1673,29 +1738,30 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: click.echo(msg) if num_in_progress_jobs is not None: - # spot controller is UP. + # jobs controller is UP. job_info = '' if num_in_progress_jobs > 0: plural_and_verb = ' is' if num_in_progress_jobs > 1: plural_and_verb = 's are' job_info = ( - f'{num_in_progress_jobs} spot job{plural_and_verb} ' + f'{num_in_progress_jobs} managed job{plural_and_verb} ' 'in progress') - if num_in_progress_jobs > _NUM_SPOT_JOBS_TO_SHOW_IN_STATUS: + if (num_in_progress_jobs > + _NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS): job_info += ( - f' ({_NUM_SPOT_JOBS_TO_SHOW_IN_STATUS} latest ones ' - 'shown)') + f' ({_NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS} latest ' + 'ones shown)') job_info += '. ' hints.append( - controller_utils.Controllers.SPOT_CONTROLLER.value. + controller_utils.Controllers.JOBS_CONTROLLER.value. in_progress_hint.format(job_info=job_info)) if show_services: click.echo(f'\n{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' f'Services{colorama.Style.RESET_ALL}') num_services = None - if spot_jobs_query_interrupted: + if managed_jobs_query_interrupted: # The pool is terminated, so we cannot run the service query. 
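
[Editor's note] The endpoint branch above replaces the inline provisioning-layer logic with calls to `core.endpoints()`, which, as the diff suggests, returns a `{port: url}` mapping whether or not a specific port is requested. A hedged usage sketch under that assumption; the cluster name is made up:

```python
from sky import core

# All exposed ports on the cluster, e.g. {8080: '3.21.44.10:8080'}.
endpoints = core.endpoints('mycluster')
for port, url in endpoints.items():
    print(f'{port}: {url}')

# A single port; .get() yields None if port 8080 is not exposed.
url = core.endpoints('mycluster', 8080).get(8080)
```
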
msg = 'KeyboardInterrupt' else: @@ -1712,7 +1778,7 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: hints.append(controller_utils.Controllers.SKY_SERVE_CONTROLLER. value.in_progress_hint) - if show_spot_jobs or show_services: + if show_managed_jobs or show_services: try: pool.close() pool.join() @@ -1983,7 +2049,7 @@ def logs( help='Skip confirmation prompt.') @click.argument('jobs', required=False, type=int, nargs=-1) @usage_lib.entrypoint -def cancel(cluster: str, all: bool, jobs: List[int], yes: bool): # pylint: disable=redefined-builtin +def cancel(cluster: str, all: bool, jobs: List[int], yes: bool): # pylint: disable=redefined-builtin, redefined-outer-name # NOTE(dev): Keep the docstring consistent between the Python API and CLI. """Cancel job(s). @@ -2030,16 +2096,16 @@ def cancel(cluster: str, all: bool, jobs: List[int], yes: bool): # pylint: disa try: core.cancel(cluster, all=all, job_ids=job_ids_to_cancel) - except exceptions.NotSupportedError: + except exceptions.NotSupportedError as e: controller = controller_utils.Controllers.from_name(cluster) assert controller is not None, cluster - click.echo(controller.value.decline_cancel_hint) - sys.exit(1) + with ux_utils.print_exception_no_traceback(): + raise click.UsageError(controller.value.decline_cancel_hint) from e except ValueError as e: raise click.UsageError(str(e)) - except exceptions.ClusterNotUpError as e: - click.echo(str(e)) - sys.exit(1) + except exceptions.ClusterNotUpError: + with ux_utils.print_exception_no_traceback(): + raise @cli.command(cls=_DocumentedCodeCommand) @@ -2382,7 +2448,7 @@ def start( if not to_start: return - # Checks for controller clusters (spot controller / sky serve controller). + # Checks for controller clusters (jobs controller / sky serve controller). controllers, normal_clusters = [], [] for name in to_start: if controller_utils.Controllers.from_name(name) is not None: @@ -2501,14 +2567,26 @@ def down( purge=purge) -def _hint_or_raise_for_down_spot_controller(controller_name: str): +def _hint_or_raise_for_down_jobs_controller(controller_name: str): + """Helper function to check job controller status before tearing it down. + + Raises helpful exceptions and errors if the controller is not in a safe + state to be torn down. + + Raises: + RuntimeError: if failed to get the job queue. + exceptions.NotSupportedError: if the controller is not in a safe state + to be torn down (e.g., because it has jobs running or + it is in init state) + """ controller = controller_utils.Controllers.from_name(controller_name) assert controller is not None, controller_name with rich_utils.safe_status( - '[bold cyan]Checking for in-progress spot jobs[/]'): + '[bold cyan]Checking for in-progress managed jobs[/]'): try: - spot_jobs = spot_lib.queue(refresh=False, skip_finished=True) + managed_jobs_ = managed_jobs.queue(refresh=False, + skip_finished=True) except exceptions.ClusterNotUpError as e: if controller.value.connection_error_hint in str(e): with ux_utils.print_exception_no_traceback(): @@ -2517,21 +2595,21 @@ def _hint_or_raise_for_down_spot_controller(controller_name: str): decline_down_when_failed_to_fetch_status_hint) if e.cluster_status is None: click.echo( - 'Managed spot controller has already been torn down.') + 'Managed jobs controller has already been torn down.') sys.exit(0) - # At this point, the spot jobs are failed to be fetched due to the - # controller being STOPPED or being firstly launched, i.e., there is - # no in-prgress spot jobs. 
-            spot_jobs = []
+            # At this point, fetching the managed jobs failed because the
+            # controller is STOPPED or is being launched for the first
+            # time, i.e., there are no in-progress managed jobs.
+            managed_jobs_ = []
    msg = (f'{colorama.Fore.YELLOW}WARNING: Tearing down the managed '
-           'spot controller. Please be aware of the following:'
+           'jobs controller. Please be aware of the following:'
           f'{colorama.Style.RESET_ALL}'
-           '\n * All logs and status information of the spot '
-           'jobs (output of `sky spot queue`) will be lost.')
+           '\n * All logs and status information of the managed '
+           'jobs (output of `sky jobs queue`) will be lost.')
    click.echo(msg)
-    if spot_jobs:
-        job_table = spot_lib.format_job_table(spot_jobs, show_all=False)
+    if managed_jobs_:
+        job_table = managed_jobs.format_job_table(managed_jobs_, show_all=False)
        msg = controller.value.decline_down_for_dirty_controller_hint
        # Add prefix to each line to align with the bullet point.
        msg += '\n'.join(
@@ -2539,11 +2617,22 @@ def _hint_or_raise_for_down_spot_controller(controller_name: str):
        with ux_utils.print_exception_no_traceback():
            raise exceptions.NotSupportedError(msg)
    else:
-        click.echo(' * No in-progress spot jobs found. It should be safe to '
+        click.echo(' * No in-progress managed jobs found. It should be safe to '
                   'terminate (see caveats above).')


def _hint_or_raise_for_down_sky_serve_controller(controller_name: str):
+    """Helper function to check serve controller status before tearing it down.
+
+    Raises helpful exceptions and errors if the controller is not in a safe
+    state to be torn down.
+
+    Raises:
+        RuntimeError: if failed to get the service status.
+        exceptions.NotSupportedError: if the controller is not in a safe state
+            to be torn down (e.g., because it has services running or
+            it is in init state)
+    """
    controller = controller_utils.Controllers.from_name(controller_name)
    assert controller is not None, controller_name
    with rich_utils.safe_status('[bold cyan]Checking for live services[/]'):
@@ -2575,8 +2664,8 @@ def _hint_or_raise_for_down_sky_serve_controller(controller_name: str):


_CONTROLLER_TO_HINT_OR_RAISE = {
-    controller_utils.Controllers.SPOT_CONTROLLER:
-        (_hint_or_raise_for_down_spot_controller),
+    controller_utils.Controllers.JOBS_CONTROLLER:
+        (_hint_or_raise_for_down_jobs_controller),
    controller_utils.Controllers.SKY_SERVE_CONTROLLER:
        (_hint_or_raise_for_down_sky_serve_controller),
}
@@ -2591,9 +2680,9 @@ def _down_or_stop_clusters(
        idle_minutes_to_autostop: Optional[int] = None) -> None:
    """Tears down or (auto-)stops a cluster (or all clusters).

-    Controllers (spot controller and sky serve controller) can only be
+    Controllers (jobs controller and sky serve controller) can only be
    terminated if the cluster name is explicitly and uniquely specified (not
-    via glob) and purge is set to True.
+    via glob).
    """
    if down:
        command = 'down'
@@ -2662,12 +2751,13 @@ def _down_or_stop_clusters(
                    # TODO(zhwu): This hint or raise is not transactional, which
                    # means even if it passed the check with no in-progress spot
                    # or service and prompt the confirmation for termination,
-                    # a user could still do a `sky spot launch` or a
+                    # a user could still do a `sky jobs launch` or a
                    # `sky serve up` before typing the delete, causing a leaked
-                    # spot job or service. We should make this check atomic with
-                    # the termination.
+                    # managed job or service. We should make this check atomic
+                    # with the termination.
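
[Editor's note] Both `_hint_or_raise_for_down_*_controller` helpers implement the same teardown guard: list in-flight work, raise `NotSupportedError` with a hint if any is found, otherwise print an all-clear. As the TODO above notes, the check is not atomic with the termination (a classic time-of-check/time-of-use gap), so it fails fast rather than guaranteeing safety. A condensed sketch; `list_in_progress` is a hypothetical stand-in for `managed_jobs.queue(refresh=False, skip_finished=True)`:

```python
from typing import Callable, List


class NotSupportedError(Exception):
    pass


def hint_or_raise_before_down(list_in_progress: Callable[[], List[dict]]):
    jobs = list_in_progress()
    if jobs:
        raise NotSupportedError(
            f'{len(jobs)} in-progress managed job(s) found. Cancel them '
            'before tearing down the controller.')
    print(' * No in-progress managed jobs found. It should be safe to '
          'terminate (see caveats above).')
```
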
hint_or_raise(controller_name) - except exceptions.ClusterOwnerIdentityMismatchError as e: + except (exceptions.ClusterOwnerIdentityMismatchError, + RuntimeError) as e: if purge: click.echo(common_utils.format_exception(e)) else: @@ -2794,24 +2884,38 @@ def _down_or_stop(name: str): progress.refresh() -@cli.command() +@cli.command(cls=_DocumentedCodeCommand) +@click.argument('clouds', required=False, type=str, nargs=-1) @click.option('--verbose', '-v', is_flag=True, default=False, help='Show the activated account for each cloud.') @usage_lib.entrypoint -def check(verbose: bool): +def check(clouds: Tuple[str], verbose: bool): """Check which clouds are available to use. This checks access credentials for all clouds supported by SkyPilot. If a cloud is detected to be inaccessible, the reason and correction steps will be shown. + If CLOUDS are specified, checks credentials for only those clouds. + The enabled clouds are cached and form the "search space" to be considered for each task. + + Examples: + + .. code-block:: bash + + # Check credentials for all supported clouds. + sky check + \b + # Check only specific clouds - AWS and GCP. + sky check aws gcp """ - sky_check.check(verbose=verbose) + clouds_arg = clouds if len(clouds) > 0 else None + sky_check.check(verbose=verbose, clouds=clouds_arg) @cli.command() @@ -2863,6 +2967,15 @@ def show_gpus( To show all regions for a specified accelerator, use ``sky show-gpus --all-regions``. + If ``--region`` or ``--all-regions`` is not specified, the price displayed + for each instance type is the lowest across all regions for both on-demand + and spot instances. There may be multiple regions with the same lowest + price. + + If ``--cloud kubernetes`` is specified, it will show the maximum quantities + of the GPU available on a single node and the real-time availability of + the GPU across all nodes in the Kubernetes cluster. + Definitions of certain fields: * ``DEVICE_MEM``: Memory of a single device; does not depend on the device @@ -2870,10 +2983,15 @@ def show_gpus( * ``HOST_MEM``: Memory of the host instance (VM). - If ``--region`` or ``--all-regions`` is not specified, the price displayed - for each instance type is the lowest across all regions for both on-demand - and spot instances. There may be multiple regions with the same lowest - price. + * ``QTY_PER_NODE`` (Kubernetes only): GPU quantities that can be requested + on a single node. + + * ``TOTAL_GPUS`` (Kubernetes only): Total number of GPUs available in the + Kubernetes cluster. + + * ``TOTAL_FREE_GPUS`` (Kubernetes only): Number of currently free GPUs + in the Kubernetes cluster. This is fetched in real-time and may change + when other users are using the cluster. """ # validation for the --region flag if region is not None and cloud is None: @@ -2890,15 +3008,70 @@ def show_gpus( '--all-regions and --region flags cannot be used simultaneously.') # This will validate 'cloud' and raise if not found. 
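
[Editor's note] The new optional CLOUDS argument to `sky check` maps directly onto the `clouds=` parameter this diff adds to `sky_check.check()`; `None` keeps the old check-everything behavior. A hedged sketch of the equivalent Python calls, assuming the signature shown in the diff:

```python
from sky import check as sky_check

sky_check.check(clouds=['aws', 'gcp'], verbose=True)  # only AWS and GCP
sky_check.check(clouds=None, verbose=False)           # all supported clouds
```
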
-    cloud_obj = clouds.CLOUD_REGISTRY.from_str(cloud)
+    cloud_obj = sky_clouds.CLOUD_REGISTRY.from_str(cloud)
    service_catalog.validate_region_zone(region, None, clouds=cloud)
    show_all = all
    if show_all and accelerator_str is not None:
        raise click.UsageError('--all is only allowed without a GPU name.')

+    # Kubernetes specific bools
+    cloud_is_kubernetes = isinstance(cloud_obj, sky_clouds.Kubernetes)
+    kubernetes_autoscaling = kubernetes_utils.get_autoscaler_type() is not None
+    kubernetes_is_enabled = sky_clouds.cloud_in_iterable(
+        sky_clouds.Kubernetes(), global_user_state.get_cached_enabled_clouds())
+
+    if cloud_is_kubernetes and region is not None:
+        raise click.UsageError(
+            'The --region flag cannot be set with --cloud kubernetes.')
+
    def _list_to_str(lst):
        return ', '.join([str(e) for e in lst])

+    def _get_kubernetes_realtime_gpu_table(
+            name_filter: Optional[str] = None,
+            quantity_filter: Optional[int] = None):
+        if quantity_filter:
+            qty_header = 'QTY_FILTER'
+            free_header = 'FILTERED_FREE_GPUS'
+        else:
+            qty_header = 'QTY_PER_NODE'
+            free_header = 'TOTAL_FREE_GPUS'
+        realtime_gpu_table = log_utils.create_table(
+            ['GPU', qty_header, 'TOTAL_GPUS', free_header])
+        counts, capacity, available = service_catalog.list_accelerator_realtime(
+            gpus_only=True,
+            clouds='kubernetes',
+            name_filter=name_filter,
+            region_filter=region,
+            quantity_filter=quantity_filter,
+            case_sensitive=False)
+        assert (set(counts.keys()) == set(capacity.keys()) == set(
+            available.keys())), (f'Keys of counts ({list(counts.keys())}), '
+                                 f'capacity ({list(capacity.keys())}), '
+                                 f'and available ({list(available.keys())}) '
+                                 'must be the same.')
+        if len(counts) == 0:
+            err_msg = 'No GPUs found in Kubernetes cluster. '
+            debug_msg = 'To further debug, run: sky check '
+            if name_filter is not None:
+                gpu_info_msg = f' {name_filter!r}'
+                if quantity_filter is not None:
+                    gpu_info_msg += (' with requested quantity'
+                                     f' {quantity_filter}')
+                err_msg = (f'Resources{gpu_info_msg} not found '
+                           'in Kubernetes cluster. ')
+                debug_msg = ('To show available accelerators on kubernetes,'
+                             ' run: sky show-gpus --cloud kubernetes ')
+            full_err_msg = (err_msg + kubernetes_utils.NO_GPU_HELP_MESSAGE +
+                            debug_msg)
+            raise ValueError(full_err_msg)
+        for gpu, _ in sorted(counts.items()):
+            realtime_gpu_table.add_row([
+                gpu,
+                _list_to_str(counts.pop(gpu)), capacity[gpu], available[gpu]
+            ])
+        return realtime_gpu_table

    def _output():
        gpu_table = log_utils.create_table(
            ['COMMON_GPU', 'AVAILABLE_QUANTITIES'])
@@ -2909,29 +3082,69 @@ def _output():

        name, quantity = None, None

+        # Optimization - do not poll the Kubernetes API when fetching common
+        # GPUs here, because Kubernetes data is fetched later for its own
+        # table, which comes after the common GPUs.
+        clouds_to_list = cloud
+        if cloud is None:
+            clouds_to_list = [
+                c for c in service_catalog.ALL_CLOUDS if c != 'kubernetes'
+            ]
+
+        k8s_messages = ''
        if accelerator_str is None:
+            # Collect k8s related messages in k8s_messages and print them at
+            # the end.
+            print_section_titles = False
+            # If cloud is kubernetes, we want to show real-time capacity
+            if kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes):
+                try:
+                    # If --cloud kubernetes is not specified, we want to catch
+                    # the case where no GPUs are available on the cluster and
+                    # print the warning at the end.
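
[Editor's note] `_get_kubernetes_realtime_gpu_table()` above leans on `service_catalog.list_accelerator_realtime()` returning three dicts keyed by the same GPU names: requestable per-node quantities, total cluster capacity, and currently-free units. A toy illustration of how those aligned dicts become table rows; the data is invented:

```python
counts = {'A100': [1, 2, 4], 'T4': [1, 2]}  # QTY_PER_NODE options
capacity = {'A100': 8, 'T4': 4}             # TOTAL_GPUS
available = {'A100': 5, 'T4': 4}            # TOTAL_FREE_GPUS

# The three dicts must cover exactly the same GPU names.
assert set(counts) == set(capacity) == set(available)
for gpu in sorted(counts):
    qty = ', '.join(str(c) for c in counts[gpu])
    print(f'{gpu:6} qty/node: {qty:10} total: {capacity[gpu]} '
          f'free: {available[gpu]}')
```
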
+ k8s_realtime_table = _get_kubernetes_realtime_gpu_table() + except ValueError as e: + if not cloud_is_kubernetes: + # Make it a note if cloud is not kubernetes + k8s_messages += 'Note: ' + k8s_messages += str(e) + else: + print_section_titles = True + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + yield from k8s_realtime_table.get_string() + if kubernetes_autoscaling: + k8s_messages += ( + '\n' + kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE) + if cloud_is_kubernetes: + # Do not show clouds if --cloud kubernetes is specified + if not kubernetes_is_enabled: + yield ('Kubernetes is not enabled. To fix, run: ' + 'sky check kubernetes ') + yield k8s_messages + return + + # For show_all, show the k8s message at the start since output is + # long and the user may not scroll to the end. + if show_all and k8s_messages: + yield k8s_messages + yield '\n\n' + result = service_catalog.list_accelerator_counts( gpus_only=True, - clouds=cloud, + clouds=clouds_to_list, region_filter=region, ) - if (len(result) == 0 and cloud_obj is not None and - cloud_obj.is_same_cloud(clouds.Kubernetes())): - yield kubernetes_utils.NO_GPU_ERROR_MESSAGE - return + if print_section_titles: + # If section titles were printed above, print again here + yield '\n\n' + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Cloud GPUs{colorama.Style.RESET_ALL}\n') # "Common" GPUs - # If cloud is kubernetes, we want to show all GPUs here, even if - # they are not listed as common in SkyPilot. - if (cloud_obj is not None and - cloud_obj.is_same_cloud(clouds.Kubernetes())): - for gpu, _ in sorted(result.items()): + for gpu in service_catalog.get_common_gpus(): + if gpu in result: gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))]) - else: - for gpu in service_catalog.get_common_gpus(): - if gpu in result: - gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))]) yield from gpu_table.get_string() # Google TPUs @@ -2952,6 +3165,9 @@ def _output(): else: yield ('\n\nHint: use -a/--all to see all accelerators ' '(including non-common ones) and pricing.') + if k8s_messages: + yield '\n' + yield k8s_messages return else: # Parse accelerator string @@ -2975,25 +3191,82 @@ def _output(): else: name, quantity = accelerator_str, None + print_section_titles = False + if (kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes) and + not show_all): + # Print section title if not showing all and instead a specific + # accelerator is requested + print_section_titles = True + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + try: + k8s_realtime_table = _get_kubernetes_realtime_gpu_table( + name_filter=name, quantity_filter=quantity) + yield from k8s_realtime_table.get_string() + except ValueError as e: + # In the case of a specific accelerator, show the error message + # immediately (e.g., "Resources H100 not found ...") + yield str(e) + if kubernetes_autoscaling: + k8s_messages += ('\n' + + kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE) + yield k8s_messages + if cloud_is_kubernetes: + # Do not show clouds if --cloud kubernetes is specified + if not kubernetes_is_enabled: + yield ('Kubernetes is not enabled. 
To fix, run: ' + 'sky check kubernetes ') + return + + # For clouds other than Kubernetes, get the accelerator details # Case-sensitive result = service_catalog.list_accelerators(gpus_only=True, name_filter=name, quantity_filter=quantity, region_filter=region, - clouds=cloud, + clouds=clouds_to_list, case_sensitive=False, all_regions=all_regions) + # Import here to save module load speed. + # pylint: disable=import-outside-toplevel,line-too-long + from sky.clouds.service_catalog import common + + # For each gpu name (count not included): + # - Group by cloud + # - Sort within each group by prices + # - Sort groups by each cloud's (min price, min spot price) + new_result = {} + for i, (gpu, items) in enumerate(result.items()): + df = pd.DataFrame([t._asdict() for t in items]) + # Determine the minimum prices for each cloud. + min_price_df = df.groupby('cloud').agg(min_price=('price', 'min'), + min_spot_price=('spot_price', + 'min')) + df = df.merge(min_price_df, on='cloud') + # Sort within each cloud by price. + df = df.groupby('cloud', group_keys=False).apply( + lambda x: x.sort_values(by=['price', 'spot_price'])) + # Sort across groups (clouds). + df = df.sort_values(by=['min_price', 'min_spot_price']) + df = df.drop(columns=['min_price', 'min_spot_price']) + sorted_dataclasses = [ + common.InstanceTypeInfo(*row) + for row in df.to_records(index=False) + ] + new_result[gpu] = sorted_dataclasses + result = new_result - if len(result) == 0: - if cloud == 'kubernetes': - yield kubernetes_utils.NO_GPU_ERROR_MESSAGE - return + if print_section_titles and not show_all: + yield '\n\n' + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Cloud GPUs{colorama.Style.RESET_ALL}\n') + if len(result) == 0: quantity_str = (f' with requested quantity {quantity}' if quantity else '') - yield f'Resources \'{name}\'{quantity_str} not found. ' - yield 'Try \'sky show-gpus --all\' ' - yield 'to show available accelerators.' + cloud_str = f' on {cloud_obj}.' if cloud else ' in cloud catalogs.' + yield f'Resources \'{name}\'{quantity_str} not found{cloud_str} ' + yield 'To show available accelerators, run: sky show-gpus --all' return for i, (gpu, items) in enumerate(result.items()): @@ -3016,13 +3289,14 @@ def _output(): instance_type_str = item.instance_type if not pd.isna( item.instance_type) else '(attachable)' cpu_count = item.cpu_count - if pd.isna(cpu_count): - cpu_str = '-' - elif isinstance(cpu_count, (float, int)): + if not pd.isna(cpu_count) and isinstance( + cpu_count, (float, int)): if int(cpu_count) == cpu_count: cpu_str = str(int(cpu_count)) else: cpu_str = f'{cpu_count:.1f}' + else: + cpu_str = '-' device_memory_str = (f'{item.device_memory:.0f}GB' if not pd.isna(item.device_memory) else '-') host_memory_str = f'{item.memory:.0f}GB' if not pd.isna( @@ -3147,12 +3421,12 @@ def bench(): @cli.group(cls=_NaturalOrderGroup) -def spot(): - """Managed Spot CLI (spot instances with auto-recovery).""" +def jobs(): + """Managed Jobs CLI (jobs with auto-recovery).""" pass -@spot.command('launch', cls=_DocumentedCodeCommand) +@jobs.command('launch', cls=_DocumentedCodeCommand) @click.argument('entrypoint', required=True, type=str, @@ -3160,10 +3434,16 @@ def spot(): **_get_shell_complete_args(_complete_file_name)) # TODO(zhwu): Add --dryrun option to test the launch command. 
@_add_click_options(_TASK_OPTIONS_WITH_NAME + _EXTRA_RESOURCES_OPTIONS)
-@click.option('--spot-recovery',
+@click.option('--cluster',
+              '-c',
              default=None,
              type=str,
-              help='Spot recovery strategy to use for the managed spot task.')
+              hidden=True,
+              help=('Alias for --name, the name of the managed job.'))
+@click.option('--job-recovery',
+              default=None,
+              type=str,
+              help='Recovery strategy to use for managed jobs.')
@click.option(
    '--detach-run',
    '-d',
@@ -3181,8 +3461,8 @@ def spot():
    '(Default: True; this flag is deprecated and will be removed in a '
    'future release.) Whether to retry provisioning infinitely until the '
    'cluster is up, if unavailability errors are encountered. This '  # pylint: disable=bad-docstring-quotes
-    'applies to launching the spot clusters (both the initial and any '
-    'recovery attempts), not the spot controller.'))
+    'applies to launching all managed jobs (both the initial and '
+    'any recovery attempts), not the jobs controller.'))
@click.option('--yes',
              '-y',
              is_flag=True,
@@ -3191,9 +3471,10 @@ def spot():
              help='Skip confirmation prompt.')
@timeline.event
@usage_lib.entrypoint
-def spot_launch(
-    entrypoint: List[str],
+def jobs_launch(
+    entrypoint: Tuple[str, ...],
    name: Optional[str],
+    cluster: Optional[str],
    workdir: Optional[str],
    cloud: Optional[str],
    region: Optional[str],
@@ -3205,7 +3486,7 @@ def spot_launch(
    num_nodes: Optional[int],
    use_spot: Optional[bool],
    image_id: Optional[str],
-    spot_recovery: Optional[str],
+    job_recovery: Optional[str],
    env_file: Optional[Dict[str, str]],
    env: List[Tuple[str, str]],
    disk_size: Optional[int],
@@ -3215,7 +3496,7 @@ def spot_launch(
    retry_until_up: bool,
    yes: bool,
):
-    """Launch a managed spot job from a YAML or a command.
+    """Launch a managed job from a YAML or a command.

    If ENTRYPOINT points to a valid YAML file, it is read in as the task
    specification. Otherwise, it is interpreted as a bash command.
@@ -3225,10 +3506,15 @@ def spot_launch(
    .. code-block:: bash

        # You can use normal task YAMLs.
-        sky spot launch task.yaml
+        sky jobs launch task.yaml

-        sky spot launch 'echo hello!'
+        sky jobs launch 'echo hello!'
    """
+    if cluster is not None:
+        if name is not None and name != cluster:
+            raise click.UsageError('Cannot specify both --name and --cluster. '
+                                   'Use only one of the flags; they are '
+                                   'aliases.')
+        name = cluster
    env = _merge_env_vars(env_file, env)
    task_or_dag = _make_task_or_dag_from_entrypoint_with_overrides(
        entrypoint,
@@ -3248,16 +3534,17 @@ def spot_launch(
        disk_size=disk_size,
        disk_tier=disk_tier,
        ports=ports,
-        spot_recovery=spot_recovery,
+        job_recovery=job_recovery,
    )
-    # Deprecation.
+    # Deprecation. We set the default behavior to be retry until up, and the
+    # flag `--retry-until-up` is deprecated. We can remove the flag in 0.8.0.
    if retry_until_up is not None:
        flag_str = '--retry-until-up'
        if not retry_until_up:
            flag_str = '--no-retry-until-up'
        click.secho(
            f'Flag {flag_str} is deprecated and will be removed in a '
-            'future release (managed spot jobs will always be retried). '
+            'future release (managed jobs will always be retried). '
            'Please file an issue if this does not work for you.',
            fg='yellow')
    else:
@@ -3275,27 +3562,26 @@ def spot_launch(
    dag.name = name

    dag_utils.maybe_infer_and_fill_dag_and_task_names(dag)
-    dag_utils.fill_default_spot_config_in_dag_for_spot_launch(dag)
+    dag_utils.fill_default_config_in_dag_for_job_launch(dag)

-    click.secho(
-        f'Managed spot job {dag.name!r} will be launched on (estimated):',
-        fg='yellow')
+    click.secho(f'Managed job {dag.name!r} will be launched on (estimated):',
+                fg='yellow')
    dag = sky.optimize(dag)

    if not yes:
-        prompt = f'Launching the spot job {dag.name!r}. Proceed?'
+        prompt = f'Launching a managed job {dag.name!r}. Proceed?'
        if prompt is not None:
            click.confirm(prompt, default=True, abort=True, show_default=True)

    common_utils.check_cluster_name_is_valid(name)

-    spot_lib.launch(dag,
-                    name,
-                    detach_run=detach_run,
-                    retry_until_up=retry_until_up)
+    managed_jobs.launch(dag,
+                        name,
+                        detach_run=detach_run,
+                        retry_until_up=retry_until_up)


-@spot.command('queue', cls=_DocumentedCodeCommand)
+@jobs.command('queue', cls=_DocumentedCodeCommand)
@click.option('--all',
              '-a',
              default=False,
@@ -3308,7 +3594,7 @@ def spot_launch(
    default=False,
    is_flag=True,
    required=False,
-    help='Query the latest statuses, restarting the spot controller if stopped.'
+    help='Query the latest statuses, restarting the jobs controller if stopped.'
)
@click.option('--skip-finished',
              '-s',
@@ -3318,21 +3604,21 @@ def spot_launch(
              help='Show only pending/running jobs\' information.')
@usage_lib.entrypoint
# pylint: disable=redefined-builtin
-def spot_queue(all: bool, refresh: bool, skip_finished: bool):
-    """Show statuses of managed spot jobs.
+def jobs_queue(all: bool, refresh: bool, skip_finished: bool):
+    """Show statuses of managed jobs.

-    Each spot job can have one of the following statuses:
+    Each managed job can have one of the following statuses:

-    - ``PENDING``: Job is waiting for a free slot on the spot controller to be
+    - ``PENDING``: Job is waiting for a free slot on the jobs controller to be
      accepted.

-    - ``SUBMITTED``: Job is submitted to and accepted by the spot controller.
+    - ``SUBMITTED``: Job is submitted to and accepted by the jobs controller.

-    - ``STARTING``: Job is starting (provisioning a spot cluster).
+    - ``STARTING``: Job is starting (provisioning a cluster for the job).

    - ``RUNNING``: Job is running.

-    - ``RECOVERING``: The spot cluster is recovering from a preemption.
+    - ``RECOVERING``: The cluster of the job is recovering from a preemption.

    - ``SUCCEEDED``: Job succeeded.

@@ -3355,12 +3641,12 @@ def spot_queue(all: bool, refresh: bool, skip_finished: bool):
    - ``FAILED_CONTROLLER``: Job failed due to an unexpected error in the spot
      controller.

-    If the job failed, either due to user code or spot unavailability, the
-    error log can be found with ``sky spot logs --controller``, e.g.:
+    If the job failed, either due to user code or resource unavailability, the
+    error log can be found with ``sky jobs logs --controller``, e.g.:

    .. code-block:: bash

-        sky spot logs --controller job_id
+        sky jobs logs --controller job_id

    This also shows the logs for provisioning and any preemption and recovery
    attempts.

@@ -3369,37 +3655,37 @@ def spot_queue(all: bool, refresh: bool, skip_finished: bool):

    ..
code-block:: bash - watch -n60 sky spot queue + watch -n60 sky jobs queue """ - click.secho('Fetching managed spot job statuses...', fg='yellow') - with rich_utils.safe_status('[cyan]Checking spot jobs[/]'): - _, msg = _get_spot_jobs(refresh=refresh, - skip_finished=skip_finished, - show_all=all, - is_called_by_user=True) + click.secho('Fetching managed job statuses...', fg='yellow') + with rich_utils.safe_status('[cyan]Checking managed jobs[/]'): + _, msg = _get_managed_jobs(refresh=refresh, + skip_finished=skip_finished, + show_all=all, + is_called_by_user=True) if not skip_finished: in_progress_only_hint = '' else: in_progress_only_hint = ' (showing in-progress jobs only)' click.echo(f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'Managed spot jobs{colorama.Style.RESET_ALL}' + f'Managed jobs{colorama.Style.RESET_ALL}' f'{in_progress_only_hint}\n{msg}') -@spot.command('cancel', cls=_DocumentedCodeCommand) +@jobs.command('cancel', cls=_DocumentedCodeCommand) @click.option('--name', '-n', required=False, type=str, - help='Managed spot job name to cancel.') + help='Managed job name to cancel.') @click.argument('job_ids', default=None, type=int, required=False, nargs=-1) @click.option('--all', '-a', is_flag=True, default=False, required=False, - help='Cancel all managed spot jobs.') + help='Cancel all managed jobs.') @click.option('--yes', '-y', is_flag=True, @@ -3408,8 +3694,8 @@ def spot_queue(all: bool, refresh: bool, skip_finished: bool): help='Skip confirmation prompt.') @usage_lib.entrypoint # pylint: disable=redefined-builtin -def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): - """Cancel managed spot jobs. +def jobs_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): + """Cancel managed jobs. You can provide either a job name or a list of job IDs to be cancelled. They are exclusive options. @@ -3418,15 +3704,15 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): .. code-block:: bash - # Cancel managed spot job with name 'my-job' - $ sky spot cancel -n my-job + # Cancel managed job with name 'my-job' + $ sky jobs cancel -n my-job \b - # Cancel managed spot jobs with IDs 1, 2, 3 - $ sky spot cancel 1 2 3 + # Cancel managed jobs with IDs 1, 2, 3 + $ sky jobs cancel 1 2 3 """ backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, - stopped_message='All managed spot jobs should have finished.', + controller=controller_utils.Controllers.JOBS_CONTROLLER, + stopped_message='All managed jobs should have finished.', exit_if_not_accessible=True) job_id_str = ','.join(map(str, job_ids)) @@ -3439,24 +3725,24 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): f'Provided {argument_str!r}.') if not yes: - job_identity_str = (f'managed spot jobs with IDs {job_id_str}' + job_identity_str = (f'managed jobs with IDs {job_id_str}' if job_ids else repr(name)) if all: - job_identity_str = 'all managed spot jobs' + job_identity_str = 'all managed jobs' click.confirm(f'Cancelling {job_identity_str}. 
Proceed?', default=True, abort=True, show_default=True) - spot_lib.cancel(job_ids=job_ids, name=name, all=all) + managed_jobs.cancel(job_ids=job_ids, name=name, all=all) -@spot.command('logs', cls=_DocumentedCodeCommand) +@jobs.command('logs', cls=_DocumentedCodeCommand) @click.option('--name', '-n', required=False, type=str, - help='Managed spot job name.') + help='Managed job name.') @click.option( '--follow/--no-follow', is_flag=True, @@ -3471,22 +3757,20 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): 'launching/recoveries, etc.')) @click.argument('job_id', required=False, type=int) @usage_lib.entrypoint -def spot_logs(name: Optional[str], job_id: Optional[int], follow: bool, +def jobs_logs(name: Optional[str], job_id: Optional[int], follow: bool, controller: bool): - """Tail the log of a managed spot job.""" + """Tail the log of a managed job.""" try: - if controller: - core.tail_logs(spot_lib.SPOT_CONTROLLER_NAME, - job_id=job_id, - follow=follow) - else: - spot_lib.tail_logs(name=name, job_id=job_id, follow=follow) - except exceptions.ClusterNotUpError as e: - click.echo(e) - sys.exit(1) + managed_jobs.tail_logs(name=name, + job_id=job_id, + follow=follow, + controller=controller) + except exceptions.ClusterNotUpError: + with ux_utils.print_exception_no_traceback(): + raise -@spot.command('dashboard', cls=_DocumentedCodeCommand) +@jobs.command('dashboard', cls=_DocumentedCodeCommand) @click.option( '--port', '-p', @@ -3496,19 +3780,18 @@ def spot_logs(name: Optional[str], job_id: Optional[int], follow: bool, help=('Local port to use for the dashboard. If None, a free port is ' 'automatically chosen.')) @usage_lib.entrypoint -def spot_dashboard(port: Optional[int]): - """Opens a dashboard for spot jobs (needs controller to be UP).""" +def jobs_dashboard(port: Optional[int]): + """Opens a dashboard for managed jobs (needs controller to be UP).""" # TODO(zongheng): ideally, the controller/dashboard server should expose the # API perhaps via REST. Then here we would (1) not have to use SSH to try to # see if the controller is UP first, which is slow; (2) not have to run SSH # port forwarding first (we'd just launch a local dashboard which would make # REST API calls to the controller dashboard server). - click.secho('Checking if spot controller is up...', fg='yellow') - hint = ( - 'Dashboard is not available if spot controller is not up. Run a spot ' - 'job first.') + click.secho('Checking if jobs controller is up...', fg='yellow') + hint = ('Dashboard is not available if jobs controller is not up. Run a ' + 'managed job first.') backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, + controller=controller_utils.Controllers.JOBS_CONTROLLER, stopped_message=hint, non_existent_message=hint, exit_if_not_accessible=True) @@ -3519,8 +3802,9 @@ def spot_dashboard(port: Optional[int]): free_port = common_utils.find_free_port(remote_port) else: free_port = port - ssh_command = (f'ssh -qNL {free_port}:localhost:{remote_port} ' - f'{spot_lib.SPOT_CONTROLLER_NAME}') + ssh_command = ( + f'ssh -qNL {free_port}:localhost:{remote_port} ' + f'{controller_utils.Controllers.JOBS_CONTROLLER.value.cluster_name}') click.echo('Forwarding port: ', nl=False) click.secho(f'{ssh_command}', dim=True) @@ -3539,12 +3823,31 @@ def spot_dashboard(port: Optional[int]): try: os.killpg(os.getpgid(ssh_process.pid), signal.SIGTERM) except ProcessLookupError: - # This happens if spot controller is auto-stopped. 
+ # This happens if jobs controller is auto-stopped. pass finally: click.echo('Exiting.') +# TODO(zhwu): Backward compatibility for the old `sky spot launch` command. +# It is now renamed to `sky jobs launch` and the old command is deprecated. +# Remove in v0.8.0. +@cli.group(cls=_NaturalOrderGroup) +def spot(): + """Alias for Managed Jobs CLI (default to managed spot jobs).""" + pass + + +_add_command_alias(jobs, + jobs_launch, + new_group=spot, + override_command_argument={'use_spot': True}) +_add_command_alias(jobs, jobs_queue, new_group=spot) +_add_command_alias(jobs, jobs_logs, new_group=spot) +_add_command_alias(jobs, jobs_cancel, new_group=spot) +_add_command_alias(jobs, jobs_dashboard, new_group=spot) + + @cli.group(cls=_NaturalOrderGroup) def serve(): """SkyServe CLI (multi-region, multi-cloud serving).""" @@ -3553,7 +3856,7 @@ def serve(): def _generate_task_with_service( service_name: str, - service_yaml_args: List[str], + service_yaml_args: Tuple[str, ...], workdir: Optional[str], cloud: Optional[str], region: Optional[str], @@ -3658,7 +3961,7 @@ def _generate_task_with_service( @timeline.event @usage_lib.entrypoint def serve_up( - service_yaml: List[str], + service_yaml: Tuple[str, ...], service_name: Optional[str], workdir: Optional[str], cloud: Optional[str], @@ -3776,7 +4079,7 @@ def serve_up( @usage_lib.entrypoint def serve_update( service_name: str, - service_yaml: List[str], + service_yaml: Tuple[str, ...], workdir: Optional[str], cloud: Optional[str], region: Optional[str], @@ -4040,7 +4343,7 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): f'Provided {argument_str!r}.') backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SKY_SERVE_CONTROLLER, + controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, stopped_message='All services should have been terminated.', exit_if_not_accessible=True) @@ -4122,9 +4425,9 @@ def serve_logs( target=target_component, replica_id=replica_id, follow=follow) - except exceptions.ClusterNotUpError as e: - click.echo(e) - sys.exit(1) + except exceptions.ClusterNotUpError: + with ux_utils.print_exception_no_traceback(): + raise # ============================== @@ -4811,13 +5114,14 @@ def local_up(gpus: bool): f'exists.{style.RESET_ALL}\n' 'If you want to delete it instead, run: sky local down') else: - click.echo('Failed to create local cluster. ' - f'Full log: {log_path}' - f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') - sys.exit(1) + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + 'Failed to create local cluster. ' + f'Full log: {log_path}' + f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') # Run sky check with rich_utils.safe_status('[bold cyan]Running sky check...'): - sky_check.check(quiet=True) + sky_check.check(clouds=['kubernetes'], quiet=True) if cluster_created: # Prepare completion message which shows CPU and GPU count # Get number of CPUs @@ -4911,14 +5215,15 @@ def local_down(): elif returncode == 100: click.echo('\nLocal cluster does not exist.') else: - click.echo('Failed to create local cluster. ' - f'Stdout: {stdout}' - f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') - sys.exit(1) + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + 'Failed to create local cluster. 
' + f'Stdout: {stdout}' + f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') if cluster_removed: # Run sky check with rich_utils.safe_status('[bold cyan]Running sky check...'): - sky_check.check(quiet=True) + sky_check.check(clouds=['kubernetes'], quiet=True) click.echo( f'{colorama.Fore.GREEN}Local cluster removed.{style.RESET_ALL}') diff --git a/sky/clouds/__init__.py b/sky/clouds/__init__.py index 6f3cb642542..af8fb0f3032 100644 --- a/sky/clouds/__init__.py +++ b/sky/clouds/__init__.py @@ -1,7 +1,7 @@ """Clouds in Sky.""" from sky.clouds.cloud import Cloud -from sky.clouds.cloud import cloud_in_list +from sky.clouds.cloud import cloud_in_iterable from sky.clouds.cloud import CloudImplementationFeatures from sky.clouds.cloud import ProvisionerVersion from sky.clouds.cloud import Region @@ -47,5 +47,5 @@ 'StatusVersion', 'Fluidstack', # Utility functions - 'cloud_in_list', + 'cloud_in_iterable', ] diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py index 6008c4d080d..b2b55e14d5e 100644 --- a/sky/clouds/aws.py +++ b/sky/clouds/aws.py @@ -37,7 +37,7 @@ # It has the following purposes: # - make all nodes (any cloud) able to access private S3 buckets # - make some remote nodes able to launch new nodes on AWS (i.e., makes -# AWS head node able to launch AWS workers, or any-cloud spot controller +# AWS head node able to launch AWS workers, or any-cloud jobs controller # able to launch spot clusters on AWS). # # If we detect the current user identity is AWS SSO, we will not upload this @@ -81,6 +81,8 @@ class AWSIdentityType(enum.Enum): IAM_ROLE = 'iam-role' + CONTAINER_ROLE = 'container-role' + # Name Value Type Location # ---- ----- ---- -------- # profile None None @@ -338,9 +340,6 @@ def get_egress_cost(self, num_gigabytes: float): cost += 0.0 return cost - def is_same_cloud(self, other: clouds.Cloud): - return isinstance(other, AWS) - @classmethod def get_default_instance_type( cls, @@ -541,10 +540,16 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]: elif identity_type == AWSIdentityType.IAM_ROLE: # When using an IAM role, the credentials may not exist in the # ~/.aws/credentials file. So we don't check for the existence of the - # file. This will happen when the user is on a VM (or spot-controller) - # created by an SSO account, i.e. the VM will be assigned the IAM - # role: skypilot-v1. + # file. This will happen when the user is on a VM (or + # jobs-controller) created by an SSO account, i.e. the VM will be + # assigned the IAM role: skypilot-v1. hints = f'AWS IAM role is set.{single_cloud_hint}' + elif identity_type == AWSIdentityType.CONTAINER_ROLE: + # Similar to the IAM ROLE, an ECS container may not store credentials + # in the~/.aws/credentials file. So we don't check for the existence of + # the file. i.e. the container will be assigned the IAM role of the + # task: skypilot-v1. + hints = f'AWS container-role is set.{single_cloud_hint}' else: # This file is required because it is required by the VMs launched on # other clouds to access private s3 buckets and resources like EC2. @@ -604,6 +609,8 @@ def _is_access_key_of_type(type_str: str) -> bool: return AWSIdentityType.SSO elif _is_access_key_of_type(AWSIdentityType.IAM_ROLE.value): return AWSIdentityType.IAM_ROLE + elif _is_access_key_of_type(AWSIdentityType.CONTAINER_ROLE.value): + return AWSIdentityType.CONTAINER_ROLE elif _is_access_key_of_type(AWSIdentityType.ENV.value): return AWSIdentityType.ENV else: @@ -745,7 +752,7 @@ def get_credential_file_mounts(self) -> Dict[str, str]: # credentials. 
We need to define a mechanism to find out the cloud # provider of the cluster to be launched in this function and make sure # the cluster will not be used for launching clusters in other clouds, - # e.g. spot controller. + # e.g. jobs controller. if self._current_identity_type( ) != AWSIdentityType.SHARED_CREDENTIALS_FILE: return {} @@ -962,3 +969,22 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None: error_msg=f'Failed to delete image {image_id!r} on {region}.', stderr=stderr, stream_logs=True) + + @classmethod + def is_label_valid(cls, label_key: str, + label_value: str) -> Tuple[bool, Optional[str]]: + key_regex = re.compile(r'^[^aws:][\S]{0,127}$') + value_regex = re.compile(r'^[\S]{0,255}$') + key_valid = bool(key_regex.match(label_key)) + value_valid = bool(value_regex.match(label_value)) + error_msg = None + if not key_valid: + error_msg = (f'Invalid tag key {label_key} for AWS. ' + 'Key must start with any character except \'aws:\' ' + 'and must be 128 characters or fewer in length.') + if not value_valid: + error_msg = (f'Invalid tag value {label_value} for AWS. ' + 'Value must be 256 characters or fewer in length.') + if not key_valid or not value_valid: + return False, error_msg + return True, None diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py index a0e16c83471..4df1cd4a4bf 100644 --- a/sky/clouds/azure.py +++ b/sky/clouds/azure.py @@ -134,9 +134,6 @@ def get_egress_cost(self, num_gigabytes: float): cost += 0.0 return cost - def is_same_cloud(self, other): - return isinstance(other, Azure) - @classmethod def get_image_size(cls, image_id: str, region: Optional[str]) -> float: if region is None: @@ -163,7 +160,7 @@ def get_image_size(cls, image_id: str, region: Optional[str]) -> float: try: image = compute_client.virtual_machine_images.get( region, publisher, offer, sku, version) - except azure.exceptions().ResourceNotFoundError() as e: + except azure.exceptions().ResourceNotFoundError as e: with ux_utils.print_exception_no_traceback(): raise ValueError(f'Image not found: {image_id}') from e if image.os_disk_image is None: @@ -294,7 +291,8 @@ def make_deploy_resources_variables( else: custom_resources = None - if resources.image_id is None: + if (resources.image_id is None or + resources.extract_docker_image() is not None): # pylint: disable=import-outside-toplevel from sky.clouds.service_catalog import azure_catalog gen_version = azure_catalog.get_gen_version_from_instance_type( diff --git a/sky/clouds/cloud.py b/sky/clouds/cloud.py index 34000fb9a16..c5ff78e1c79 100644 --- a/sky/clouds/cloud.py +++ b/sky/clouds/cloud.py @@ -42,7 +42,8 @@ class CloudImplementationFeatures(enum.Enum): CUSTOM_DISK_TIER = 'custom_disk_tier' OPEN_PORTS = 'open_ports' STORAGE_MOUNTING = 'storage_mounting' - HOST_CONTROLLERS = 'host_controllers' # Can run spot/serve controllers + HOST_CONTROLLERS = 'host_controllers' # Can run jobs/serve controllers + AUTO_TERMINATE = 'auto_terminate' # Pod/VM can stop or down itself class Region(collections.namedtuple('Region', ['name'])): @@ -246,8 +247,8 @@ def get_egress_cost(self, num_gigabytes: float): """ raise NotImplementedError - def is_same_cloud(self, other: 'Cloud'): - raise NotImplementedError + def is_same_cloud(self, other: 'Cloud') -> bool: + return isinstance(other, self.__class__) def make_deploy_resources_variables( self, @@ -320,6 +321,25 @@ def is_image_tag_valid(cls, image_tag: str, region: Optional[str]) -> bool: region, clouds=cls._REPR.lower()) + @classmethod + def is_label_valid(cls, label_key: str, + label_value: 
str) -> Tuple[bool, Optional[str]]:
+        """Validates that the label key and value are valid for this cloud.
+
+        Labels can be implemented in different ways across clouds. For example,
+        on AWS we use instance tags, on GCP we use labels, and on Kubernetes we
+        use labels. This method should be implemented to validate the label
+        format for the cloud.
+
+        Returns:
+            A tuple of a boolean indicating whether the label is valid and an
+            optional string describing the reason if the label is invalid.
+        """
+        # If a cloud does not support labels, they are ignored. Only clouds
+        # that support labels implement this method.
+        del label_key, label_value
+        return True, None
+
    def get_feasible_launchable_resources(
            self,
            resources: 'resources_lib.Resources',
@@ -477,15 +497,16 @@ def validate_region_zone(
                                                  zone,
                                                  clouds=self._REPR.lower())

-    def need_cleanup_after_preemption(
+    def need_cleanup_after_preemption_or_failure(
            self, resources: 'resources_lib.Resources') -> bool:
-        """Returns whether a spot resource needs cleanup after preeemption.
+        """Whether a resource needs cleanup after preemption or failure.

        In most cases, spot resources do not need cleanup after preemption,
        as long as the cluster can be relaunched with the same name and tag,
        no matter the preemption behavior is to terminate or stop the cluster.
-        The only exception by far is GCP's Spot TPU VM. We override this method
-        in gcp.py.
+        The same applies to on-demand resources that go into maintenance
+        mode. The only exception so far is GCP's TPU VM. We override this
+        method in gcp.py.
        """
        del resources
        return False
@@ -746,6 +767,6 @@ def __getstate__(self):


# === Helper functions ===


-def cloud_in_list(cloud: Cloud, cloud_list: Iterable[Cloud]) -> bool:
+def cloud_in_iterable(cloud: Cloud, cloud_list: Iterable[Cloud]) -> bool:
    """Returns whether the cloud is in the given cloud list."""
    return any(cloud.is_same_cloud(c) for c in cloud_list)
diff --git a/sky/clouds/cudo.py b/sky/clouds/cudo.py
index ad7a22e6e03..1a32bb0bd2c 100644
--- a/sky/clouds/cudo.py
+++ b/sky/clouds/cudo.py
@@ -46,6 +46,15 @@ class Cudo(clouds.Cloud):
        'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html'
    )

+    _PROJECT_HINT = (
+        'Create a project and then set it as the default project:\n'
+        f'{_INDENT_PREFIX}  $ cudoctl projects create my-project-name\n'
+        f'{_INDENT_PREFIX}  $ cudoctl init\n'
+        f'{_INDENT_PREFIX}For more info: '
+        # pylint: disable=line-too-long
+        'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html'
+    )
+
    _CLOUD_UNSUPPORTED_FEATURES = {
        clouds.CloudImplementationFeatures.STOP: 'Stopping not supported.',
        clouds.CloudImplementationFeatures.SPOT_INSTANCE:
@@ -155,10 +164,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float:
        # `return 0.0` is a good placeholder.)
        return 0.0

-    def is_same_cloud(self, other: clouds.Cloud) -> bool:
-        # Returns true if the two clouds are the same cloud type.
-        return isinstance(other, Cudo)
-
    @classmethod
    def get_default_instance_type(
            cls,
@@ -293,7 +298,9 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]:
        project_id, error = cudo_api.get_project_id()
        if error is not None:
            return False, (
-                f'Error getting project '
+                f'Default project is not set. 
' + f'{cls._PROJECT_HINT}\n' + f'{cls._INDENT_PREFIX}' f'{common_utils.format_exception(error, use_bracket=True)}') try: api = cudo_api.virtual_machines() diff --git a/sky/clouds/fluidstack.py b/sky/clouds/fluidstack.py index 30274c8b665..d7921a3f51a 100644 --- a/sky/clouds/fluidstack.py +++ b/sky/clouds/fluidstack.py @@ -50,6 +50,9 @@ class Fluidstack(clouds.Cloud): clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER: 'Custom disk tiers' f' is not supported in {_REPR}.', + clouds.CloudImplementationFeatures.HOST_CONTROLLERS: + 'Host controllers' + f' are not supported in {_REPR}.', } # Using the latest SkyPilot provisioner API to provision and check status. PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT @@ -137,10 +140,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: def __repr__(self): return 'Fluidstack' - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, Fluidstack) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index 4f5fefd9b10..7e7dacc539f 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -327,9 +327,6 @@ def get_egress_cost(self, num_gigabytes: float): else: return 0.08 * num_gigabytes - def is_same_cloud(self, other): - return isinstance(other, GCP) - @classmethod def _is_machine_image(cls, image_id: str) -> bool: find_machine = re.match(r'projects/.*/.*/machineImages/.*', image_id) @@ -473,8 +470,8 @@ def make_deploy_resources_variables( # CUDA driver version 535.86.10, CUDA Library 12.2 image_id = 'skypilot:gpu-debian-11' - if resources.image_id is not None and resources.extract_docker_image( - ) is None: + if (resources.image_id is not None and + resources.extract_docker_image() is None): if None in resources.image_id: image_id = resources.image_id[None] else: @@ -757,13 +754,13 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]: # pylint: disable=import-outside-toplevel,unused-import import google.auth - import googleapiclient.discovery # This takes user's credential info from "~/.config/gcloud/application_default_credentials.json". # pylint: disable=line-too-long credentials, project = google.auth.default() - crm = googleapiclient.discovery.build('cloudresourcemanager', - 'v1', - credentials=credentials) + crm = gcp.build('cloudresourcemanager', + 'v1', + credentials=credentials, + cache_discovery=False) gcp_minimal_permissions = gcp_utils.get_minimal_permissions() permissions = {'permissions': gcp_minimal_permissions} request = crm.projects().testIamPermissions(resource=project, @@ -859,13 +856,14 @@ def get_current_user_identity_str(cls) -> Optional[str]: def instance_type_exists(self, instance_type): return service_catalog.instance_type_exists(instance_type, 'gcp') - def need_cleanup_after_preemption(self, - resources: 'resources.Resources') -> bool: - """Returns whether a spot resource needs cleanup after preeemption.""" + def need_cleanup_after_preemption_or_failure( + self, resources: 'resources.Resources') -> bool: + """Whether a resource needs cleanup after preeemption or failure.""" # Spot TPU VMs require manual cleanup after preemption. # "If your Cloud TPU is preempted, # you must delete it and create a new one ..." # See: https://cloud.google.com/tpu/docs/preemptible#tpu-vm + # On-demand TPU VMs are likely to require manual cleanup as well. 
return gcp_utils.is_tpu_vm(resources) @@ -1087,3 +1085,25 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None: error_msg=f'Failed to delete image {image_name!r}', stderr=stderr, stream_logs=True) + + @classmethod + def is_label_valid(cls, label_key: str, + label_value: str) -> Tuple[bool, Optional[str]]: + key_regex = re.compile(r'^[a-z]([a-z0-9_-]{0,62})?$') + value_regex = re.compile(r'^[a-z0-9_-]{0,63}$') + key_valid = bool(key_regex.match(label_key)) + value_valid = bool(value_regex.match(label_value)) + error_msg = None + condition_msg = ('can include lowercase alphanumeric characters, ' + 'dashes, and underscores, with a total length of 63 ' + 'characters or less.') + if not key_valid: + error_msg = (f'Invalid label key {label_key} for GCP. ' + f'Key must start with a lowercase letter ' + f'and {condition_msg}') + if not value_valid: + error_msg = (f'Invalid label value {label_value} for GCP. Value ' + f'{condition_msg}') + if not key_valid or not value_valid: + return False, error_msg + return True, None diff --git a/sky/clouds/ibm.py b/sky/clouds/ibm.py index 880ad212e25..86e325a437b 100644 --- a/sky/clouds/ibm.py +++ b/sky/clouds/ibm.py @@ -165,9 +165,6 @@ def get_egress_cost(self, num_gigabytes: float): num_gigabytes -= (num_gigabytes - price_threshold['threshold']) return cost - def is_same_cloud(self, other): - return isinstance(other, IBM) - def make_deploy_resources_variables( self, resources: 'resources_lib.Resources', diff --git a/sky/clouds/kubernetes.py b/sky/clouds/kubernetes.py index e6b3b5bbe9e..140190d9fde 100644 --- a/sky/clouds/kubernetes.py +++ b/sky/clouds/kubernetes.py @@ -1,6 +1,7 @@ """Kubernetes.""" import json import os +import re import typing from typing import Dict, Iterator, List, Optional, Tuple @@ -13,6 +14,7 @@ from sky.provision.kubernetes import utils as kubernetes_utils from sky.utils import common_utils from sky.utils import resources_utils +from sky.utils import schemas if typing.TYPE_CHECKING: # Renaming to avoid shadowing variables. @@ -24,14 +26,17 @@ DEFAULT_KUBECONFIG_PATH = '~/.kube/config' CREDENTIAL_PATH = os.environ.get('KUBECONFIG', DEFAULT_KUBECONFIG_PATH) +# Namespace for SkyPilot resources shared across multiple tenants on the +# same cluster (even if they might be running in different namespaces). +# E.g., FUSE device manager daemonset is run in this namespace. +_SKYPILOT_SYSTEM_NAMESPACE = 'skypilot-system' + @clouds.CLOUD_REGISTRY.register class Kubernetes(clouds.Cloud): """Kubernetes.""" SKY_SSH_KEY_SECRET_NAME = 'sky-ssh-keys' - SKY_SSH_KEY_SECRET_FIELD_NAME = \ - f'ssh-publickey-{common_utils.get_user_hash()}' SKY_SSH_JUMP_NAME = 'sky-ssh-jump-pod' PORT_FORWARD_PROXY_CMD_TEMPLATE = \ 'kubernetes-port-forward-proxy-command.sh.j2' @@ -46,6 +51,15 @@ class Kubernetes(clouds.Cloud): timeout = skypilot_config.get_nested(['kubernetes', 'provision_timeout'], 10) + # Limit the length of the cluster name to avoid exceeding the limit of 63 + # characters for Kubernetes resources. We limit to 42 characters (63-21) to + # allow additional characters for creating ingress services to expose ports. + # These services are named as {cluster_name_on_cloud}--skypilot-svc--{port}, + # where the suffix is 21 characters long. 
+ _MAX_CLUSTER_NAME_LEN_LIMIT = 42 + + _SUPPORTS_SERVICE_ACCOUNT_ON_REMOTE = True + _DEFAULT_NUM_VCPUS = 2 _DEFAULT_MEMORY_CPU_RATIO = 1 _DEFAULT_MEMORY_CPU_RATIO_WITH_GPU = 4 # Allocate more memory for GPU tasks @@ -65,13 +79,6 @@ class Kubernetes(clouds.Cloud): 'tiers are not ' 'supported in ' 'Kubernetes.', - # Kubernetes may be using exec-based auth, which may not work by - # directly copying the kubeconfig file to the controller. - # Support for service accounts for auth will be added in #3377, which - # will allow us to support hosting controllers. - clouds.CloudImplementationFeatures.HOST_CONTROLLERS: 'Kubernetes can ' - 'not host ' - 'controllers.', } IMAGE_CPU = 'skypilot:cpu-ubuntu-2004' @@ -80,13 +87,34 @@ class Kubernetes(clouds.Cloud): PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT STATUS_VERSION = clouds.StatusVersion.SKYPILOT + @property + def ssh_key_secret_field_name(self): + # Use a fresh user hash to avoid conflicts in the secret object naming. + # This can happen when the controller is reusing the same user hash + # through USER_ID_ENV_VAR but has a different SSH key. + fresh_user_hash = common_utils.get_user_hash(force_fresh_hash=True) + return f'ssh-publickey-{fresh_user_hash}' + @classmethod def _unsupported_features_for_resources( cls, resources: 'resources_lib.Resources' ) -> Dict[clouds.CloudImplementationFeatures, str]: unsupported_features = cls._CLOUD_UNSUPPORTED_FEATURES + is_exec_auth, message = kubernetes_utils.is_kubeconfig_exec_auth() + if is_exec_auth: + assert isinstance(message, str), message + # Controllers cannot spin up new pods with exec auth. + unsupported_features[ + clouds.CloudImplementationFeatures.HOST_CONTROLLERS] = message + # Pod does not have permissions to terminate itself with exec auth. 
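+            # (Exec-based auth shells out to a local credential plugin,
+            # which is unavailable from inside the pod.)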
+ unsupported_features[ + clouds.CloudImplementationFeatures.AUTO_TERMINATE] = message return unsupported_features + @classmethod + def max_cluster_name_length(cls) -> Optional[int]: + return cls._MAX_CLUSTER_NAME_LEN_LIMIT + @classmethod def regions(cls) -> List[clouds.Region]: return cls._regions @@ -123,9 +151,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: def __repr__(self): return self._REPR - def is_same_cloud(self, other: clouds.Cloud) -> bool: - return isinstance(other, Kubernetes) - @classmethod def get_port(cls, svc_name) -> int: ns = kubernetes_utils.get_current_kube_config_context_namespace() @@ -255,6 +280,27 @@ def make_deploy_resources_variables( port_mode = network_utils.get_port_mode(None) + remote_identity = skypilot_config.get_nested( + ('kubernetes', 'remote_identity'), + schemas.get_default_remote_identity('kubernetes')) + if (remote_identity == + schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value): + # SA name doesn't matter since automounting credentials is disabled + k8s_service_account_name = 'default' + k8s_automount_sa_token = 'false' + elif (remote_identity == + schemas.RemoteIdentityOptions.SERVICE_ACCOUNT.value): + # Use the default service account + k8s_service_account_name = ( + kubernetes_utils.DEFAULT_SERVICE_ACCOUNT_NAME) + k8s_automount_sa_token = 'true' + else: + # User specified a custom service account + k8s_service_account_name = remote_identity + k8s_automount_sa_token = 'true' + + fuse_device_required = bool(resources.requires_fuse) + deploy_vars = { 'instance_type': resources.instance_type, 'custom_resources': custom_resources, @@ -271,6 +317,11 @@ def make_deploy_resources_variables( 'k8s_acc_label_value': k8s_acc_label_value, 'k8s_ssh_jump_name': self.SKY_SSH_JUMP_NAME, 'k8s_ssh_jump_image': ssh_jump_image, + 'k8s_service_account_name': k8s_service_account_name, + 'k8s_automount_sa_token': k8s_automount_sa_token, + 'k8s_fuse_device_required': fuse_device_required, + # Namespace to run the FUSE device manager in + 'k8s_skypilot_system_namespace': _SKYPILOT_SYSTEM_NAMESPACE, 'image_id': image_id, } @@ -326,30 +377,32 @@ def _make(instance_list): gpu_task_cpus, gpu_task_memory, acc_count, acc_type).name) # Check if requested instance type will fit in the cluster. - # TODO(romilb): This will fail early for autoscaling clusters. - fits, reason = kubernetes_utils.check_instance_fits( - chosen_instance_type) - if not fits: - logger.debug(f'Instance type {chosen_instance_type} does ' - 'not fit in the Kubernetes cluster. ' - f'Reason: {reason}') - return [], [] + autoscaler_type = kubernetes_utils.get_autoscaler_type() + if autoscaler_type is None: + # If autoscaler is not set, check if the instance type fits in the + # cluster. Else, rely on the autoscaler to provision the right + # instance type without running checks. Worst case, if autoscaling + # fails, the pod will be stuck in pending state until + # provision_timeout, after which failover will be triggered. + fits, reason = kubernetes_utils.check_instance_fits( + chosen_instance_type) + if not fits: + logger.debug(f'Instance type {chosen_instance_type} does ' + 'not fit in the Kubernetes cluster. 
'
+                                 f'Reason: {reason}')
+                    return [], []
 
         # No fuzzy lists for Kubernetes
         return _make([chosen_instance_type]), []
 
     @classmethod
     def check_credentials(cls) -> Tuple[bool, Optional[str]]:
-        if os.path.exists(os.path.expanduser(CREDENTIAL_PATH)):
-            # Test using python API
-            try:
-                return kubernetes_utils.check_credentials()
-            except Exception as e:  # pylint: disable=broad-except
-                return (False, 'Credential check failed: '
-                        f'{common_utils.format_exception(e)}')
-        else:
-            return (False, 'Credentials not found - '
-                    f'check if {CREDENTIAL_PATH} exists.')
+        # Test using python API
+        try:
+            return kubernetes_utils.check_credentials()
+        except Exception as e:  # pylint: disable=broad-except
+            return (False, 'Credential check failed: '
+                    f'{common_utils.format_exception(e)}')
 
     def get_credential_file_mounts(self) -> Dict[str, str]:
         if os.path.exists(os.path.expanduser(CREDENTIAL_PATH)):
@@ -388,3 +441,33 @@ def get_current_user_identity(cls) -> Optional[List[str]]:
             return [f'{cluster}_{user}_{namespace}']
         except k8s.config.config_exception.ConfigException:
             return None
+
+    @classmethod
+    def is_label_valid(cls, label_key: str,
+                       label_value: str) -> Tuple[bool, Optional[str]]:
+        # Kubernetes labels can be of the format <domain>/<key>: <value>
+        key_regex = re.compile(
+            # Look-ahead to ensure proper domain formatting up to a slash
+            r'^(?:(?=[a-z0-9]([-a-z0-9.]*[a-z0-9])?\/)'
+            # Match domain: starts and ends with alphanum up to 253 chars
+            # including a slash in the domain.
+            r'[a-z0-9]([-a-z0-9.]{0,251}[a-z0-9])?\/)?'
+            # Match key: starts and ends with alphanum, up to 63 chars.
+            r'[a-z0-9]([-a-z0-9_.]{0,61}[a-z0-9])?$')
+        value_regex = re.compile(
+            r'^([a-zA-Z0-9]([-a-zA-Z0-9_.]{0,61}[a-zA-Z0-9])?)?$')
+        key_valid = bool(key_regex.match(label_key))
+        value_valid = bool(value_regex.match(label_value))
+        error_msg = None
+        condition_msg = ('Value must consist of alphanumeric characters or '
+                         '\'-\', \'_\', \'.\', and must be no more than 63 '
+                         'characters in length.')
+        if not key_valid:
+            error_msg = (f'Invalid label key {label_key} for Kubernetes. '
+                         f'{condition_msg}')
+        if not value_valid:
+            error_msg = (f'Invalid label value {label_value} for Kubernetes. '
+                         f'{condition_msg}')
+        if not key_valid or not value_valid:
+            return False, error_msg
+        return True, None
diff --git a/sky/clouds/lambda_cloud.py b/sky/clouds/lambda_cloud.py
index 6f3e0bcea1d..979b4833354 100644
--- a/sky/clouds/lambda_cloud.py
+++ b/sky/clouds/lambda_cloud.py
@@ -45,6 +45,7 @@ class Lambda(clouds.Cloud):
         clouds.CloudImplementationFeatures.IMAGE_ID: f'Specifying image ID is not supported in {_REPR}.',
         clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER: f'Custom disk tiers are not supported in {_REPR}.',
         clouds.CloudImplementationFeatures.OPEN_PORTS: f'Opening ports is currently not supported on {_REPR}.',
+        clouds.CloudImplementationFeatures.HOST_CONTROLLERS: f'Host controllers are not supported in {_REPR}.',
     }
 
     @classmethod
@@ -120,10 +121,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float:
     def __repr__(self):
         return 'Lambda'
 
-    def is_same_cloud(self, other: clouds.Cloud) -> bool:
-        # Returns true if the two clouds are the same cloud type.
- return isinstance(other, Lambda) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index 03351fc4cf6..5fb0111bf01 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -160,10 +160,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: # return 0.0 return (num_gigabytes - 10 * 1024) * 0.0085 - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, OCI) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/paperspace.py b/sky/clouds/paperspace.py index f76772ab8b7..f67a9a27176 100644 --- a/sky/clouds/paperspace.py +++ b/sky/clouds/paperspace.py @@ -147,10 +147,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: def __repr__(self): return self._REPR - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, Paperspace) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/runpod.py b/sky/clouds/runpod.py index 527c45fafa0..c7a24e274dd 100644 --- a/sky/clouds/runpod.py +++ b/sky/clouds/runpod.py @@ -43,6 +43,8 @@ class RunPod(clouds.Cloud): ('Mounting object stores is not supported on RunPod. To read data ' 'from object stores on RunPod, use `mode: COPY` to copy the data ' 'to local disk.'), + clouds.CloudImplementationFeatures.HOST_CONTROLLERS: + ('Host controllers are not supported on RunPod.'), } _MAX_CLUSTER_NAME_LEN_LIMIT = 120 _regions: List[clouds.Region] = [] @@ -138,10 +140,6 @@ def accelerators_to_hourly_cost(self, def get_egress_cost(self, num_gigabytes: float) -> float: return 0.0 - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, RunPod) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/scp.py b/sky/clouds/scp.py index 1d6cb6cf20f..6a3daf2712a 100644 --- a/sky/clouds/scp.py +++ b/sky/clouds/scp.py @@ -144,10 +144,6 @@ def accelerators_to_hourly_cost(self, def get_egress_cost(self, num_gigabytes: float) -> float: return 0.0 - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, SCP) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/service_catalog/__init__.py b/sky/clouds/service_catalog/__init__.py index d380cce6757..7479cd77cf7 100644 --- a/sky/clouds/service_catalog/__init__.py +++ b/sky/clouds/service_catalog/__init__.py @@ -117,6 +117,46 @@ def list_accelerator_counts( return ret +def list_accelerator_realtime( + gpus_only: bool = True, + name_filter: Optional[str] = None, + region_filter: Optional[str] = None, + quantity_filter: Optional[int] = None, + clouds: CloudFilter = None, + case_sensitive: bool = True, +) -> Tuple[Dict[str, List[int]], Dict[str, int], Dict[str, int]]: + """List all accelerators offered by Sky with their realtime availability. + + Realtime availability is the total number of accelerators in the cluster + and number of accelerators available at the time of the call. + + Used for fixed size cluster settings, such as Kubernetes. + + Returns: + A tuple of three dictionaries mapping canonical accelerator names to: + - A list of available counts. (e.g., [1, 2, 4]) + - Total number of accelerators in the cluster (capacity). + - Number of accelerators available at the time of call (availability). 
+ """ + qtys_map, total_accelerators_capacity, total_accelerators_available = ( + _map_clouds_catalog(clouds, + 'list_accelerators_realtime', + gpus_only, + name_filter, + region_filter, + quantity_filter, + case_sensitive=case_sensitive, + all_regions=False, + require_price=False)) + accelerator_counts: Dict[str, List[int]] = collections.defaultdict(list) + for gpu, items in qtys_map.items(): + for item in items: + accelerator_counts[gpu].append(item.accelerator_count) + accelerator_counts[gpu] = sorted(accelerator_counts[gpu]) + return (accelerator_counts, total_accelerators_capacity, + total_accelerators_available) + + def instance_type_exists(instance_type: str, clouds: CloudFilter = None) -> bool: """Check the existence of a instance type.""" diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py b/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py index 3dda75e2f07..5d50399ab89 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py @@ -4,7 +4,6 @@ python fetch_fluidstack_cloud.py """ -import copy import csv import json import os @@ -19,6 +18,8 @@ GPU_MAP = { 'H100_PCIE_80GB': 'H100', + 'H100_NVLINK_80GB': 'H100', + 'A100_NVLINK_80GB': 'A100-80GB', 'A100_SXM4_80GB': 'A100-80GB', 'A100_PCIE_80GB': 'A100-80GB', 'A100_SXM4_40GB': 'A100', @@ -40,13 +41,6 @@ 'RTX_3080_10GB': 'RTX3080', } -CUSTOM_PLANS_CONFIG = [ - dict(gpu_count=1, cpu_count=8, nvme_storage=750, ram=64), - dict(gpu_count=2, cpu_count=16, nvme_storage=1024, ram=128), - #dict(gpu_count=4, cpu_count=32, nvme_storage=1200, ram=160), - #dict(gpu_count=8, cpu_count=64, nvme_storage=1500, ram=200) -] - def get_regions(plans: List) -> dict: """Return a list of regions where the plan is available.""" @@ -57,35 +51,9 @@ def get_regions(plans: List) -> dict: return regions -def plans_from_custom_plan(plan: dict) -> List[dict]: - prices = dict(cpu=plan['price']['cpu']['hourly'], - ram=plan['price']['ram']['hourly'], - storage=plan['price']['storage']['hourly'], - gpu=plan['price']['gpu']['hourly']) - new_plans = [] - for i, config in enumerate(CUSTOM_PLANS_CONFIG): - new_plan = copy.deepcopy(plan) - price = (prices['cpu'] * - config['cpu_count']) + (prices['ram'] * config['ram']) + ( - prices['storage'] * config['nvme_storage']) + ( - prices['gpu'] * config['gpu_count']) - new_plan['price']['hourly'] = price / config['gpu_count'] - new_plan['configuration']['core_count'] = config['cpu_count'] - new_plan['configuration']['ram'] = config['ram'] - new_plan['configuration']['gpu_count'] = config['gpu_count'] - new_plan['configuration']['nvme_storage'] = config['nvme_storage'] - new_plan['plan_id'] = f'custom:{i}:{plan["plan_id"]}' - new_plans.append(new_plan) - return new_plans - - def create_catalog(output_dir: str) -> None: response = requests.get(ENDPOINT) plans = response.json() - custom_plans = [ - plan for plan in plans if plan['minimum_commitment'] == 'hourly' and - plan['type'] in ['custom'] and plan['gpu_type'] != 'NO GPU' - ] #plans = [plan for plan in plans if len(plan['regions']) > 0] plans = [ plan for plan in plans if plan['minimum_commitment'] == 'hourly' and @@ -93,9 +61,6 @@ def create_catalog(output_dir: str) -> None: plan['gpu_type'] not in ['NO GPU', 'RTX_3080_10GB', 'RTX_3090_24GB'] ] - plans = plans + [ - plan for plan in custom_plans for plan in plans_from_custom_plan(plan) - ] with open(os.path.join(output_dir, 'vms.csv'), mode='w', encoding='utf-8') as f: writer = csv.writer(f, delimiter=',', 
quotechar='"') @@ -114,7 +79,7 @@ def create_catalog(output_dir: str) -> None: try: gpu = GPU_MAP[plan['gpu_type']] except KeyError: - print(f'Could not map {plan["gpu_type"]}') + #print(f'Could not map {plan["gpu_type"]}') continue gpu_memory = int( str(plan['configuration']['gpu_memory']).replace('GB', diff --git a/sky/clouds/service_catalog/kubernetes_catalog.py b/sky/clouds/service_catalog/kubernetes_catalog.py index bd44847016e..a64aa8f72e9 100644 --- a/sky/clouds/service_catalog/kubernetes_catalog.py +++ b/sky/clouds/service_catalog/kubernetes_catalog.py @@ -3,6 +3,7 @@ Kubernetes does not require a catalog of instances, but we need an image catalog mapping SkyPilot image tags to corresponding container image tags. """ +import re import typing from typing import Dict, List, Optional, Set, Tuple @@ -46,38 +47,107 @@ def list_accelerators( case_sensitive: bool = True, all_regions: bool = False, require_price: bool = True) -> Dict[str, List[common.InstanceTypeInfo]]: + # TODO(romilb): We should consider putting a lru_cache() with TTL to + # avoid multiple calls to kubernetes API in a short period of time (e.g., + # from the optimizer). + return list_accelerators_realtime(gpus_only, name_filter, region_filter, + quantity_filter, case_sensitive, + all_regions, require_price)[0] + + +def list_accelerators_realtime( + gpus_only: bool, + name_filter: Optional[str], + region_filter: Optional[str], + quantity_filter: Optional[int], + case_sensitive: bool = True, + all_regions: bool = False, + require_price: bool = True +) -> Tuple[Dict[str, List[common.InstanceTypeInfo]], Dict[str, int], Dict[str, + int]]: del all_regions, require_price # Unused. k8s_cloud = Kubernetes() if not any( map(k8s_cloud.is_same_cloud, sky_check.get_cached_enabled_clouds_or_refresh()) ) or not kubernetes_utils.check_credentials()[0]: - return {} + return {}, {}, {} has_gpu = kubernetes_utils.detect_gpu_resource() if not has_gpu: - return {} + return {}, {}, {} label_formatter, _ = kubernetes_utils.detect_gpu_label_formatter() if not label_formatter: - return {} + return {}, {}, {} - accelerators: Set[Tuple[str, int]] = set() + accelerators_qtys: Set[Tuple[str, int]] = set() key = label_formatter.get_label_key() nodes = kubernetes_utils.get_kubernetes_nodes() + # Get the pods to get the real-time GPU usage + pods = kubernetes_utils.get_kubernetes_pods() + # Total number of GPUs in the cluster + total_accelerators_capacity: Dict[str, int] = {} + # Total number of GPUs currently available in the cluster + total_accelerators_available: Dict[str, int] = {} + min_quantity_filter = quantity_filter if quantity_filter else 1 + for node in nodes: if key in node.metadata.labels: + allocated_qty = 0 accelerator_name = label_formatter.get_accelerator_from_label_value( node.metadata.labels.get(key)) + + # Check if name_filter regex matches the accelerator_name + regex_flags = 0 if case_sensitive else re.IGNORECASE + if name_filter and not re.match( + name_filter, accelerator_name, flags=regex_flags): + continue + accelerator_count = int( node.status.allocatable.get('nvidia.com/gpu', 0)) + # Generate the GPU quantities for the accelerators if accelerator_name and accelerator_count > 0: for count in range(1, accelerator_count + 1): - accelerators.add((accelerator_name, count)) + accelerators_qtys.add((accelerator_name, count)) + + for pod in pods: + # Get all the pods running on the node + if (pod.spec.node_name == node.metadata.name and + pod.status.phase in ['Running', 'Pending']): + # Iterate over all the containers in the pod and 
sum the
+                    # GPU requests
+                    for container in pod.spec.containers:
+                        if container.resources.requests:
+                            allocated_qty += int(
+                                container.resources.requests.get(
+                                    'nvidia.com/gpu', 0))
+
+            accelerators_available = accelerator_count - allocated_qty
+
+            if accelerator_count >= min_quantity_filter:
+                quantized_count = (min_quantity_filter *
+                                   (accelerator_count // min_quantity_filter))
+                if accelerator_name not in total_accelerators_capacity:
+                    total_accelerators_capacity[
+                        accelerator_name] = quantized_count
+                else:
+                    total_accelerators_capacity[
+                        accelerator_name] += quantized_count
+
+            if accelerator_name not in total_accelerators_available:
+                total_accelerators_available[accelerator_name] = 0
+            if accelerators_available >= min_quantity_filter:
+                quantized_availability = min_quantity_filter * (
+                    accelerators_available // min_quantity_filter)
+                total_accelerators_available[
+                    accelerator_name] += quantized_availability
 
     result = []
-    for accelerator_name, accelerator_count in accelerators:
+
+    # Generate dataframe for common.list_accelerators_impl
+    for accelerator_name, accelerator_count in accelerators_qtys:
         result.append(
             common.InstanceTypeInfo(cloud='Kubernetes',
                                     instance_type=None,
@@ -98,9 +168,13 @@ def list_accelerators(
         ])
     df['GpuInfo'] = True
 
-    return common.list_accelerators_impl('Kubernetes', df, gpus_only,
-                                         name_filter, region_filter,
-                                         quantity_filter, case_sensitive)
+    # Use common.list_accelerators_impl to get InstanceTypeInfo objects used
+    # by sky show-gpus when cloud is not specified.
+    qtys_map = common.list_accelerators_impl('Kubernetes', df, gpus_only,
+                                             name_filter, region_filter,
+                                             quantity_filter, case_sensitive)
+
+    return qtys_map, total_accelerators_capacity, total_accelerators_available
 
 
 def validate_region_zone(
diff --git a/sky/clouds/service_catalog/oci_catalog.py b/sky/clouds/service_catalog/oci_catalog.py
index 2a34e739509..2561b913dcf 100644
--- a/sky/clouds/service_catalog/oci_catalog.py
+++ b/sky/clouds/service_catalog/oci_catalog.py
@@ -40,7 +40,7 @@ def _get_df() -> 'pd.DataFrame':
 
     df = common.read_catalog('oci/vms.csv')
     try:
-        oci_adaptor.oci
+        oci_adaptor.oci.load_module()
     except ImportError:
         _df = df
         return _df
diff --git a/sky/clouds/vsphere.py b/sky/clouds/vsphere.py
index 02a794d7d58..872b8df9d70 100644
--- a/sky/clouds/vsphere.py
+++ b/sky/clouds/vsphere.py
@@ -136,10 +136,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float:
     def __repr__(self):
         return 'vSphere'
 
-    def is_same_cloud(self, other: clouds.Cloud) -> bool:
-        # Returns true if the two clouds are the same cloud type.
-        return isinstance(other, Vsphere)
-
     @classmethod
     def get_default_instance_type(
         cls,
diff --git a/sky/core.py b/sky/core.py
index c93a50f0b7d..b1006fe19ab 100644
--- a/sky/core.py
+++ b/sky/core.py
@@ -109,6 +109,26 @@ def status(cluster_names: Optional[Union[str, List[str]]] = None,
                                       cluster_names=cluster_names)
 
 
+def endpoints(cluster: str,
+              port: Optional[Union[int, str]] = None) -> Dict[int, str]:
+    """Gets the endpoint(s) for a given cluster and an optional port number.
+
+    Args:
+        cluster: The name of the cluster.
+        port: The port number to get the endpoint for. If None, endpoints
+            for all ports are returned.
+
+    Returns: A dictionary of port numbers to endpoints. If port is None,
+        the dictionary will contain all ports:endpoints exposed on the cluster.
+
+    Raises:
+        ValueError: if the cluster is not UP or the endpoint is not exposed.
+        RuntimeError: if the cluster has no ports to be exposed or no endpoints
+            are exposed yet.
+ """ + return backend_utils.get_endpoints(cluster=cluster, port=port) + + @usage_lib.entrypoint def cost_report() -> List[Dict[str, Any]]: # NOTE(dev): Keep the docstring consistent between the Python API and CLI. @@ -196,8 +216,6 @@ def _start( idle_minutes_to_autostop = ( constants.CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP) - # NOTE: if spot_queue() calls _start() and hits here, that entrypoint - # would have a cluster name (the controller) filled in. usage_lib.record_cluster_name_for_current_operation(cluster_name) with dag.Dag(): @@ -264,7 +282,7 @@ def start( ValueError: argument values are invalid: (1) the specified cluster does not exist; (2) if ``down`` is set to True but ``idle_minutes_to_autostop`` is None; (3) if the specified cluster is - the managed spot controller, and either ``idle_minutes_to_autostop`` + the managed jobs controller, and either ``idle_minutes_to_autostop`` is not None or ``down`` is True (omit them to use the default autostop settings). sky.exceptions.NotSupportedError: if the cluster to restart was @@ -317,7 +335,7 @@ def stop(cluster_name: str, purge: bool = False) -> None: ValueError: the specified cluster does not exist. RuntimeError: failed to stop the cluster. sky.exceptions.NotSupportedError: if the specified cluster is a spot - cluster, or a TPU VM Pod cluster, or the managed spot controller. + cluster, or a TPU VM Pod cluster, or the managed jobs controller. """ if controller_utils.Controllers.from_name(cluster_name) is not None: raise exceptions.NotSupportedError( @@ -372,7 +390,7 @@ def down(cluster_name: str, purge: bool = False) -> None: ValueError: the specified cluster does not exist. RuntimeError: failed to tear down the cluster. sky.exceptions.NotSupportedError: the specified cluster is the managed - spot controller. + jobs controller. """ handle = global_user_state.get_handle_from_cluster_name(cluster_name) if handle is None: @@ -470,6 +488,19 @@ def autostop( f' {_stop_not_supported_message(handle.launched_resources)}.' ) from e + # Check if autodown is required and supported + if not is_cancel: + try: + cloud.check_features_are_supported( + handle.launched_resources, + {clouds.CloudImplementationFeatures.AUTO_TERMINATE}) + except exceptions.NotSupportedError as e: + raise exceptions.NotSupportedError( + f'{colorama.Fore.YELLOW}{operation} on cluster ' + f'{cluster_name!r}...skipped.{colorama.Style.RESET_ALL}\n' + f' Auto{option_str} is not supported on {cloud!r} - ' + f'see reason above.') from e + usage_lib.record_cluster_name_for_current_operation(cluster_name) backend.set_autostop(handle, idle_minutes, down) @@ -559,7 +590,7 @@ def cancel( Additional arguments: _try_cancel_if_cluster_is_init: (bool) whether to try cancelling the job even if the cluster is not UP, but the head node is still alive. - This is used by the spot controller to cancel the job when the + This is used by the jobs controller to cancel the job when the worker node is preempted in the spot cluster. 
Raises: diff --git a/sky/data/mounting_utils.py b/sky/data/mounting_utils.py index 2f4e37a1b66..d445d3d67c5 100644 --- a/sky/data/mounting_utils.py +++ b/sky/data/mounting_utils.py @@ -4,6 +4,7 @@ from typing import Optional from sky import exceptions +from sky.utils import command_runner # Values used to construct mounting commands _STAT_CACHE_TTL = '5s' @@ -11,7 +12,7 @@ _TYPE_CACHE_TTL = '5s' _RENAME_DIR_LIMIT = 10000 # https://github.com/GoogleCloudPlatform/gcsfuse/releases -GCSFUSE_VERSION = '1.3.0' +GCSFUSE_VERSION = '2.2.0' def get_s3_mount_install_cmd() -> str: @@ -129,6 +130,8 @@ def get_mounting_script( script = textwrap.dedent(f""" #!/usr/bin/env bash set -e + + {command_runner.ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD} MOUNT_PATH={mount_path} MOUNT_BINARY={mount_binary} diff --git a/sky/data/storage.py b/sky/data/storage.py index 06cdcbca62c..e43406c3951 100644 --- a/sky/data/storage.py +++ b/sky/data/storage.py @@ -110,15 +110,24 @@ class StoreType(enum.Enum): IBM = 'IBM' @classmethod - def from_cloud(cls, cloud: clouds.Cloud) -> 'StoreType': - if isinstance(cloud, clouds.AWS): + def from_cloud(cls, cloud: str) -> 'StoreType': + if cloud.lower() == str(clouds.AWS()).lower(): return StoreType.S3 - elif isinstance(cloud, clouds.GCP): + elif cloud.lower() == str(clouds.GCP()).lower(): return StoreType.GCS - elif isinstance(cloud, clouds.Azure): - return StoreType.AZURE - elif isinstance(cloud, clouds.IBM): + elif cloud.lower() == str(clouds.IBM()).lower(): return StoreType.IBM + elif cloud.lower() == cloudflare.NAME.lower(): + return StoreType.R2 + elif cloud.lower() == str(clouds.Azure()).lower(): + with ux_utils.print_exception_no_traceback(): + raise ValueError('Azure Blob Storage is not supported yet.') + elif cloud.lower() == str(clouds.Lambda()).lower(): + with ux_utils.print_exception_no_traceback(): + raise ValueError('Lambda Cloud does not provide cloud storage.') + elif cloud.lower() == str(clouds.SCP()).lower(): + with ux_utils.print_exception_no_traceback(): + raise ValueError('SCP does not provide cloud storage.') raise ValueError(f'Unsupported cloud for StoreType: {cloud}') @@ -136,51 +145,28 @@ def from_store(cls, store: 'AbstractStore') -> 'StoreType': with ux_utils.print_exception_no_traceback(): raise ValueError(f'Unknown store type: {store}') + def store_prefix(self) -> str: + if self == StoreType.S3: + return 's3://' + elif self == StoreType.GCS: + return 'gs://' + elif self == StoreType.R2: + return 'r2://' + elif self == StoreType.IBM: + return 'cos://' + elif self == StoreType.AZURE: + with ux_utils.print_exception_no_traceback(): + raise ValueError('Azure Blob Storage is not supported yet.') + else: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Unknown store type: {self}') + class StorageMode(enum.Enum): MOUNT = 'MOUNT' COPY = 'COPY' -def get_storetype_from_cloud(cloud: clouds.Cloud) -> StoreType: - if isinstance(cloud, clouds.AWS): - return StoreType.S3 - elif isinstance(cloud, clouds.GCP): - return StoreType.GCS - elif isinstance(cloud, clouds.IBM): - return StoreType.IBM - elif isinstance(cloud, clouds.Azure): - with ux_utils.print_exception_no_traceback(): - raise ValueError('Azure Blob Storage is not supported yet.') - elif isinstance(cloud, clouds.Lambda): - with ux_utils.print_exception_no_traceback(): - raise ValueError('Lambda Cloud does not provide cloud storage.') - elif isinstance(cloud, clouds.SCP): - with ux_utils.print_exception_no_traceback(): - raise ValueError('SCP does not provide cloud storage.') - else: - with 
ux_utils.print_exception_no_traceback(): - raise ValueError(f'Unknown cloud type: {cloud}') - - -def get_store_prefix(storetype: StoreType) -> str: - if storetype == StoreType.S3: - return 's3://' - elif storetype == StoreType.GCS: - return 'gs://' - # R2 storages use 's3://' as a prefix for various aws cli commands - elif storetype == StoreType.R2: - return 's3://' - elif storetype == StoreType.IBM: - return 'cos://' - elif storetype == StoreType.AZURE: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Azure Blob Storage is not supported yet.') - else: - with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Unknown store type: {storetype}') - - class AbstractStore: """AbstractStore abstracts away the different storage types exposed by different clouds. @@ -353,7 +339,7 @@ def _validate_existing_bucket(self): # bucket's URL as 'source'. if handle is None: with ux_utils.print_exception_no_traceback(): - store_prefix = get_store_prefix(StoreType.from_store(self)) + store_prefix = StoreType.from_store(self).store_prefix() raise exceptions.StorageSpecError( 'Attempted to mount a non-sky managed bucket ' f'{self.name!r} without specifying the storage source.' diff --git a/sky/exceptions.py b/sky/exceptions.py index 131b4675399..4fced20ce4e 100644 --- a/sky/exceptions.py +++ b/sky/exceptions.py @@ -52,11 +52,12 @@ class InvalidCloudConfigs(Exception): class ProvisionPrechecksError(Exception): - """Raised when a spot job fails prechecks before provision. + """Raised when a managed job fails prechecks before provision. + Developer note: For now this should only be used by managed - spot code path (technically, this can/should be raised by the + jobs code path (technically, this can/should be raised by the lower-level sky.launch()). Please refer to the docstring of - `spot.recovery_strategy._launch` for more details about when + `jobs.recovery_strategy._launch` for more details about when the error will be raised. Args: @@ -68,11 +69,11 @@ def __init__(self, reasons: List[Exception]) -> None: self.reasons = list(reasons) -class SpotJobReachedMaxRetriesError(Exception): - """Raised when a spot job fails to be launched after maximum retries. +class ManagedJobReachedMaxRetriesError(Exception): + """Raised when a managed job fails to be launched after maximum retries. - Developer note: For now this should only be used by managed spot code - path. Please refer to the docstring of `spot.recovery_strategy._launch` + Developer note: For now this should only be used by managed jobs code + path. Please refer to the docstring of `jobs.recovery_strategy._launch` for more details about when the error will be raised. 
""" pass @@ -189,8 +190,8 @@ class StorageExternalDeletionError(StorageBucketGetError): pass -class FetchIPError(Exception): - """Raised when fetching the IP fails.""" +class FetchClusterInfoError(Exception): + """Raised when fetching the cluster info fails.""" class Reason(enum.Enum): HEAD = 'HEAD' @@ -211,8 +212,8 @@ class ClusterStatusFetchingError(Exception): pass -class SpotUserCancelledError(Exception): - """Raised when a spot user cancels the job.""" +class ManagedJobUserCancelledError(Exception): + """Raised when a user cancels a managed job.""" pass diff --git a/sky/execution.py b/sky/execution.py index 25f0d8cc7a8..b1fb4ec4164 100644 --- a/sky/execution.py +++ b/sky/execution.py @@ -108,7 +108,7 @@ def _execute( clone_disk_from: Optional[str] = None, # Internal only: # pylint: disable=invalid-name - _is_launched_by_spot_controller: bool = False, + _is_launched_by_jobs_controller: bool = False, _is_launched_by_sky_serve_controller: bool = False, ) -> Tuple[Optional[int], Optional[backends.ResourceHandle]]: """Execute an entrypoint. @@ -160,11 +160,11 @@ def _execute( assert len(dag) == 1, f'We support 1 task for now. {dag}' task = dag.tasks[0] - if task.need_spot_recovery: + if any(r.job_recovery is not None for r in task.resources): with ux_utils.print_exception_no_traceback(): raise ValueError( - 'Spot recovery is specified in the task. To launch the ' - 'managed spot job, please use: sky spot launch') + 'Job recovery is specified in the task. To launch a ' + 'managed job, please use: sky jobs launch') cluster_exists = False if cluster_name is not None: @@ -207,8 +207,12 @@ def _execute( f'{colorama.Style.RESET_ALL}') idle_minutes_to_autostop = 1 stages.remove(Stage.DOWN) - if not down: - requested_features.add(clouds.CloudImplementationFeatures.STOP) + if idle_minutes_to_autostop >= 0: + requested_features.add( + clouds.CloudImplementationFeatures.AUTO_TERMINATE) + if not down: + requested_features.add( + clouds.CloudImplementationFeatures.STOP) # NOTE: in general we may not have sufficiently specified info # (cloud/resource) to check STOP_SPOT_INSTANCE here. This is checked in # the backend. @@ -225,10 +229,10 @@ def _execute( task) if not cluster_exists: - # If spot is launched by skyserve controller or managed spot controller, - # We don't need to print out the logger info. + # If spot is launched on serve or jobs controller, we don't need to + # print out the hint. if (Stage.PROVISION in stages and task.use_spot and - not _is_launched_by_spot_controller and + not _is_launched_by_jobs_controller and not _is_launched_by_sky_serve_controller): yellow = colorama.Fore.YELLOW bold = colorama.Style.BRIGHT @@ -236,9 +240,9 @@ def _execute( logger.info( f'{yellow}Launching an unmanaged spot task, which does not ' f'automatically recover from preemptions.{reset}\n{yellow}To ' - 'get automatic recovery, use managed spot instead: ' - f'{reset}{bold}sky spot launch{reset} {yellow}or{reset} ' - f'{bold}sky.spot.launch(){reset}.') + 'get automatic recovery, use managed job instead: ' + f'{reset}{bold}sky jobs launch{reset} {yellow}or{reset} ' + f'{bold}sky.jobs.launch(){reset}.') if Stage.OPTIMIZE in stages: if task.best_resources is None: @@ -318,10 +322,10 @@ def _execute( if controller is None and not _is_launched_by_sky_serve_controller: # UX: print live clusters to make users aware (to save costs). 
#
-            # Don't print if this job is launched by the spot controller,
-            # because spot jobs are serverless, there can be many of them, and
-            # users tend to continuously monitor spot jobs using `sky spot
-            # status`. Also don't print if this job is a skyserve controller
+            # Don't print if this job is launched by the jobs controller,
+            # because managed jobs are serverless, there can be many of them,
+            # and users tend to continuously monitor managed jobs using `sky
+            # jobs queue`. Also don't print if this job is a skyserve controller
             # job or launched by a skyserve controller job, because the
             # redirect for this subprocess.run won't success and it will
             # pollute the controller logs.
@@ -330,7 +334,7 @@ def _execute(
             env = dict(os.environ,
                        **{env_options.Options.DISABLE_LOGGING.value: '1'})
             subprocess_utils.run(
-                'sky status --no-show-spot-jobs --no-show-services', env=env)
+                'sky status --no-show-managed-jobs --no-show-services', env=env)
             print()
             print('\x1b[?25h', end='')  # Show cursor.
     return job_id, handle
@@ -354,7 +358,7 @@ def launch(
     clone_disk_from: Optional[str] = None,
     # Internal only:
     # pylint: disable=invalid-name
-    _is_launched_by_spot_controller: bool = False,
+    _is_launched_by_jobs_controller: bool = False,
     _is_launched_by_sky_serve_controller: bool = False,
     _disable_controller_check: bool = False,
 ) -> Tuple[Optional[int], Optional[backends.ResourceHandle]]:
@@ -464,7 +468,7 @@ def launch(
         idle_minutes_to_autostop=idle_minutes_to_autostop,
         no_setup=no_setup,
         clone_disk_from=clone_disk_from,
-        _is_launched_by_spot_controller=_is_launched_by_spot_controller,
+        _is_launched_by_jobs_controller=_is_launched_by_jobs_controller,
         _is_launched_by_sky_serve_controller=
         _is_launched_by_sky_serve_controller,
     )
diff --git a/sky/spot/README.md b/sky/jobs/README.md
similarity index 52%
rename from sky/spot/README.md
rename to sky/jobs/README.md
index ed20a77b46e..579f675a5f9 100644
--- a/sky/spot/README.md
+++ b/sky/jobs/README.md
@@ -1,11 +1,11 @@
-# SkyPilot Managed Spot
+# SkyPilot Managed Jobs
 
-This module is used for running user jobs on spot clusters, which automatically recovers the job from preemptions.
+This module is used for running and managing user jobs; it automatically recovers failed jobs from spot preemptions and/or machine failures.
 
 ## Concepts
 
-- Task: A task (sky.Task) is a unit of work. SkyPilot will launch a spot cluster to run the task, automatically recover the task from preemptions, and terminate the cluster when the task is done.
-- Job: A job in the context of SkyPilot managed spot, is equivalent to a SkyPilot DAG (sky.Dag). A job is a collection of tasks that are executed in a specific order based on the dependencies between the tasks. Each controller process will be in charge of the whole lifecycle of a job.
+- Task: A task (sky.Task) is a unit of work. SkyPilot will launch a cluster to run the task, automatically recover the task from preemptions, and terminate the cluster when the task is done.
+- Job: A job, in the context of SkyPilot managed jobs, is equivalent to a SkyPilot DAG (sky.Dag). A job is a collection of tasks that are executed in a specific order based on the dependencies between the tasks. Each controller process will be in charge of the whole lifecycle of a job.
 
 Note that for singleton (1-task) jobs, we will use the term "task" and "job" interchangeably.
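For reference, a minimal sketch of driving the renamed API from Python; the task name, command, and resources below are illustrative (not part of this diff), and `sky.jobs.launch` is the entrypoint exported by the new `sky/jobs/__init__.py` later in this diff:

```python
import sky

# A single-task managed job; the name and command are hypothetical.
task = sky.Task(name='train', run='python train.py')
task.set_resources(sky.Resources(accelerators='A100:1', use_spot=True))

# Formerly sky.spot.launch(); one controller process owns the whole
# lifecycle of this job, including recovery from preemptions/failures.
sky.jobs.launch(task, name='train')
```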
@@ -14,6 +14,6 @@ A job of n tasks (experimental; we support a pipeline of such tasks only): the j ## Architecture -![Architecture](../../docs/source/images/spot-controller.png) - +![Architecture](../../docs/source/images/managed-jobs-arch.png) + diff --git a/sky/jobs/__init__.py b/sky/jobs/__init__.py new file mode 100644 index 00000000000..922bb613ff7 --- /dev/null +++ b/sky/jobs/__init__.py @@ -0,0 +1,43 @@ +"""Managed jobs.""" +import pathlib + +from sky.jobs.constants import JOBS_CLUSTER_NAME_PREFIX_LENGTH +from sky.jobs.constants import JOBS_CONTROLLER_TEMPLATE +from sky.jobs.constants import JOBS_CONTROLLER_YAML_PREFIX +from sky.jobs.constants import JOBS_TASK_YAML_PREFIX +from sky.jobs.core import cancel +from sky.jobs.core import launch +from sky.jobs.core import queue +from sky.jobs.core import tail_logs +from sky.jobs.recovery_strategy import DEFAULT_RECOVERY_STRATEGY +from sky.jobs.recovery_strategy import RECOVERY_STRATEGIES +from sky.jobs.state import ManagedJobStatus +from sky.jobs.utils import dump_managed_job_queue +from sky.jobs.utils import format_job_table +from sky.jobs.utils import JOB_CONTROLLER_NAME +from sky.jobs.utils import load_managed_job_queue +from sky.jobs.utils import ManagedJobCodeGen + +pathlib.Path(JOBS_TASK_YAML_PREFIX).expanduser().parent.mkdir(parents=True, + exist_ok=True) +__all__ = [ + 'RECOVERY_STRATEGIES', + 'DEFAULT_RECOVERY_STRATEGY', + 'JOB_CONTROLLER_NAME', + # Constants + 'JOBS_CONTROLLER_TEMPLATE', + 'JOBS_CONTROLLER_YAML_PREFIX', + 'JOBS_TASK_YAML_PREFIX', + # Enums + 'ManagedJobStatus', + # Core + 'cancel', + 'launch', + 'queue', + 'tail_logs', + # utils + 'ManagedJobCodeGen', + 'format_job_table', + 'dump_managed_job_queue', + 'load_managed_job_queue', +] diff --git a/sky/jobs/constants.py b/sky/jobs/constants.py new file mode 100644 index 00000000000..d5f32908317 --- /dev/null +++ b/sky/jobs/constants.py @@ -0,0 +1,27 @@ +"""Constants used for Managed Jobs.""" + +JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2' +JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller' + +JOBS_TASK_YAML_PREFIX = '~/.sky/managed_jobs' + +# Resources as a dict for the jobs controller. +# Use default CPU instance type for jobs controller with >= 24GB, i.e. +# m6i.2xlarge (8vCPUs, 32 GB) for AWS, Standard_D8s_v4 (8vCPUs, 32 GB) +# for Azure, and n1-standard-8 (8 vCPUs, 32 GB) for GCP, etc. +# Based on profiling, memory should be at least 3x (in GB) as num vCPUs to avoid +# OOM (each vCPU can have 4 jobs controller processes as we set the CPU +# requirement to 0.25, and 3 GB is barely enough for 4 job processes). +# We use 50 GB disk size to reduce the cost. +CONTROLLER_RESOURCES = {'cpus': '8+', 'memory': '3x', 'disk_size': 50} + +# Max length of the cluster name for GCP is 35, the user hash to be attached is +# 4+1 chars, and we assume the maximum length of the job id is 4+1, so the max +# length of the cluster name prefix is 25 to avoid the cluster name being too +# long and truncated twice during the cluster creation. +JOBS_CLUSTER_NAME_PREFIX_LENGTH = 25 + +# The version of the lib files that jobs/utils use. Whenever there is an API +# change for the jobs/utils, we need to bump this version and update +# job.utils.ManagedJobCodeGen to handle the version update. 
+MANAGED_JOBS_VERSION = 1 diff --git a/sky/spot/controller.py b/sky/jobs/controller.py similarity index 65% rename from sky/spot/controller.py rename to sky/jobs/controller.py index 8a5a4cceb07..39c89d2784b 100644 --- a/sky/spot/controller.py +++ b/sky/jobs/controller.py @@ -1,4 +1,4 @@ -"""Controller: handles the life cycle of a managed spot cluster (job).""" +"""Controller: handles the life cycle of a managed job.""" import argparse import multiprocessing import os @@ -15,11 +15,11 @@ from sky import status_lib from sky.backends import backend_utils from sky.backends import cloud_vm_ray_backend +from sky.jobs import recovery_strategy +from sky.jobs import state as managed_job_state +from sky.jobs import utils as managed_job_utils from sky.skylet import constants from sky.skylet import job_lib -from sky.spot import recovery_strategy -from sky.spot import spot_state -from sky.spot import spot_utils from sky.usage import usage_lib from sky.utils import common_utils from sky.utils import controller_utils @@ -31,9 +31,9 @@ import sky # Use the explicit logger name so that the logger is under the -# `sky.spot.controller` namespace when executed directly, so as +# `sky.jobs.controller` namespace when executed directly, so as # to inherit the setup from the `sky` logger. -logger = sky_logging.init_logger('sky.spot.controller') +logger = sky_logging.init_logger('sky.jobs.controller') def _get_dag_and_name(dag_yaml: str) -> Tuple['sky.Dag', str]: @@ -43,8 +43,8 @@ def _get_dag_and_name(dag_yaml: str) -> Tuple['sky.Dag', str]: return dag, dag_name -class SpotController: - """Each spot controller manages the life cycle of one spot job.""" +class JobsController: + """Each jobs controller manages the life cycle of one managed job.""" def __init__(self, job_id: int, dag_yaml: str, retry_until_up: bool) -> None: @@ -60,9 +60,15 @@ def __init__(self, job_id: int, dag_yaml: str, # the user can have the same id for multiple recoveries. # Example value: sky-2022-10-04-22-46-52-467694_my-spot-name_spot_id-17-0 job_id_env_vars = [] - for i in range(len(self._dag.tasks)): - task_name = self._dag_name - if len(self._dag.tasks) > 1: + for i, task in enumerate(self._dag.tasks): + if len(self._dag.tasks) <= 1: + task_name = self._dag_name + else: + task_name = task.name + # This is guaranteed by the spot_launch API, where we fill in + # the task.name with + # dag_utils.maybe_infer_and_fill_dag_and_task_names. + assert task_name is not None, self._dag task_name = f'{self._dag_name}_{task_name}' job_id_env_var = common_utils.get_global_job_id( self._backend.run_timestamp, @@ -82,23 +88,23 @@ def __init__(self, job_id: int, dag_yaml: str, def _download_log_and_stream( self, handle: cloud_vm_ray_backend.CloudVmRayResourceHandle) -> None: - """Downloads and streams the logs of the latest job of a spot cluster. + """Downloads and streams the logs of the latest job. - We do not stream the logs from the spot cluster directly, as the + We do not stream the logs from the cluster directly, as the donwload and stream should be faster, and more robust against preemptions or ssh disconnection during the streaming. 
""" - spot_job_logs_dir = os.path.join(constants.SKY_LOGS_DIRECTORY, - 'spot_jobs') + managed_job_logs_dir = os.path.join(constants.SKY_LOGS_DIRECTORY, + 'managed_jobs') controller_utils.download_and_stream_latest_job_log( - self._backend, handle, spot_job_logs_dir) + self._backend, handle, managed_job_logs_dir) logger.info(f'\n== End of logs (ID: {self._job_id}) ==') def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: - """Busy loop monitoring spot cluster status and handling recovery. + """Busy loop monitoring cluster status and handling recovery. When the task is successfully completed, this function returns True, - and will terminate the spot cluster before returning. + and will terminate the cluster before returning. If the user program fails, i.e. the task is set to FAILED or FAILED_SETUP, this function will return False. @@ -124,28 +130,28 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: due to: 1. Any of the underlying failover exceptions is due to resources unavailability. - 2. The cluster is preempted before the job is submitted. + 2. The cluster is preempted or failed before the job is + submitted. 3. Any unexpected error happens during the `sky.launch`. Other exceptions may be raised depending on the backend. """ - callback_func = spot_utils.event_callback_func(job_id=self._job_id, - task_id=task_id, - task=task) + callback_func = managed_job_utils.event_callback_func( + job_id=self._job_id, task_id=task_id, task=task) if task.run is None: logger.info(f'Skip running task {task_id} ({task.name}) due to its ' 'run commands being empty.') # Call set_started first to initialize columns in the state table, # including start_at and last_recovery_at to avoid issues for # uninitialized columns. - spot_state.set_started(job_id=self._job_id, - task_id=task_id, - start_time=time.time(), - callback_func=callback_func) - spot_state.set_succeeded(job_id=self._job_id, - task_id=task_id, - end_time=time.time(), - callback_func=callback_func) + managed_job_state.set_started(job_id=self._job_id, + task_id=task_id, + start_time=time.time(), + callback_func=callback_func) + managed_job_state.set_succeeded(job_id=self._job_id, + task_id=task_id, + end_time=time.time(), + callback_func=callback_func) return True usage_lib.messages.usage.update_task_id(task_id) task_id_env_var = task.envs[constants.TASK_ID_ENV_VAR] @@ -153,64 +159,65 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: if task_id == 0: submitted_at = backend_utils.get_timestamp_from_run_timestamp( self._backend.run_timestamp) - spot_state.set_submitted( + managed_job_state.set_submitted( self._job_id, task_id, self._backend.run_timestamp, submitted_at, - resources_str=backend_utils.get_task_resources_str(task), + resources_str=backend_utils.get_task_resources_str( + task, is_managed_job=True), callback_func=callback_func) logger.info( - f'Submitted spot job {self._job_id} (task: {task_id}, name: ' + f'Submitted managed job {self._job_id} (task: {task_id}, name: ' f'{task.name!r}); {constants.TASK_ID_ENV_VAR}: {task_id_env_var}') assert task.name is not None, task - cluster_name = spot_utils.generate_spot_cluster_name( + cluster_name = managed_job_utils.generate_managed_job_cluster_name( task.name, self._job_id) self._strategy_executor = recovery_strategy.StrategyExecutor.make( cluster_name, self._backend, task, self._retry_until_up) logger.info('Started monitoring.') - spot_state.set_starting(job_id=self._job_id, - task_id=task_id, - callback_func=callback_func) + 
managed_job_state.set_starting(job_id=self._job_id, + task_id=task_id, + callback_func=callback_func) remote_job_submitted_at = self._strategy_executor.launch() assert remote_job_submitted_at is not None, remote_job_submitted_at - spot_state.set_started(job_id=self._job_id, - task_id=task_id, - start_time=remote_job_submitted_at, - callback_func=callback_func) + managed_job_state.set_started(job_id=self._job_id, + task_id=task_id, + start_time=remote_job_submitted_at, + callback_func=callback_func) while True: - time.sleep(spot_utils.JOB_STATUS_CHECK_GAP_SECONDS) + time.sleep(managed_job_utils.JOB_STATUS_CHECK_GAP_SECONDS) # Check the network connection to avoid false alarm for job failure. # Network glitch was observed even in the VM. try: backend_utils.check_network_connection() except exceptions.NetworkError: - logger.info( - 'Network is not available. Retrying again in ' - f'{spot_utils.JOB_STATUS_CHECK_GAP_SECONDS} seconds.') + logger.info('Network is not available. Retrying again in ' + f'{managed_job_utils.JOB_STATUS_CHECK_GAP_SECONDS} ' + 'seconds.') continue # NOTE: we do not check cluster status first because race condition # can occur, i.e. cluster can be down during the job status check. - job_status = spot_utils.get_job_status(self._backend, cluster_name) + job_status = managed_job_utils.get_job_status( + self._backend, cluster_name) if job_status == job_lib.JobStatus.SUCCEEDED: - end_time = spot_utils.get_job_timestamp(self._backend, - cluster_name, - get_end_time=True) + end_time = managed_job_utils.get_job_timestamp( + self._backend, cluster_name, get_end_time=True) # The job is done. - spot_state.set_succeeded(self._job_id, - task_id, - end_time=end_time, - callback_func=callback_func) + managed_job_state.set_succeeded(self._job_id, + task_id, + end_time=end_time, + callback_func=callback_func) logger.info( f'Spot job {self._job_id} (task: {task_id}) SUCCEEDED. ' - f'Cleaning up the spot cluster {cluster_name}.') - # Only clean up the spot cluster, not the storages, because - # tasks may share storages. + f'Cleaning up the cluster {cluster_name}.') + # Only clean up the cluster, not the storages, because tasks may + # share storages. recovery_strategy.terminate_cluster(cluster_name=cluster_name) return True @@ -218,7 +225,8 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: # healthy cluster. We can safely continue monitoring. # For multi-node jobs, since the job may not be set to FAILED # immediately (depending on user program) when only some of the - # nodes are preempted, need to check the actual cluster status. + # nodes are preempted or failed, need to check the actual cluster + # status. if (job_status is not None and not job_status.is_terminal() and task.num_nodes == 1): continue @@ -229,21 +237,28 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: # Add a grace period before the check of preemption to avoid # false alarm for job failure. time.sleep(5) + # Pull the actual cluster status from the cloud provider to - # determine whether the cluster is preempted. + # determine whether the cluster is preempted or failed. + # TODO(zhwu): For hardware failure, such as GPU failure, it may not + # be reflected in the cluster status, depending on the cloud, which + # can also cause failure of the job, and we need to recover it + # rather than fail immediately. 
(cluster_status, handle) = backend_utils.refresh_cluster_status_handle( cluster_name, force_refresh_statuses=set(status_lib.ClusterStatus)) if cluster_status != status_lib.ClusterStatus.UP: - # The cluster is (partially) preempted. It can be down, INIT - # or STOPPED, based on the interruption behavior of the cloud. - # Spot recovery is needed (will be done later in the code). + # The cluster is (partially) preempted or failed. It can be + # down, INIT or STOPPED, based on the interruption behavior of + # the cloud. Spot recovery is needed (will be done later in the + # code). cluster_status_str = ('' if cluster_status is None else f' (status: {cluster_status.value})') logger.info( - f'Cluster is preempted{cluster_status_str}. Recovering...') + f'Cluster is preempted or failed{cluster_status_str}. ' + 'Recovering...') else: if job_status is not None and not job_status.is_terminal(): # The multi-node job is still running, continue monitoring. @@ -252,27 +267,29 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: job_lib.JobStatus.FAILED, job_lib.JobStatus.FAILED_SETUP ]: # The user code has probably crashed, fail immediately. - end_time = spot_utils.get_job_timestamp(self._backend, - cluster_name, - get_end_time=True) + end_time = managed_job_utils.get_job_timestamp( + self._backend, cluster_name, get_end_time=True) logger.info( 'The user job failed. Please check the logs below.\n' f'== Logs of the user job (ID: {self._job_id}) ==\n') self._download_log_and_stream(handle) - spot_status_to_set = spot_state.SpotStatus.FAILED + managed_job_status = ( + managed_job_state.ManagedJobStatus.FAILED) if job_status == job_lib.JobStatus.FAILED_SETUP: - spot_status_to_set = spot_state.SpotStatus.FAILED_SETUP + managed_job_status = ( + managed_job_state.ManagedJobStatus.FAILED_SETUP) failure_reason = ( 'To see the details, run: ' - f'sky spot logs --controller {self._job_id}') - - spot_state.set_failed(self._job_id, - task_id, - failure_type=spot_status_to_set, - failure_reason=failure_reason, - end_time=end_time, - callback_func=callback_func) + f'sky jobs logs --controller {self._job_id}') + + managed_job_state.set_failed( + self._job_id, + task_id, + failure_type=managed_job_status, + failure_reason=failure_reason, + end_time=end_time, + callback_func=callback_func) return False # Although the cluster is healthy, we fail to access the # job status. Try to recover the job (will not restart the @@ -286,22 +303,24 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: if handle is not None: resources = handle.launched_resources assert resources is not None, handle - if resources.need_cleanup_after_preemption(): + if resources.need_cleanup_after_preemption_or_failure(): # Some spot resource (e.g., Spot TPU VM) may need to be - # cleaned up after preemption. - logger.info('Cleaning up the preempted spot cluster...') + # cleaned up after preemption, as running launch again on + # those clusters again may fail. + logger.info('Cleaning up the preempted or failed cluster' + '...') recovery_strategy.terminate_cluster(cluster_name) - # Try to recover the spot jobs, when the cluster is preempted - # or the job status is failed to be fetched. - spot_state.set_recovering(job_id=self._job_id, - task_id=task_id, - callback_func=callback_func) + # Try to recover the managed jobs, when the cluster is preempted or + # failed or the job status is failed to be fetched. 
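+            # The job is marked RECOVERING here; set_recovered() below moves
+            # it back to RUNNING once the relaunch succeeds.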
+ managed_job_state.set_recovering(job_id=self._job_id, + task_id=task_id, + callback_func=callback_func) recovered_time = self._strategy_executor.recover() - spot_state.set_recovered(self._job_id, - task_id, - recovered_time=recovered_time, - callback_func=callback_func) + managed_job_state.set_recovered(self._job_id, + task_id, + recovered_time=recovered_time, + callback_func=callback_func) def run(self): """Run controller logic and handle exceptions.""" @@ -320,27 +339,29 @@ def run(self): common_utils.format_exception(reason, use_bracket=True) for reason in e.reasons)) logger.error(failure_reason) - spot_state.set_failed( + managed_job_state.set_failed( self._job_id, task_id=task_id, - failure_type=spot_state.SpotStatus.FAILED_PRECHECKS, + failure_type=managed_job_state.ManagedJobStatus. + FAILED_PRECHECKS, failure_reason=failure_reason, - callback_func=spot_utils.event_callback_func( + callback_func=managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id])) - except exceptions.SpotJobReachedMaxRetriesError as e: + except exceptions.ManagedJobReachedMaxRetriesError as e: # Please refer to the docstring of self._run for the cases when # this exception can occur. logger.error(common_utils.format_exception(e)) - # The spot job should be marked as FAILED_NO_RESOURCE, as the - # spot job may be able to launch next time. - spot_state.set_failed( + # The managed job should be marked as FAILED_NO_RESOURCE, as the + # managed job may be able to launch next time. + managed_job_state.set_failed( self._job_id, task_id=task_id, - failure_type=spot_state.SpotStatus.FAILED_NO_RESOURCE, + failure_type=managed_job_state.ManagedJobStatus. + FAILED_NO_RESOURCE, failure_reason=common_utils.format_exception(e), - callback_func=spot_utils.event_callback_func( + callback_func=managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id])) @@ -350,12 +371,13 @@ def run(self): msg = ('Unexpected error occurred: ' f'{common_utils.format_exception(e, use_bracket=True)}') logger.error(msg) - spot_state.set_failed( + managed_job_state.set_failed( self._job_id, task_id=task_id, - failure_type=spot_state.SpotStatus.FAILED_CONTROLLER, + failure_type=managed_job_state.ManagedJobStatus. + FAILED_CONTROLLER, failure_reason=msg, - callback_func=spot_utils.event_callback_func( + callback_func=managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id])) @@ -364,27 +386,28 @@ def run(self): # affect the jobs in terminal states. # We need to call set_cancelling before set_cancelled to make sure # the table entries are correctly set. - callback_func = spot_utils.event_callback_func( + callback_func = managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id]) - spot_state.set_cancelling(job_id=self._job_id, - callback_func=callback_func) - spot_state.set_cancelled(job_id=self._job_id, - callback_func=callback_func) + managed_job_state.set_cancelling(job_id=self._job_id, + callback_func=callback_func) + managed_job_state.set_cancelled(job_id=self._job_id, + callback_func=callback_func) def _run_controller(job_id: int, dag_yaml: str, retry_until_up: bool): """Runs the controller in a remote process for interruption.""" # The controller needs to be instantiated in the remote process, since # the controller is not serializable. 
 def _run_controller(job_id: int, dag_yaml: str, retry_until_up: bool):
     """Runs the controller in a remote process for interruption."""
     # The controller needs to be instantiated in the remote process, since
     # the controller is not serializable.
-    spot_controller = SpotController(job_id, dag_yaml, retry_until_up)
-    spot_controller.run()
+    jobs_controller = JobsController(job_id, dag_yaml, retry_until_up)
+    jobs_controller.run()


 def _handle_signal(job_id):
     """Handle the signal if the user sent it."""
-    signal_file = pathlib.Path(spot_utils.SIGNAL_FILE_PREFIX.format(job_id))
+    signal_file = pathlib.Path(
+        managed_job_utils.SIGNAL_FILE_PREFIX.format(job_id))
     user_signal = None
     if signal_file.exists():
         # Filelock is needed to prevent race condition with concurrent
@@ -393,7 +416,7 @@ def _handle_signal(job_id):
         with signal_file.open(mode='r', encoding='utf-8') as f:
             user_signal = f.read().strip()
             try:
-                user_signal = spot_utils.UserSignal(user_signal)
+                user_signal = managed_job_utils.UserSignal(user_signal)
             except ValueError:
                 logger.warning(
                     f'Unknown signal received: {user_signal}. Ignoring.')
@@ -403,28 +426,29 @@ def _handle_signal(job_id):
     if user_signal is None:
         # None or empty string.
         return
-    assert user_signal == spot_utils.UserSignal.CANCEL, (
+    assert user_signal == managed_job_utils.UserSignal.CANCEL, (
         f'Only cancel signal is supported, but {user_signal} got.')
-    raise exceptions.SpotUserCancelledError(
+    raise exceptions.ManagedJobUserCancelledError(
         f'User sent {user_signal.value} signal.')


 def _cleanup(job_id: int, dag_yaml: str):
-    """Clean up the spot cluster(s) and storages.
+    """Clean up the cluster(s) and storages.

     (1) Clean up the succeeded task(s)' ephemeral storage. The storage has
         to be cleaned up after the whole job is finished, as the tasks may
         share the same storage.
-    (2) Clean up the spot cluster(s) that are not cleaned up yet, which
-        can happen when the spot task failed or cancelled. At most one
-        spot cluster should be left when reaching here, as we currently
-        only support chain DAGs, and only spot task is executed at a time.
+    (2) Clean up the cluster(s) that are not cleaned up yet, which can happen
+        when the task fails or is cancelled. At most one cluster should be
+        left when reaching here, as we currently only support chain DAGs, and
+        only one task is executed at a time.
     """
     # NOTE: The code to get cluster name is same as what we did in the spot
-    # controller, we should keep it in sync with SpotController.__init__()
+    # controller; keep it in sync with JobsController.__init__().
     dag, _ = _get_dag_and_name(dag_yaml)
     for task in dag.tasks:
-        cluster_name = spot_utils.generate_spot_cluster_name(task.name, job_id)
+        cluster_name = managed_job_utils.generate_managed_job_cluster_name(
+            task.name, job_id)
         recovery_strategy.terminate_cluster(cluster_name)
     # Clean up Storages with persistent=False.
     # TODO(zhwu): this assumes the specific backend.
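_handle_signal above is the reading side of a tiny file-based cancellation protocol. A sketch of the matching sender follows; the signal-file path format and the 'CANCEL' payload are assumptions for illustration (the real values come from managed_job_utils.SIGNAL_FILE_PREFIX and UserSignal), and it uses the same filelock package the comment above refers to:

import pathlib

import filelock

# Assumed path format; the real one is managed_job_utils.SIGNAL_FILE_PREFIX.
SIGNAL_FILE_PREFIX = '/tmp/sky_jobs_controller_signal_{}'


def request_cancel(job_id: int) -> None:
    """Ask the controller for job_id to cancel, as _handle_signal expects."""
    signal_file = pathlib.Path(SIGNAL_FILE_PREFIX.format(job_id))
    # Lock to avoid racing with the controller's concurrent read.
    with filelock.FileLock(str(signal_file) + '.lock'):
        signal_file.write_text('CANCEL')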
@@ -451,16 +475,15 @@ def start(job_id, dag_yaml, retry_until_up):
         while controller_process.is_alive():
             _handle_signal(job_id)
             time.sleep(1)
-    except exceptions.SpotUserCancelledError:
+    except exceptions.ManagedJobUserCancelledError:
         dag, _ = _get_dag_and_name(dag_yaml)
-        task_id, _ = (spot_state.get_latest_task_id_status(job_id))
+        task_id, _ = managed_job_state.get_latest_task_id_status(job_id)
         logger.info(
-            f'Cancelling spot job, job_id: {job_id}, task_id: {task_id}')
-        spot_state.set_cancelling(job_id=job_id,
-                                  callback_func=spot_utils.event_callback_func(
-                                      job_id=job_id,
-                                      task_id=task_id,
-                                      task=dag.tasks[task_id]))
+            f'Cancelling managed job, job_id: {job_id}, task_id: {task_id}')
+        managed_job_state.set_cancelling(
+            job_id=job_id,
+            callback_func=managed_job_utils.event_callback_func(
+                job_id=job_id, task_id=task_id, task=dag.tasks[task_id]))
         cancelling = True
     finally:
         if controller_process is not None:
@@ -474,37 +497,38 @@ def start(job_id, dag_yaml, retry_until_up):
             controller_process.join()
             logger.info(f'Controller process {controller_process.pid} killed.')

-        logger.info(f'Cleaning up any spot cluster for job {job_id}.')
+        logger.info(f'Cleaning up any cluster for job {job_id}.')
         # NOTE: Originally, we send an interruption signal to the controller
         # process and the controller process handles cleanup. However, we
         # figure out the behavior differs from cloud to cloud
         # (e.g., GCP ignores 'SIGINT'). A possible explanation is
         # https://unix.stackexchange.com/questions/356408/strange-problem-with-trap-and-sigint
         # But anyway, a clean solution is killing the controller process
-        # directly, and then cleanup the cluster state.
+        # directly, and then cleaning up the cluster state.
         _cleanup(job_id, dag_yaml=dag_yaml)
-        logger.info(f'Spot cluster of job {job_id} has been cleaned up.')
+        logger.info(f'Cluster of managed job {job_id} has been cleaned up.')

         if cancelling:
-            spot_state.set_cancelled(
+            managed_job_state.set_cancelled(
                 job_id=job_id,
-                callback_func=spot_utils.event_callback_func(
+                callback_func=managed_job_utils.event_callback_func(
                     job_id=job_id, task_id=task_id, task=dag.tasks[task_id]))

         # We should check job status after 'set_cancelled', otherwise
         # the job status is not terminal.
-        job_status = spot_state.get_status(job_id)
+        job_status = managed_job_state.get_status(job_id)
         assert job_status is not None
         # The job can be non-terminal if the controller exited abnormally,
         # e.g. failed to launch cluster after reaching the MAX_RETRY.
         if not job_status.is_terminal():
-            logger.info(f'Previous spot job status: {job_status.value}')
-            spot_state.set_failed(
+            logger.info(f'Previous job status: {job_status.value}')
+            managed_job_state.set_failed(
                 job_id,
                 task_id=None,
-                failure_type=spot_state.SpotStatus.FAILED_CONTROLLER,
+                failure_type=managed_job_state.ManagedJobStatus.
+                FAILED_CONTROLLER,
                 failure_reason=('Unexpected error occurred. For details, '
-                                f'run: sky spot logs --controller {job_id}'))
+                                f'run: sky jobs logs --controller {job_id}'))


 if __name__ == '__main__':
@@ -515,10 +539,10 @@ def start(job_id, dag_yaml, retry_until_up):
                         help='Job id for the controller job.')
     parser.add_argument('--retry-until-up',
                         action='store_true',
-                        help='Retry until the spot cluster is up.')
+                        help='Retry until the cluster is up.')
     parser.add_argument('dag_yaml',
                         type=str,
-                        help='The path to the user spot task yaml file.')
+                        help='The path to the user job yaml file.')
     args = parser.parse_args()
     # We start process with 'spawn', because 'fork' could result in weird
     # behaviors; 'spawn' is also cross-platform.
diff --git a/sky/spot/core.py b/sky/jobs/core.py
similarity index 67%
rename from sky/spot/core.py
rename to sky/jobs/core.py
index 459f351b7b4..561d47f4b25 100644
--- a/sky/spot/core.py
+++ b/sky/jobs/core.py
@@ -1,4 +1,4 @@
-"""SDK functions for managed spot job."""
+"""SDK functions for managed jobs."""
 import os
 import tempfile
 from typing import Any, Dict, List, Optional, Union
@@ -14,9 +14,9 @@
 from sky import task as task_lib
 from sky.backends import backend_utils
 from sky.clouds.service_catalog import common as service_catalog_common
+from sky.jobs import constants as managed_job_constants
+from sky.jobs import utils as managed_job_utils
 from sky.skylet import constants as skylet_constants
-from sky.spot import constants
-from sky.spot import spot_utils
 from sky.usage import usage_lib
 from sky.utils import common_utils
 from sky.utils import controller_utils
@@ -35,18 +35,19 @@ def launch(
     retry_until_up: bool = False,
 ) -> None:
     # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
-    """Launch a managed spot job.
+    """Launch a managed job.

-    Please refer to the sky.cli.spot_launch for the document.
+    Please refer to sky.cli.job_launch for documentation.

     Args:
         task: sky.Task, or sky.Dag (experimental; 1-task only) to launch as a
-            managed spot job.
-        name: Name of the spot job.
+            managed job.
+        name: Name of the managed job.
         detach_run: Whether to detach the run.

     Raises:
-        ValueError: cluster does not exist.
+        ValueError: cluster does not exist, or the entrypoint is not a
+            valid chain DAG.
         sky.exceptions.NotSupportedError: the feature is not supported.
     """
     entrypoint = task
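Before the function body, a hedged usage sketch of this renamed entrypoint. The YAML path and job name are illustrative, not from this diff; sky.Task.from_yaml and the sky.jobs re-export are assumed from how they are used elsewhere in the change, and only a single task or chain DAG is accepted, per the validation just below:

import sky
from sky import jobs  # the renamed sky.spot module

# 'train.yaml' and the job name are placeholders for a real task definition.
task = sky.Task.from_yaml('train.yaml')
jobs.launch(task, name='my-training-run', retry_until_up=True)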
@@ -55,8 +56,8 @@ def launch(
     dag = dag_utils.convert_entrypoint_to_dag(entrypoint)
     if not dag.is_chain():
         with ux_utils.print_exception_no_traceback():
-            raise ValueError('Only single-task or chain DAG is allowed for '
-                             f'sky.spot.launch. Dag:\n{dag}')
+            raise ValueError('Only single-task or chain DAG is '
+                             f'allowed for job_launch. Dag: {dag}')

     dag_utils.maybe_infer_and_fill_dag_and_task_names(dag)
@@ -71,56 +72,58 @@ def launch(
                 'will be auto-generated) .')
             task_names.add(task_.name)

-    dag_utils.fill_default_spot_config_in_dag_for_spot_launch(dag)
+    dag_utils.fill_default_config_in_dag_for_job_launch(dag)

     for task_ in dag.tasks:
         controller_utils.maybe_translate_local_file_mounts_and_sync_up(
-            task_, path='spot')
+            task_, path='jobs')

-    with tempfile.NamedTemporaryFile(prefix=f'spot-dag-{dag.name}-',
+    with tempfile.NamedTemporaryFile(prefix=f'managed-dag-{dag.name}-',
                                      mode='w') as f:
         dag_utils.dump_chain_dag_to_yaml(dag, f.name)
-        controller_name = spot_utils.SPOT_CONTROLLER_NAME
-        prefix = constants.SPOT_TASK_YAML_PREFIX
+        controller = controller_utils.Controllers.JOBS_CONTROLLER
+        controller_name = controller.value.cluster_name
+        prefix = managed_job_constants.JOBS_TASK_YAML_PREFIX
         remote_user_yaml_path = f'{prefix}/{dag.name}-{dag_uuid}.yaml'
         remote_user_config_path = f'{prefix}/{dag.name}-{dag_uuid}.config_yaml'
         controller_resources = controller_utils.get_controller_resources(
-            controller_type='spot',
-            controller_resources_config=constants.CONTROLLER_RESOURCES)
+            controller=controller_utils.Controllers.JOBS_CONTROLLER,
+            task_resources=sum([list(t.resources) for t in dag.tasks], []))

         vars_to_fill = {
             'remote_user_yaml_path': remote_user_yaml_path,
             'user_yaml_path': f.name,
-            'spot_controller': controller_name,
-            # Note: actual spot cluster name will be <task.name>-<job ID>
+            'jobs_controller': controller_name,
+            # Note: actual cluster name will be <task.name>-<job ID>
             'dag_name': dag.name,
             'retry_until_up': retry_until_up,
             'remote_user_config_path': remote_user_config_path,
-            'sky_python_cmd': skylet_constants.SKY_PYTHON_CMD,
             'modified_catalogs':
                 service_catalog_common.get_modified_catalog_file_mounts(),
             **controller_utils.shared_controller_vars_to_fill(
-                'spot',
+                controller_utils.Controllers.JOBS_CONTROLLER,
                 remote_user_config_path=remote_user_config_path,
             ),
         }

-        yaml_path = os.path.join(constants.SPOT_CONTROLLER_YAML_PREFIX,
-                                 f'{name}-{dag_uuid}.yaml')
-        common_utils.fill_template(constants.SPOT_CONTROLLER_TEMPLATE,
-                                   vars_to_fill,
-                                   output_path=yaml_path)
+        yaml_path = os.path.join(
+            managed_job_constants.JOBS_CONTROLLER_YAML_PREFIX,
+            f'{name}-{dag_uuid}.yaml')
+        common_utils.fill_template(
+            managed_job_constants.JOBS_CONTROLLER_TEMPLATE,
+            vars_to_fill,
+            output_path=yaml_path)
         controller_task = task_lib.Task.from_yaml(yaml_path)
         controller_task.set_resources(controller_resources)

-        controller_task.spot_dag = dag
-        assert len(controller_task.resources) == 1
+        controller_task.managed_job_dag = dag
+        assert len(controller_task.resources) == 1, controller_task

         sky_logging.print(
             f'{colorama.Fore.YELLOW}'
-            f'Launching managed spot job {dag.name!r} from spot controller...'
+            f'Launching managed job {dag.name!r} from jobs controller...'
             f'{colorama.Style.RESET_ALL}')
-        sky_logging.print('Launching spot controller...')
+        sky_logging.print('Launching jobs controller...')
         sky.launch(task=controller_task,
                    stream_logs=stream_logs,
                    cluster_name=controller_name,
@@ -134,9 +137,9 @@ def launch(

 @usage_lib.entrypoint
 def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
     # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
-    """Get statuses of managed spot jobs.
+    """Get statuses of managed jobs.

-    Please refer to the sky.cli.spot_queue for the documentation.
+    Please refer to sky.cli.job_queue for documentation.
     Returns:
         [
@@ -148,23 +151,23 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
                 'end_at': (float) timestamp of end,
                 'duration': (float) duration in seconds,
                 'recovery_count': (int) Number of retries,
-                'status': (sky.spot.SpotStatus) of the job,
+                'status': (sky.jobs.ManagedJobStatus) of the job,
                 'cluster_resources': (str) resources of the cluster,
                 'region': (str) region of the cluster,
             }
         ]

     Raises:
-        sky.exceptions.ClusterNotUpError: the spot controller is not up or
+        sky.exceptions.ClusterNotUpError: the jobs controller is not up or
             does not exist.
-        RuntimeError: if failed to get the spot jobs with ssh.
+        RuntimeError: if it fails to get the managed jobs over SSH.
     """
+    jobs_controller_type = controller_utils.Controllers.JOBS_CONTROLLER
     stopped_message = ''
     if not refresh:
-        stopped_message = 'No in-progress spot jobs.'
+        stopped_message = 'No in-progress managed jobs.'
     try:
         handle = backend_utils.is_controller_accessible(
-            controller_type=controller_utils.Controllers.SPOT_CONTROLLER,
-            stopped_message=stopped_message)
+            controller=jobs_controller_type, stopped_message=stopped_message)
     except exceptions.ClusterNotUpError as e:
         if not refresh:
             raise
@@ -176,18 +179,19 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
             'Restarting controller for latest status...'
             f'{colorama.Style.RESET_ALL}')

-        rich_utils.force_update_status('[cyan] Checking spot jobs - restarting '
-                                       'controller[/]')
-        handle = sky.start(spot_utils.SPOT_CONTROLLER_NAME)
+        rich_utils.force_update_status(
+            '[cyan] Checking managed jobs - restarting '
+            'controller[/]')
+        handle = sky.start(jobs_controller_type.value.cluster_name)
         controller_status = status_lib.ClusterStatus.UP
-        rich_utils.force_update_status('[cyan] Checking spot jobs[/]')
+        rich_utils.force_update_status('[cyan] Checking managed jobs[/]')

     assert handle is not None, (controller_status, refresh)

     backend = backend_utils.get_backend_from_handle(handle)
     assert isinstance(backend, backends.CloudVmRayBackend)

-    code = spot_utils.SpotCodeGen.get_job_table()
+    code = managed_job_utils.ManagedJobCodeGen.get_job_table()
     returncode, job_table_payload, stderr = backend.run_on_head(
         handle,
         code,
@@ -198,13 +202,13 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
     try:
         subprocess_utils.handle_returncode(returncode,
                                            code,
-                                           'Failed to fetch managed spot jobs',
+                                           'Failed to fetch managed jobs',
                                            job_table_payload + stderr,
                                            stream_logs=False)
     except exceptions.CommandError as e:
         raise RuntimeError(str(e)) from e

-    jobs = spot_utils.load_spot_job_queue(job_table_payload)
+    jobs = managed_job_utils.load_managed_job_queue(job_table_payload)
     if skip_finished:
         # Filter out the finished jobs. If a multi-task job is partially
         # finished, we will include all its tasks.
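Given the record format documented above, a caller can post-process the queue the same way the skip_finished filter does. A small hedged sketch of consuming the returned list; the field names come from the docstring, and the .is_terminal() check on the status enum mirrors how statuses are used elsewhere in this diff:

from sky import jobs

# Fetch all records, including finished ones (refresh=False assumes the
# jobs controller is already UP).
records = jobs.queue(refresh=False, skip_finished=False)

# Mirror the skip_finished filter: keep jobs that are still in progress.
in_progress = [r for r in records if not r['status'].is_terminal()]
for r in in_progress:
    print(f"{r['job_id']}: recoveries={r['recovery_count']}, "
          f"status={r['status']}")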
""" job_ids = [] if job_ids is None else job_ids handle = backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, - stopped_message='All managed spot jobs should have finished.') + controller=controller_utils.Controllers.JOBS_CONTROLLER, + stopped_message='All managed jobs should have finished.') job_id_str = ','.join(map(str, job_ids)) if sum([len(job_ids) > 0, name is not None, all]) != 1: @@ -247,12 +251,12 @@ def cancel(name: Optional[str] = None, backend = backend_utils.get_backend_from_handle(handle) assert isinstance(backend, backends.CloudVmRayBackend) if all: - code = spot_utils.SpotCodeGen.cancel_jobs_by_id(None) + code = managed_job_utils.ManagedJobCodeGen.cancel_jobs_by_id(None) elif job_ids: - code = spot_utils.SpotCodeGen.cancel_jobs_by_id(job_ids) + code = managed_job_utils.ManagedJobCodeGen.cancel_jobs_by_id(job_ids) else: assert name is not None, (job_ids, name, all) - code = spot_utils.SpotCodeGen.cancel_job_by_name(name) + code = managed_job_utils.ManagedJobCodeGen.cancel_job_by_name(name) # The stderr is redirected to stdout returncode, stdout, _ = backend.run_on_head(handle, code, @@ -260,7 +264,7 @@ def cancel(name: Optional[str] = None, stream_logs=False) try: subprocess_utils.handle_returncode(returncode, code, - 'Failed to cancel managed spot job', + 'Failed to cancel managed job', stdout) except exceptions.CommandError as e: with ux_utils.print_exception_no_traceback(): @@ -274,44 +278,53 @@ def cancel(name: Optional[str] = None, @usage_lib.entrypoint -def tail_logs(name: Optional[str], job_id: Optional[int], follow: bool) -> None: +def tail_logs(name: Optional[str], job_id: Optional[int], follow: bool, + controller: bool) -> None: # NOTE(dev): Keep the docstring consistent between the Python API and CLI. - """Tail logs of managed spot jobs. + """Tail logs of managed jobs. - Please refer to the sky.cli.spot_logs for the document. + Please refer to sky.cli.job_logs for documentation. Raises: ValueError: invalid arguments. - sky.exceptions.ClusterNotUpError: the spot controller is not up. + sky.exceptions.ClusterNotUpError: the jobs controller is not up. 
""" - # TODO(zhwu): Automatically restart the spot controller + # TODO(zhwu): Automatically restart the jobs controller + jobs_controller_type = controller_utils.Controllers.JOBS_CONTROLLER handle = backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, - stopped_message=('Please restart the spot controller with ' - f'`sky start {spot_utils.SPOT_CONTROLLER_NAME}`.')) + controller=jobs_controller_type, + stopped_message=( + 'Please restart the jobs controller with ' + f'`sky start {jobs_controller_type.value.cluster_name}`.')) if name is not None and job_id is not None: raise ValueError('Cannot specify both name and job_id.') backend = backend_utils.get_backend_from_handle(handle) assert isinstance(backend, backends.CloudVmRayBackend), backend - # Stream the realtime logs - backend.tail_spot_logs(handle, job_id=job_id, job_name=name, follow=follow) + backend.tail_managed_job_logs(handle, + job_id=job_id, + job_name=name, + follow=follow, + controller=controller) -spot_launch = common_utils.deprecated_function(launch, - name='sky.spot.launch', - deprecated_name='spot_launch', - removing_version='0.7.0') + +spot_launch = common_utils.deprecated_function( + launch, + name='sky.jobs.launch', + deprecated_name='spot_launch', + removing_version='0.8.0', + override_argument={'use_spot': True}) spot_queue = common_utils.deprecated_function(queue, - name='sky.spot.queue', + name='sky.jobs.queue', deprecated_name='spot_queue', - removing_version='0.7.0') + removing_version='0.8.0') spot_cancel = common_utils.deprecated_function(cancel, - name='sky.spot.cancel', + name='sky.jobs.cancel', deprecated_name='spot_cancel', - removing_version='0.7.0') + removing_version='0.8.0') spot_tail_logs = common_utils.deprecated_function( tail_logs, - name='sky.spot.tail_logs', + name='sky.jobs.tail_logs', deprecated_name='spot_tail_logs', - removing_version='0.7.0') + removing_version='0.8.0') diff --git a/sky/jobs/dashboard/dashboard.py b/sky/jobs/dashboard/dashboard.py new file mode 100644 index 00000000000..89c97274646 --- /dev/null +++ b/sky/jobs/dashboard/dashboard.py @@ -0,0 +1,87 @@ +"""Dashboard for managed jobs based on Flask. + +TODO(zongheng): This is a basic version. In the future we can beef up the web +frameworks used (e.g., +https://github.com/ray-project/ray/tree/master/dashboard/client/src) and/or get +rid of the SSH port-forwarding business (see cli.py's job_dashboard() +comment). +""" +import datetime +import pathlib + +import flask +import yaml + +from sky import jobs as managed_jobs +from sky.utils import common_utils +from sky.utils import controller_utils + +app = flask.Flask(__name__) + + +def _is_running_on_jobs_controller() -> bool: + """Am I running on jobs controller? + + Loads ~/.sky/sky_ray.yml and check cluster_name. + """ + if pathlib.Path('~/.sky/sky_ray.yml').expanduser().exists(): + config = yaml.safe_load( + pathlib.Path('~/.sky/sky_ray.yml').expanduser().read_text()) + cluster_name = config.get('cluster_name', '') + candidate_controller_names = ( + controller_utils.Controllers.JOBS_CONTROLLER.value. + candidate_cluster_names) + # We use startswith instead of exact match because the cluster name in + # the yaml file is cluster_name_on_cloud which may have additional + # suffices. + return any( + cluster_name.startswith(name) + for name in candidate_controller_names) + return False + + +@app.route('/') +def home(): + if not _is_running_on_jobs_controller(): + # Experimental: run on laptop (refresh is very slow). 
diff --git a/sky/jobs/dashboard/dashboard.py b/sky/jobs/dashboard/dashboard.py
new file mode 100644
index 00000000000..89c97274646
--- /dev/null
+++ b/sky/jobs/dashboard/dashboard.py
@@ -0,0 +1,87 @@
+"""Dashboard for managed jobs based on Flask.
+
+TODO(zongheng): This is a basic version. In the future we can beef up the web
+frameworks used (e.g.,
+https://github.com/ray-project/ray/tree/master/dashboard/client/src) and/or get
+rid of the SSH port-forwarding business (see cli.py's job_dashboard()
+comment).
+"""
+import datetime
+import pathlib
+
+import flask
+import yaml
+
+from sky import jobs as managed_jobs
+from sky.utils import common_utils
+from sky.utils import controller_utils
+
+app = flask.Flask(__name__)
+
+
+def _is_running_on_jobs_controller() -> bool:
+    """Am I running on the jobs controller?
+
+    Loads ~/.sky/sky_ray.yml and checks cluster_name.
+    """
+    if pathlib.Path('~/.sky/sky_ray.yml').expanduser().exists():
+        config = yaml.safe_load(
+            pathlib.Path('~/.sky/sky_ray.yml').expanduser().read_text())
+        cluster_name = config.get('cluster_name', '')
+        candidate_controller_names = (
+            controller_utils.Controllers.JOBS_CONTROLLER.value.
+            candidate_cluster_names)
+        # We use startswith instead of exact match because the cluster name in
+        # the yaml file is cluster_name_on_cloud, which may have additional
+        # suffixes.
+        return any(
+            cluster_name.startswith(name)
+            for name in candidate_controller_names)
+    return False
+
+
+@app.route('/')
+def home():
+    if not _is_running_on_jobs_controller():
+        # Experimental: run on laptop (refresh is very slow).
+        all_managed_jobs = managed_jobs.queue(refresh=True, skip_finished=False)
+    else:
+        job_table = managed_jobs.dump_managed_job_queue()
+        all_managed_jobs = managed_jobs.load_managed_job_queue(job_table)
+
+    timestamp = datetime.datetime.now(datetime.timezone.utc)
+    rows = managed_jobs.format_job_table(all_managed_jobs,
+                                         show_all=True,
+                                         return_rows=True)
+    # Add an empty column for the dropdown button. This will be added in the
+    # jobs/templates/index.html file.
+    rows = [[''] + row for row in rows]
+
+    # FIXME(zongheng): make the job table/queue funcs return structured info so
+    # that we don't have to do things like row[-5] below.
+    columns = [
+        '', 'ID', 'Task', 'Name', 'Resources', 'Submitted', 'Total Duration',
+        'Job Duration', 'Recoveries', 'Status', 'Started', 'Cluster', 'Region',
+        'Failure'
+    ]
+    if rows and len(rows[0]) != len(columns):
+        raise RuntimeError(
+            'Dashboard code and managed job queue code are out of sync.')
+
+    # Fix STATUS color codes: '\x1b[33mCANCELLED\x1b[0m' -> 'CANCELLED'.
+    for row in rows:
+        row[-5] = common_utils.remove_color(row[-5])
+    # Remove filler rows ([''], ..., ['-']).
+    rows = [row for row in rows if ''.join(map(str, row)) != '']
+
+    rendered_html = flask.render_template(
+        'index.html',
+        columns=columns,
+        rows=rows,
+        last_updated_timestamp=timestamp,
+    )
+    return rendered_html
+
+
+if __name__ == '__main__':
+    app.run()
diff --git a/sky/spot/dashboard/static/favicon.ico b/sky/jobs/dashboard/static/favicon.ico
similarity index 100%
rename from sky/spot/dashboard/static/favicon.ico
rename to sky/jobs/dashboard/static/favicon.ico
diff --git a/sky/spot/dashboard/templates/index.html b/sky/jobs/dashboard/templates/index.html
similarity index 50%
rename from sky/spot/dashboard/templates/index.html
rename to sky/jobs/dashboard/templates/index.html
index f8267cd3e5f..af4f5708bce 100644
--- a/sky/spot/dashboard/templates/index.html
+++ b/sky/jobs/dashboard/templates/index.html
@@ -6,7 +6,8 @@
   <title>SkyPilot Dashboard</title>
-
+
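The STATUS fix in home() relies on common_utils.remove_color. A self-contained sketch of the same idea, stripping ANSI SGR escape sequences with a regex; this is an illustration, not SkyPilot's actual implementation:

import re

# Matches ANSI escape sequences such as '\x1b[33m' and '\x1b[0m'.
_ANSI_ESCAPE = re.compile(r'\x1b\[[0-9;]*m')


def remove_color(text: str) -> str:
    """Strip ANSI color codes, e.g. '\x1b[33mCANCELLED\x1b[0m' -> 'CANCELLED'."""
    return _ANSI_ESCAPE.sub('', text)


assert remove_color('\x1b[33mCANCELLED\x1b[0m') == 'CANCELLED'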