From e870839aeed16c118c0eb1f4889efc20006c27c4 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Mon, 16 Sep 2024 13:06:54 -0700 Subject: [PATCH 01/93] [LLM] Add huggingface token due to recent change of Pixtral on HF (#3950) --- llm/pixtral/pixtral.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/llm/pixtral/pixtral.yaml b/llm/pixtral/pixtral.yaml index 5977ed0592d..260888f86cf 100644 --- a/llm/pixtral/pixtral.yaml +++ b/llm/pixtral/pixtral.yaml @@ -1,5 +1,6 @@ envs: MODEL_NAME: mistralai/Pixtral-12B-2409 + HF_TOKEN: service: replicas: 2 From 303d43f87728c63d0abf70f43236e4c40681767e Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Tue, 17 Sep 2024 23:21:47 -0700 Subject: [PATCH 02/93] [LLM] Update qwen examples (#3957) * update qwen examples * Fix misalign --- llm/qwen/README.md | 24 +++++++++---------- .../{serve-110b.yaml => qwen15-110b.yaml} | 16 ++++--------- llm/qwen/{serve-72b.yaml => qwen2-72b.yaml} | 16 ++++--------- llm/qwen/{serve-7b.yaml => qwen2-7b.yaml} | 13 +++------- 4 files changed, 23 insertions(+), 46 deletions(-) rename llm/qwen/{serve-110b.yaml => qwen15-110b.yaml} (66%) rename llm/qwen/{serve-72b.yaml => qwen2-72b.yaml} (66%) rename llm/qwen/{serve-7b.yaml => qwen2-7b.yaml} (71%) diff --git a/llm/qwen/README.md b/llm/qwen/README.md index 6a76af71287..d10b9e59c36 100644 --- a/llm/qwen/README.md +++ b/llm/qwen/README.md @@ -3,9 +3,9 @@ [Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs. As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard). -πŸ“° **Update (26 April 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) to serve the 110B model. +πŸ“° **Update (Jun 6 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5. -πŸ“° **Update (6 Jun 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5. +πŸ“° **Update (April 26 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) to serve the 110B model.

[Image: qwen]

@@ -27,16 +27,16 @@ As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Qwen model on vLLM with SkyPilot in 1-click:

-1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [serve-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml) or [serve-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-7b.yaml) for a smaller model):
+1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [qwen2-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen2-72b.yaml) or [qwen2-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen2-7b.yaml) for a smaller model):

```console
-sky launch -c qwen serve-110b.yaml
+sky launch -c qwen qwen15-110b.yaml
```
2. Send a request to the endpoint for completion:
```bash
-IP=$(sky status --ip qwen)
+ENDPOINT=$(sky status --endpoint 8000 qwen)

-curl http://$IP:8000/v1/completions \
+curl http://$ENDPOINT/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "Qwen/Qwen1.5-110B-Chat",
@@ -47,7 +47,7 @@ curl http://$IP:8000/v1/completions \

3. Send a request for chat completion:
```bash
-curl http://$IP:8000/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "Qwen/Qwen1.5-110B-Chat",
@@ -69,7 +69,7 @@ curl http://$IP:8000/v1/chat/completions \

1. With [SkyPilot Serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
```bash
-sky serve up -n qwen ./serve-72b.yaml
+sky serve up -n qwen ./qwen2-72b.yaml
```
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.
@@ -82,13 +82,13 @@ sky serve status qwen
After a while, you will see the following output:
```console
Services
-NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
+NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
Qwen  1        -       READY   2/2       3.85.107.228:30002

Service Replicas
-SERVICE_NAME  ID  VERSION  IP  LAUNCHED    RESOURCES                   STATUS  REGION
-Qwen          1   1        -   2 mins ago  1x Azure({'A100-80GB': 8})  READY   eastus
-Qwen          2   1        -   2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
+SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED    RESOURCES                   STATUS  REGION
+Qwen          1   1        -         2 mins ago  1x Azure({'A100-80GB': 8})  READY   eastus
+Qwen          2   1        -         2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
```
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type is chosen to be **the cheapest available one** on the clouds.
That said, it maximizes the availability of the service by spreading the replicas across different locations.

diff --git a/llm/qwen/serve-110b.yaml b/llm/qwen/qwen15-110b.yaml
similarity index 66%
rename from llm/qwen/serve-110b.yaml
rename to llm/qwen/qwen15-110b.yaml
index 1e98bd254e9..71f2c2b4a34 100644
--- a/llm/qwen/serve-110b.yaml
+++ b/llm/qwen/qwen15-110b.yaml
@@ -24,20 +24,12 @@ resources:
 ports: 8000

setup: |
- conda activate qwen
- if [ $? -ne 0 ]; then
- conda create -n qwen python=3.10 -y
- conda activate qwen
- fi
- pip install vllm==0.4.2
- pip install flash-attn==2.5.9.post1
+ pip install vllm==0.6.1.post2
+ pip install vllm-flash-attn

run: |
- conda activate qwen
 export PATH=$PATH:/sbin
- python -u -m vllm.entrypoints.openai.api_server \
+ vllm serve $MODEL_NAME \
 --host 0.0.0.0 \
- --model $MODEL_NAME \
 --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
- --max-num-seqs 16 | tee ~/openai_api_server.log
-
+ --max-model-len 1024 | tee ~/openai_api_server.log
diff --git a/llm/qwen/serve-72b.yaml b/llm/qwen/qwen2-72b.yaml
similarity index 66%
rename from llm/qwen/serve-72b.yaml
rename to llm/qwen/qwen2-72b.yaml
index 34e3e348f2f..00ff41506e9 100644
--- a/llm/qwen/serve-72b.yaml
+++ b/llm/qwen/qwen2-72b.yaml
@@ -24,20 +24,12 @@ resources:
 ports: 8000

setup: |
- conda activate qwen
- if [ $? -ne 0 ]; then
- conda create -n qwen python=3.10 -y
- conda activate qwen
- fi
- pip install vllm==0.4.2
- pip install flash-attn==2.5.9.post1
+ pip install vllm==0.6.1.post2
+ pip install vllm-flash-attn

run: |
- conda activate qwen
 export PATH=$PATH:/sbin
- python -u -m vllm.entrypoints.openai.api_server \
+ vllm serve $MODEL_NAME \
 --host 0.0.0.0 \
- --model $MODEL_NAME \
 --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
- --max-num-seqs 16 | tee ~/openai_api_server.log
-
+ --max-model-len 1024 | tee ~/openai_api_server.log
diff --git a/llm/qwen/serve-7b.yaml b/llm/qwen/qwen2-7b.yaml
similarity index 71%
rename from llm/qwen/serve-7b.yaml
rename to llm/qwen/qwen2-7b.yaml
index f33adcdd2cd..ccf8d62d306 100644
--- a/llm/qwen/serve-7b.yaml
+++ b/llm/qwen/qwen2-7b.yaml
@@ -22,19 +22,12 @@ resources:
 ports: 8000

setup: |
- conda activate qwen
- if [ $? -ne 0 ]; then
- conda create -n qwen python=3.10 -y
- conda activate qwen
- fi
- pip install vllm==0.4.2
- pip install flash-attn==2.5.9.post1
+ pip install vllm==0.6.1.post2
+ pip install vllm-flash-attn

run: |
- conda activate qwen
 export PATH=$PATH:/sbin
- python -m vllm.entrypoints.openai.api_server \
+ vllm serve $MODEL_NAME \
 --host 0.0.0.0 \
- --model $MODEL_NAME \
 --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
 --max-model-len 1024 | tee ~/openai_api_server.log

From 402382286f4488c62206967ef13f2b078a000e6b Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Wed, 18 Sep 2024 14:15:45 -0700
Subject: [PATCH 03/93] Qwen 2.5 support (#3959)

* Update qwen example for 2.5 release

* Add support for qwen 2.5 example

---
 llm/qwen/README.md | 14 ++++++++------
 llm/qwen/{qwen2-72b.yaml => qwen25-72b.yaml} | 2 +-
 llm/qwen/{qwen2-7b.yaml => qwen25-7b.yaml} | 2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)
 rename llm/qwen/{qwen2-72b.yaml => qwen25-72b.yaml} (95%)
 rename llm/qwen/{qwen2-7b.yaml => qwen25-7b.yaml} (94%)

diff --git a/llm/qwen/README.md b/llm/qwen/README.md
index d10b9e59c36..04ecd88e9c0 100644
--- a/llm/qwen/README.md
+++ b/llm/qwen/README.md
@@ -3,9 +3,11 @@

[Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs.
As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard).
-πŸ“° **Update (Jun 6 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5. +**Update (Sep 18, 2024) -** SkyPilot now supports the [**Qwen2.5**](https://qwenlm.github.io/blog/qwen2.5/) model! -πŸ“° **Update (April 26 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) to serve the 110B model. +πŸ“° **Update (Jun 6, 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5. + +πŸ“° **Update (April 26, 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) to serve the 110B model.

[Image: qwen]

@@ -27,7 +29,7 @@ As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Qwen model on vLLM with SkyPilot in 1-click:

-1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [qwen2-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen2-72b.yaml) or [qwen2-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen2-7b.yaml) for a smaller model):
+1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [qwen15-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen15-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [qwen25-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen25-72b.yaml) or [qwen25-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen25-7b.yaml) for a smaller model):

```console
sky launch -c qwen qwen15-110b.yaml
```
@@ -69,7 +71,7 @@ curl http://$ENDPOINT/v1/chat/completions \

1. With [SkyPilot Serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
```bash
-sky serve up -n qwen ./qwen2-72b.yaml
+sky serve up -n qwen ./qwen25-72b.yaml
```
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.
@@ -101,7 +103,7 @@ ENDPOINT=$(sky serve status --endpoint qwen)
curl http://$ENDPOINT/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
-    "model": "Qwen/Qwen2-72B-Instruct",
+    "model": "Qwen/Qwen2.5-72B-Instruct",
 "messages": [
 {
 "role": "system",
@@ -123,7 +125,7 @@ It is also possible to access the Qwen service with a GUI using [vLLM](https://g

1. Start the chat web UI (change the `--env` flag to the model you are running):
```bash
-sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen)
+sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2.5-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen)
```

2. Then, we can access the GUI at the returned gradio link:
diff --git a/llm/qwen/qwen2-72b.yaml b/llm/qwen/qwen25-72b.yaml
similarity index 95%
rename from llm/qwen/qwen2-72b.yaml
rename to llm/qwen/qwen25-72b.yaml
index 00ff41506e9..cfbf1d06a8c 100644
--- a/llm/qwen/qwen2-72b.yaml
+++ b/llm/qwen/qwen25-72b.yaml
@@ -1,5 +1,5 @@
 envs:
-  MODEL_NAME: Qwen/Qwen2-72B-Instruct
+  MODEL_NAME: Qwen/Qwen2.5-72B-Instruct

 service:
 # Specifying the path to the endpoint to check the readiness of the replicas.
diff --git a/llm/qwen/qwen2-7b.yaml b/llm/qwen/qwen25-7b.yaml
similarity index 94%
rename from llm/qwen/qwen2-7b.yaml
rename to llm/qwen/qwen25-7b.yaml
index ccf8d62d306..f9065e08579 100644
--- a/llm/qwen/qwen2-7b.yaml
+++ b/llm/qwen/qwen25-7b.yaml
@@ -1,5 +1,5 @@
 envs:
-  MODEL_NAME: Qwen/Qwen2-7B-Instruct
+  MODEL_NAME: Qwen/Qwen2.5-7B-Instruct

 service:
 # Specifying the path to the endpoint to check the readiness of the replicas.
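The OpenAI-compatible server these YAMLs launch can also be queried from Python instead of `curl`. Below is a minimal sketch using the `openai` client; the endpoint value is a placeholder for whatever `sky status --endpoint 8000 qwen` prints, and vLLM ignores the API key (the client merely requires one):

```python
# Sketch: query the vLLM OpenAI-compatible endpoint from Python.
# ENDPOINT is a placeholder -- substitute the address printed by
# `sky status --endpoint 8000 qwen`.
import openai

ENDPOINT = '1.2.3.4:8000'  # placeholder
client = openai.OpenAI(base_url=f'http://{ENDPOINT}/v1', api_key='EMPTY')
resp = client.chat.completions.create(
    model='Qwen/Qwen2.5-72B-Instruct',
    messages=[{'role': 'user', 'content': 'Who are you?'}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```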
From 7ca0c48c09a5ecb753024320978854d8f37c982d Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Wed, 18 Sep 2024 14:32:31 -0700
Subject: [PATCH 04/93] Qwen 2.5 k8s (#3960)

* Update qwen example for 2.5 release

* Add support for qwen 2.5 example

* add kubernetes

---
 llm/qwen/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/llm/qwen/README.md b/llm/qwen/README.md
index 04ecd88e9c0..cd2c88f5e75 100644
--- a/llm/qwen/README.md
+++ b/llm/qwen/README.md
@@ -1,4 +1,4 @@
-# Serving Qwen2 on Your Own Cloud
+# Serving Qwen2 on Your Own Kubernetes or Cloud

[Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs.
As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard).
@@ -18,10 +18,10 @@ As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS

## Why use SkyPilot to deploy over commercial hosted solutions?

-* Get the best GPU availability by utilizing multiple resources pools across multiple regions and clouds.
-* Pay absolute minimum β€” SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups.
+* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
+* Pay absolute minimum β€” SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint
-* Everything stays in your cloud account (your VMs & buckets)
+* Everything stays in your Kubernetes or cloud account (your VMs & buckets)
* Completely private - no one else sees your chat history

From e558ec2e08f176eb73ed04da4d86d97f24a7700f Mon Sep 17 00:00:00 2001
From: Haijian Wang <130898843+Haijian06@users.noreply.github.com>
Date: Thu, 19 Sep 2024 16:08:08 +0800
Subject: [PATCH 05/93] Integrating the Yi series models (#3958)

* Add files via upload
* Update and rename qwen2-7b.yaml to yi15-6b.yaml
* Add files via upload
* Update yi15-9b.yaml
* Update yi15-34b.yaml
* Update yi15-6b.yaml
* Add files via upload
* Update yicoder-1_5b.yaml
* Update yicoder-9b.yaml
* Add files via upload
* Update yi15-34b.yaml
* Update yi15-6b.yaml
* Update yi15-9b.yaml
* Update yicoder-1_5b.yaml
* Update yicoder-9b.yaml

---
 llm/yi/README.md | 60 ++++++++++++++++++++++++++++++++++++++++
 llm/yi/yi15-34b.yaml | 20 ++++++++++++++
 llm/yi/yi15-6b.yaml | 18 ++++++++++++
 llm/yi/yi15-9b.yaml | 18 ++++++++++++
 llm/yi/yicoder-1_5b.yaml | 18 ++++++++++++
 llm/yi/yicoder-9b.yaml | 18 ++++++++++++
 6 files changed, 152 insertions(+)
 create mode 100644 llm/yi/README.md
 create mode 100644 llm/yi/yi15-34b.yaml
 create mode 100644 llm/yi/yi15-6b.yaml
 create mode 100644 llm/yi/yi15-9b.yaml
 create mode 100644 llm/yi/yicoder-1_5b.yaml
 create mode 100644 llm/yi/yicoder-9b.yaml

diff --git a/llm/yi/README.md b/llm/yi/README.md
new file mode 100644
index 00000000000..76fcf6151e6
--- /dev/null
+++ b/llm/yi/README.md
@@ -0,0 +1,60 @@
+# Serving Yi on Your Own Kubernetes or Cloud
+
+πŸ€– The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en).
+
+**Update (Sep 19, 2024) -** SkyPilot now supports the [**Yi**](https://01-ai.github.io/) models (Yi-Coder, Yi-1.5)!
+
+[Image: yi]
+
+
+## Why use SkyPilot to deploy over commercial hosted solutions?
+
+* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
+* Pay absolute minimum β€” SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed solution markups.
+* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint
+* Everything stays in your Kubernetes or cloud account (your VMs & buckets)
+* Completely private - no one else sees your chat history
+
+
+## Running the Yi model with SkyPilot
+
+After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Yi model on vLLM with SkyPilot in 1-click:
+
+1. Start serving Yi-1.5 34B on a single instance with any available GPU in the list specified in [yi15-34b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yi15-34b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [yicoder-9b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yicoder-9b.yaml) or [other models](https://github.com/skypilot-org/skypilot/tree/master/llm/yi) for a smaller model):
+
+```console
+sky launch -c yi yi15-34b.yaml
+```
+2. Send a request to the endpoint for completion:
+```bash
+ENDPOINT=$(sky status --endpoint 8000 yi)
+
+curl http://$ENDPOINT/v1/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "01-ai/Yi-1.5-34B-Chat",
+ "prompt": "Who are you?",
+ "max_tokens": 512
+ }' | jq -r '.choices[0].text'
+```
+
+3. Send a request for chat completion:
+```bash
+curl http://$ENDPOINT/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "01-ai/Yi-1.5-34B-Chat",
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are a helpful assistant."
+ },
+ {
+ "role": "user",
+ "content": "Who are you?"
+ } + ], + "max_tokens": 512 + }' | jq -r '.choices[0].message.content' +``` diff --git a/llm/yi/yi15-34b.yaml b/llm/yi/yi15-34b.yaml new file mode 100644 index 00000000000..99fe5481d7a --- /dev/null +++ b/llm/yi/yi15-34b.yaml @@ -0,0 +1,20 @@ +envs: + MODEL_NAME: 01-ai/Yi-1.5-34B-Chat + +resources: + accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8} + disk_size: 1024 + disk_tier: best + memory: 32+ + ports: 8000 + +setup: | + pip install vllm==0.6.1.post2 + pip install vllm-flash-attn + +run: | + export PATH=$PATH:/sbin + vllm serve $MODEL_NAME \ + --host 0.0.0.0 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 1024 | tee ~/openai_api_server.log diff --git a/llm/yi/yi15-6b.yaml b/llm/yi/yi15-6b.yaml new file mode 100644 index 00000000000..879f5ffea9c --- /dev/null +++ b/llm/yi/yi15-6b.yaml @@ -0,0 +1,18 @@ +envs: + MODEL_NAME: 01-ai/Yi-1.5-6B-Chat + +resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} + disk_tier: best + ports: 8000 + +setup: | + pip install vllm==0.6.1.post2 + pip install vllm-flash-attn + +run: | + export PATH=$PATH:/sbin + vllm serve $MODEL_NAME \ + --host 0.0.0.0 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 1024 | tee ~/openai_api_server.log diff --git a/llm/yi/yi15-9b.yaml b/llm/yi/yi15-9b.yaml new file mode 100644 index 00000000000..b7ac40b4e11 --- /dev/null +++ b/llm/yi/yi15-9b.yaml @@ -0,0 +1,18 @@ +envs: + MODEL_NAME: 01-ai/Yi-1.5-9B-Chat + +resources: + accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8} + disk_tier: best + ports: 8000 + +setup: | + pip install vllm==0.6.1.post2 + pip install vllm-flash-attn + +run: | + export PATH=$PATH:/sbin + vllm serve $MODEL_NAME \ + --host 0.0.0.0 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 1024 | tee ~/openai_api_server.log diff --git a/llm/yi/yicoder-1_5b.yaml b/llm/yi/yicoder-1_5b.yaml new file mode 100644 index 00000000000..383f88b657d --- /dev/null +++ b/llm/yi/yicoder-1_5b.yaml @@ -0,0 +1,18 @@ +envs: + MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat + +resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} + disk_tier: best + ports: 8000 + +setup: | + pip install vllm==0.6.1.post2 + pip install vllm-flash-attn + +run: | + export PATH=$PATH:/sbin + vllm serve $MODEL_NAME \ + --host 0.0.0.0 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 1024 | tee ~/openai_api_server.log diff --git a/llm/yi/yicoder-9b.yaml b/llm/yi/yicoder-9b.yaml new file mode 100644 index 00000000000..28e74b45bb5 --- /dev/null +++ b/llm/yi/yicoder-9b.yaml @@ -0,0 +1,18 @@ +envs: + MODEL_NAME: 01-ai/Yi-Coder-9B-Chat + +resources: + accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8} + disk_tier: best + ports: 8000 + +setup: | + pip install vllm==0.6.1.post2 + pip install vllm-flash-attn + +run: | + export PATH=$PATH:/sbin + vllm serve $MODEL_NAME \ + --host 0.0.0.0 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 1024 | tee ~/openai_api_server.log From 3871de975b4df9c24ea089698d92cd9c188cd532 Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Thu, 19 Sep 2024 12:37:50 -0700 Subject: [PATCH 06/93] [Test] Fix Smoke Test `test-skyserve-fast-update` (#3956) * init * add newline --- tests/skyserve/update/bump_version_after.yaml | 3 ++- tests/skyserve/update/bump_version_before.yaml | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/tests/skyserve/update/bump_version_after.yaml 
b/tests/skyserve/update/bump_version_after.yaml
index e37416ed861..8709c8a9a90 100644
--- a/tests/skyserve/update/bump_version_after.yaml
+++ b/tests/skyserve/update/bump_version_after.yaml
@@ -24,4 +24,5 @@ setup: |

 run: |
 cd skypilot/examples/serve/http_server
- python3 server.py
\ No newline at end of file
+ python3 server.py --port 8081
+ 
\ No newline at end of file
diff --git a/tests/skyserve/update/bump_version_before.yaml b/tests/skyserve/update/bump_version_before.yaml
index 8423d673c61..c38c4288538 100644
--- a/tests/skyserve/update/bump_version_before.yaml
+++ b/tests/skyserve/update/bump_version_before.yaml
@@ -24,4 +24,5 @@ setup: |

 run: |
 cd skypilot/examples/serve/http_server
- python3 server.py
\ No newline at end of file
+ python3 server.py --port 8081
+ 
\ No newline at end of file

From d602225c897f10ca67a2d5b5db21982c0dc8c1ec Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Thu, 19 Sep 2024 13:00:04 -0700
Subject: [PATCH 07/93] [LLM] Add Qwen2-VL multimodal example (#3961)

Add multimodal example

---
 llm/qwen/README.md | 29 +++++++++++++++++++++++++++++
 llm/qwen/qwen2-vl-7b.yaml | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)
 create mode 100644 llm/qwen/qwen2-vl-7b.yaml

diff --git a/llm/qwen/README.md b/llm/qwen/README.md
index cd2c88f5e75..6846fc71f2f 100644
--- a/llm/qwen/README.md
+++ b/llm/qwen/README.md
@@ -67,6 +67,35 @@ curl http://$ENDPOINT/v1/chat/completions \
 }' | jq -r '.choices[0].message.content'
```

+## Running Multimodal Qwen2-VL
+
+
+1. Start serving Qwen2-VL:
+
+```console
+sky launch -c qwen2-vl qwen2-vl-7b.yaml
+```
+2. Send a multimodal request to the endpoint for completion:
+```bash
+ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)
+
+curl http://$ENDPOINT/v1/chat/completions \
+ -H 'Content-Type: application/json' \
+ -H 'Authorization: Bearer token' \
+ --data '{
+ "model": "Qwen/Qwen2-VL-7B-Instruct",
+ "messages": [
+ {
+ "role": "user",
+ "content": [
+ {"type" : "text", "text": "Convert this logo to ASCII art"},
+ {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
+ ]
+ }],
+ "max_tokens": 1024
+ }' | jq .
+```
+
 ## Scale up the service with SkyServe

 1. With [SkyPilot Serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:

diff --git a/llm/qwen/qwen2-vl-7b.yaml b/llm/qwen/qwen2-vl-7b.yaml
new file mode 100644
index 00000000000..cc7600bbd9e
--- /dev/null
+++ b/llm/qwen/qwen2-vl-7b.yaml
@@ -0,0 +1,36 @@
+envs:
+ MODEL_NAME: Qwen/Qwen2-VL-7B-Instruct
+
+service:
+ # Specifying the path to the endpoint to check the readiness of the replicas.
+ readiness_probe:
+ path: /v1/chat/completions
+ post_data:
+ model: $MODEL_NAME
+ messages:
+ - role: user
+ content: Hello! What is your name?
+ max_tokens: 1
+ initial_delay_seconds: 1200
+ # How many replicas to manage.
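+ # Each replica is an independent vLLM server; SkyServe fronts them with a
+ # single endpoint, load-balancing requests and autoscaling based on load.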
+ replicas: 2 + + +resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} + disk_tier: best + ports: 8000 + +setup: | + # Install later transformers version for the support of + # qwen2_vl support + pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 + pip install vllm==0.6.1.post2 + pip install vllm-flash-attn + +run: | + export PATH=$PATH:/sbin + vllm serve $MODEL_NAME \ + --host 0.0.0.0 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 2048 | tee ~/openai_api_server.log From 31c0a5ce583f4eaf9da20ee0728d34953154e9e8 Mon Sep 17 00:00:00 2001 From: Haijian Wang <130898843+Haijian06@users.noreply.github.com> Date: Tue, 24 Sep 2024 04:52:56 +0800 Subject: [PATCH 08/93] Update README.md (#3969) * Add files via upload * Update and rename qwen2-7b.yaml to yi15-6b.yaml * Add files via upload * Update yi15-9b.yaml * Update yi15-34b.yaml * Update yi15-6b.yaml * Add files via upload * Update yicoder-1_5b.yaml * Update yicoder-9b.yaml * Add files via upload * Update yi15-34b.yaml * Update yi15-6b.yaml * Update yi15-9b.yaml * Update yicoder-1_5b.yaml * Update yicoder-9b.yaml * Update README.md --- llm/yi/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/llm/yi/README.md b/llm/yi/README.md index 76fcf6151e6..1353320aa9f 100644 --- a/llm/yi/README.md +++ b/llm/yi/README.md @@ -1,4 +1,4 @@ -# Serving Yi on Your Own Kubernetes or Cloud +# Running Yi with SkyPilot on Your Cloud πŸ€– The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en). From 800f7d6971bd604f266faebb33d044c7d5baca55 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Mon, 23 Sep 2024 21:39:28 -0700 Subject: [PATCH 09/93] [Core] Admin policy enforcement plugin (#3966) * support policy hook * test task labels * Add test for policy that sets labels * Fix comment * format * use -e to make test related files visible * Add config.rst * Fix test * fix config rst * Apply policy to service * add policy for serving * Add docs * fix * format * Update interface * fix * Fix * fix * Fix test config * Fix mutated config * fix * Add policy doc * rename * minor * Add additional arguments for autostop * fix mypy * format * rejected message * format * Update sky/utils/policy_utils.py Co-authored-by: Zongheng Yang * Update sky/utils/policy_utils.py Co-authored-by: Zongheng Yang * Fix * Update examples/admin_policy/example_policy/example_policy/__init__.py Co-authored-by: Zongheng Yang * Update docs/source/reference/config.rst Co-authored-by: Zongheng Yang * Address comments * format * changes in examples * Fix enforce autostop * Fix autostop enforcement * fix test * Update docs/source/cloud-setup/policy.rst Co-authored-by: Zongheng Yang * Update sky/admin_policy.py Co-authored-by: Zongheng Yang * Update sky/admin_policy.py Co-authored-by: Zongheng Yang * wip * Update docs/source/cloud-setup/policy.rst Co-authored-by: Zongheng Yang * Update docs/source/cloud-setup/policy.rst Co-authored-by: Zongheng Yang * Update docs/source/cloud-setup/policy.rst Co-authored-by: Zongheng Yang * fix * fix * fix * Use sky.status for autostop * update policy * Update docs/source/cloud-setup/policy.rst Co-authored-by: Zongheng Yang * fix policy.rst * Add comment * Fix logging * fix CI * Update docs/source/cloud-setup/policy.rst Co-authored-by: Zongheng Yang * Use sphnix inline code * Add comment * fix skypilot config file mounts for jobs and serve --------- Co-authored-by: 
Zongheng Yang --- .github/workflows/pytest.yml | 2 +- docs/source/cloud-setup/policy.rst | 195 ++++++++++++++++++ docs/source/docs/index.rst | 3 +- docs/source/reference/config.rst | 11 + examples/admin_policy/add_labels.yaml | 1 + examples/admin_policy/disable_public_ip.yaml | 1 + examples/admin_policy/enforce_autostop.yaml | 1 + .../example_policy/example_policy/__init__.py | 6 + .../example_policy/skypilot_policy.py | 121 +++++++++++ .../example_policy/pyproject.toml | 7 + examples/admin_policy/reject_all.yaml | 1 + examples/admin_policy/task.yaml | 12 ++ examples/admin_policy/use_spot_for_gpu.yaml | 1 + sky/__init__.py | 9 + sky/admin_policy.py | 101 +++++++++ sky/dag.py | 30 +-- sky/exceptions.py | 5 + sky/execution.py | 16 +- sky/jobs/controller.py | 2 + sky/jobs/core.py | 4 + sky/serve/core.py | 6 + sky/skypilot_config.py | 101 +++++---- sky/templates/jobs-controller.yaml.j2 | 4 +- sky/templates/sky-serve-controller.yaml.j2 | 4 +- sky/utils/admin_policy_utils.py | 145 +++++++++++++ sky/utils/common_utils.py | 11 +- sky/utils/controller_utils.py | 71 ++++--- sky/utils/dag_utils.py | 9 +- sky/utils/schemas.py | 8 + tests/test_config.py | 25 ++- tests/unit_tests/test_admin_policy.py | 172 +++++++++++++++ tests/unit_tests/test_backend_utils.py | 39 ++-- tests/unit_tests/test_common_utils.py | 8 +- tests/unit_tests/test_resources.py | 31 ++- 34 files changed, 1024 insertions(+), 139 deletions(-) create mode 100644 docs/source/cloud-setup/policy.rst create mode 100644 examples/admin_policy/add_labels.yaml create mode 100644 examples/admin_policy/disable_public_ip.yaml create mode 100644 examples/admin_policy/enforce_autostop.yaml create mode 100644 examples/admin_policy/example_policy/example_policy/__init__.py create mode 100644 examples/admin_policy/example_policy/example_policy/skypilot_policy.py create mode 100644 examples/admin_policy/example_policy/pyproject.toml create mode 100644 examples/admin_policy/reject_all.yaml create mode 100644 examples/admin_policy/task.yaml create mode 100644 examples/admin_policy/use_spot_for_gpu.yaml create mode 100644 sky/admin_policy.py create mode 100644 sky/utils/admin_policy_utils.py create mode 100644 tests/unit_tests/test_admin_policy.py diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml index ac723f35fc2..3faf75acf8d 100644 --- a/.github/workflows/pytest.yml +++ b/.github/workflows/pytest.yml @@ -53,7 +53,7 @@ jobs: - name: Install dependencies run: | python -m pip install --upgrade pip - pip install ".[all]" + pip install -e ".[all]" pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0 - name: Run tests with pytest diff --git a/docs/source/cloud-setup/policy.rst b/docs/source/cloud-setup/policy.rst new file mode 100644 index 00000000000..0d3e3444372 --- /dev/null +++ b/docs/source/cloud-setup/policy.rst @@ -0,0 +1,195 @@ +.. _advanced-policy-config: + +Admin Policy Enforcement +======================== + + +SkyPilot provides an **admin policy** mechanism that admins can use to enforce certain policies on users' SkyPilot usage. An admin policy applies +custom validation and mutation logic to a user's tasks and SkyPilot config. 
+
+Example usage:
+
+- :ref:`kubernetes-labels-policy`
+- :ref:`disable-public-ip-policy`
+- :ref:`use-spot-for-gpu-policy`
+- :ref:`enforce-autostop-policy`
+
+
+To implement and use an admin policy:
+
+- Admins write a simple Python package with a policy class that implements SkyPilot's ``sky.AdminPolicy`` interface;
+- Admins distribute this package to users;
+- Users simply set the ``admin_policy`` field in the SkyPilot config file ``~/.sky/config.yaml`` for the policy to go into effect.
+
+
+Overview
+--------
+
+
+
+User-Side
+~~~~~~~~~~
+
+To apply the policy, a user needs to set the ``admin_policy`` field in the SkyPilot config
+``~/.sky/config.yaml`` to the path of the Python package that implements the policy.
+For example:
+
+.. code-block:: yaml
+
+ admin_policy: mypackage.subpackage.MyPolicy
+
+
+.. hint::
+
+ SkyPilot loads the policy from the given package in the same Python environment.
+ You can test the existence of the policy by running:
+
+ .. code-block:: bash
+
+ python -c "from mypackage.subpackage import MyPolicy"
+
+
+Admin-Side
+~~~~~~~~~~
+
+An admin can distribute the Python package to users with a pre-defined policy. The
+policy should implement the ``sky.AdminPolicy`` `interface `_:
+
+
+.. literalinclude:: ../../../sky/admin_policy.py
+ :language: python
+ :pyobject: AdminPolicy
+ :caption: `AdminPolicy Interface `_
+
+
+Your custom admin policy should look like this:
+
+.. code-block:: python
+
+ import sky
+
+ class MyPolicy(sky.AdminPolicy):
+ @classmethod
+ def validate_and_mutate(cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
+ # Logic to validate and mutate user requests.
+ ...
+ return sky.MutatedUserRequest(user_request.task,
+ user_request.skypilot_config)
+
+
+``UserRequest`` and ``MutatedUserRequest`` are defined as follows (see `source code `_ for more details):
+
+
+.. literalinclude:: ../../../sky/admin_policy.py
+ :language: python
+ :pyobject: UserRequest
+ :caption: `UserRequest Class `_
+
+.. literalinclude:: ../../../sky/admin_policy.py
+ :language: python
+ :pyobject: MutatedUserRequest
+ :caption: `MutatedUserRequest Class `_
+
+
+In other words, an ``AdminPolicy`` can mutate any fields of a user request, including
+the :ref:`task ` and the :ref:`global skypilot config `,
+giving admins a lot of flexibility to control users' SkyPilot usage.
+
+An ``AdminPolicy`` can be used to both validate and mutate user requests. If
+a request should be rejected, the policy should raise an exception.
+
+
+The ``sky.Config`` and ``sky.RequestOptions`` classes are defined as follows:
+
+.. literalinclude:: ../../../sky/skypilot_config.py
+ :language: python
+ :pyobject: Config
+ :caption: `Config Class `_
+
+
+.. literalinclude:: ../../../sky/admin_policy.py
+ :language: python
+ :pyobject: RequestOptions
+ :caption: `RequestOptions Class `_
+
+
+Example Policies
+----------------
+
+We have provided a few example policies in `examples/admin_policy/example_policy `_. You can test these policies by installing the example policy package in your Python environment.
+
+.. code-block:: bash
+
+ git clone https://github.com/skypilot-org/skypilot.git
+ cd skypilot
+ pip install examples/admin_policy/example_policy
+
+Reject All
+~~~~~~~~~~
+
+.. literalinclude:: ../../../examples/admin_policy/example_policy/example_policy/skypilot_policy.py
+ :language: python
+ :pyobject: RejectAllPolicy
+ :caption: `RejectAllPolicy `_
+
+..
literalinclude:: ../../../examples/admin_policy/reject_all.yaml + :language: yaml + :caption: `Config YAML for using RejectAllPolicy `_ + +.. _kubernetes-labels-policy: + +Add Labels for all Tasks on Kubernetes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. literalinclude:: ../../../examples/admin_policy/example_policy/example_policy/skypilot_policy.py + :language: python + :pyobject: AddLabelsPolicy + :caption: `AddLabelsPolicy `_ + +.. literalinclude:: ../../../examples/admin_policy/add_labels.yaml + :language: yaml + :caption: `Config YAML for using AddLabelsPolicy `_ + + +.. _disable-public-ip-policy: + +Always Disable Public IP for AWS Tasks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. literalinclude:: ../../../examples/admin_policy/example_policy/example_policy/skypilot_policy.py + :language: python + :pyobject: DisablePublicIpPolicy + :caption: `DisablePublicIpPolicy `_ + +.. literalinclude:: ../../../examples/admin_policy/disable_public_ip.yaml + :language: yaml + :caption: `Config YAML for using DisablePublicIpPolicy `_ + +.. _use-spot-for-gpu-policy: + +Use Spot for all GPU Tasks +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. +.. literalinclude:: ../../../examples/admin_policy/example_policy/example_policy/skypilot_policy.py + :language: python + :pyobject: UseSpotForGpuPolicy + :caption: `UseSpotForGpuPolicy `_ + +.. literalinclude:: ../../../examples/admin_policy/use_spot_for_gpu.yaml + :language: yaml + :caption: `Config YAML for using UseSpotForGpuPolicy `_ + +.. _enforce-autostop-policy: + +Enforce Autostop for all Tasks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. literalinclude:: ../../../examples/admin_policy/example_policy/example_policy/skypilot_policy.py + :language: python + :pyobject: EnforceAutostopPolicy + :caption: `EnforceAutostopPolicy `_ + +.. literalinclude:: ../../../examples/admin_policy/enforce_autostop.yaml + :language: yaml + :caption: `Config YAML for using EnforceAutostopPolicy `_ diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index eeef2386337..00a645a3834 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -201,7 +201,8 @@ Read the research: ../cloud-setup/cloud-permissions/index ../cloud-setup/cloud-auth ../cloud-setup/quota - + ../cloud-setup/policy + .. toctree:: :hidden: :maxdepth: 1 diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst index 6c2fe2569a6..ebe8db6751f 100644 --- a/docs/source/reference/config.rst +++ b/docs/source/reference/config.rst @@ -87,6 +87,17 @@ Available fields and semantics: # Default: false. disable_ecc: false + # Admin policy to be applied to all tasks. (optional). + # + # The policy class to be applied to all tasks, which can be used to validate + # and mutate user requests. + # + # This is useful for enforcing certain policies on all tasks, e.g., + # add custom labels; enforce certain resource limits; etc. + # + # The policy class should implement the sky.AdminPolicy interface. + admin_policy: my_package.SkyPilotPolicyV1 + # Advanced AWS configurations (optional). # Apply to all new instances but not existing ones. 
aws:
diff --git a/examples/admin_policy/add_labels.yaml b/examples/admin_policy/add_labels.yaml
new file mode 100644
index 00000000000..113b3b78044
--- /dev/null
+++ b/examples/admin_policy/add_labels.yaml
@@ -0,0 +1 @@
+admin_policy: example_policy.AddLabelsPolicy
diff --git a/examples/admin_policy/disable_public_ip.yaml b/examples/admin_policy/disable_public_ip.yaml
new file mode 100644
index 00000000000..cef910cbdaf
--- /dev/null
+++ b/examples/admin_policy/disable_public_ip.yaml
@@ -0,0 +1 @@
+admin_policy: example_policy.DisablePublicIpPolicy
diff --git a/examples/admin_policy/enforce_autostop.yaml b/examples/admin_policy/enforce_autostop.yaml
new file mode 100644
index 00000000000..f0194fb994e
--- /dev/null
+++ b/examples/admin_policy/enforce_autostop.yaml
@@ -0,0 +1 @@
+admin_policy: example_policy.EnforceAutostopPolicy
diff --git a/examples/admin_policy/example_policy/example_policy/__init__.py b/examples/admin_policy/example_policy/example_policy/__init__.py
new file mode 100644
index 00000000000..12ca4e952e2
--- /dev/null
+++ b/examples/admin_policy/example_policy/example_policy/__init__.py
@@ -0,0 +1,6 @@
+"""Example admin policy module and prebuilt policies."""
+from example_policy.skypilot_policy import AddLabelsPolicy
+from example_policy.skypilot_policy import DisablePublicIpPolicy
+from example_policy.skypilot_policy import EnforceAutostopPolicy
+from example_policy.skypilot_policy import RejectAllPolicy
+from example_policy.skypilot_policy import UseSpotForGpuPolicy
diff --git a/examples/admin_policy/example_policy/example_policy/skypilot_policy.py b/examples/admin_policy/example_policy/example_policy/skypilot_policy.py
new file mode 100644
index 00000000000..dc4e4b873fb
--- /dev/null
+++ b/examples/admin_policy/example_policy/example_policy/skypilot_policy.py
@@ -0,0 +1,121 @@
+"""Example prebuilt admin policies."""
+import sky
+
+
+class RejectAllPolicy(sky.AdminPolicy):
+ """Example policy: rejects all user requests."""
+
+ @classmethod
+ def validate_and_mutate(
+ cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
+ """Rejects all user requests."""
+ raise RuntimeError('Reject all policy')
+
+
+class AddLabelsPolicy(sky.AdminPolicy):
+ """Example policy: adds a kubernetes label for skypilot_config."""
+
+ @classmethod
+ def validate_and_mutate(
+ cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
+ config = user_request.skypilot_config
+ labels = config.get_nested(('kubernetes', 'custom_metadata', 'labels'),
+ {})
+ labels['app'] = 'skypilot'
+ config.set_nested(('kubernetes', 'custom_metadata', 'labels'), labels)
+ return sky.MutatedUserRequest(user_request.task, config)
+
+
+class DisablePublicIpPolicy(sky.AdminPolicy):
+ """Example policy: disables public IP for all AWS tasks."""
+
+ @classmethod
+ def validate_and_mutate(
+ cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
+ config = user_request.skypilot_config
+ config.set_nested(('aws', 'use_internal_ip'), True)
+ if config.get_nested(('aws', 'vpc_name'), None) is None:
+ # If no VPC name is specified, it is likely a mistake. We should
+ # reject the request
+ raise RuntimeError('VPC name should be set.
Check organization '
+ 'wiki for more information.')
+ return sky.MutatedUserRequest(user_request.task, config)
+
+
+class UseSpotForGpuPolicy(sky.AdminPolicy):
+ """Example policy: use spot instances for all GPU tasks."""
+
+ @classmethod
+ def validate_and_mutate(
+ cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
+ """Sets use_spot to True for all GPU tasks."""
+ task = user_request.task
+ new_resources = []
+ for r in task.resources:
+ if r.accelerators:
+ new_resources.append(r.copy(use_spot=True))
+ else:
+ new_resources.append(r)
+
+ task.set_resources(type(task.resources)(new_resources))
+
+ return sky.MutatedUserRequest(
+ task=task, skypilot_config=user_request.skypilot_config)
+
+
+class EnforceAutostopPolicy(sky.AdminPolicy):
+ """Example policy: enforce autostop for all tasks."""
+
+ @classmethod
+ def validate_and_mutate(
+ cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
+ """Enforces autostop for all tasks.
+
+ Note that with this policy enforced, users can still change the autostop
+ setting for an existing cluster by using `sky autostop`.
+
+ Since we refresh the cluster status with `sky.status` whenever this
+ policy is applied, we should expect a few seconds of latency when a user
+ runs a request.
+ """
+ request_options = user_request.request_options
+
+ # Request options is None when a task is executed with `jobs launch` or
+ # `sky serve up`.
+ if request_options is None:
+ return sky.MutatedUserRequest(
+ task=user_request.task,
+ skypilot_config=user_request.skypilot_config)
+
+ # Get the cluster record to operate on.
+ cluster_name = request_options.cluster_name
+ cluster_records = []
+ if cluster_name is not None:
+ cluster_records = sky.status(cluster_name, refresh=True)
+
+ # Check if the user request should specify autostop settings.
+ need_autostop = False
+ if not cluster_records:
+ # Cluster does not exist
+ need_autostop = True
+ elif cluster_records[0]['status'] == sky.ClusterStatus.STOPPED:
+ # Cluster is stopped
+ need_autostop = True
+ elif cluster_records[0]['autostop'] < 0:
+ # Cluster is running but autostop is not set
+ need_autostop = True
+
+ # Check if the user request is setting autostop settings.
+ is_setting_autostop = False
+ idle_minutes_to_autostop = request_options.idle_minutes_to_autostop
+ is_setting_autostop = (idle_minutes_to_autostop is not None and
+ idle_minutes_to_autostop >= 0)
+
+ # If the cluster requires autostop but the user request is not setting
+ # autostop settings, raise an error.
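+ # In other words: autostop is required whenever the cluster would
+ # otherwise end up RUNNING without it (it does not exist yet, is
+ # stopped, or is running with autostop disabled).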
+ if need_autostop and not is_setting_autostop: + raise RuntimeError('Autostop/down must be set for all clusters.') + + return sky.MutatedUserRequest( + task=user_request.task, + skypilot_config=user_request.skypilot_config) diff --git a/examples/admin_policy/example_policy/pyproject.toml b/examples/admin_policy/example_policy/pyproject.toml new file mode 100644 index 00000000000..b4aa56be4b2 --- /dev/null +++ b/examples/admin_policy/example_policy/pyproject.toml @@ -0,0 +1,7 @@ +[build-system] +requires = ["setuptools>=61.0", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "example_policy" +version = "0.0.1" diff --git a/examples/admin_policy/reject_all.yaml b/examples/admin_policy/reject_all.yaml new file mode 100644 index 00000000000..fe6632089d9 --- /dev/null +++ b/examples/admin_policy/reject_all.yaml @@ -0,0 +1 @@ +admin_policy: example_policy.RejectAllPolicy diff --git a/examples/admin_policy/task.yaml b/examples/admin_policy/task.yaml new file mode 100644 index 00000000000..065b4cbfb11 --- /dev/null +++ b/examples/admin_policy/task.yaml @@ -0,0 +1,12 @@ +resources: + cloud: aws + cpus: 2 + labels: + other_labels: test + + +setup: | + echo "setup" + +run: | + echo "run" diff --git a/examples/admin_policy/use_spot_for_gpu.yaml b/examples/admin_policy/use_spot_for_gpu.yaml new file mode 100644 index 00000000000..45f257017a4 --- /dev/null +++ b/examples/admin_policy/use_spot_for_gpu.yaml @@ -0,0 +1 @@ +admin_policy: example_policy.UseSpotForGpuPolicy diff --git a/sky/__init__.py b/sky/__init__.py index a077fb8966a..37b5a1caf08 100644 --- a/sky/__init__.py +++ b/sky/__init__.py @@ -82,6 +82,9 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]): from sky import backends from sky import benchmark from sky import clouds +from sky.admin_policy import AdminPolicy +from sky.admin_policy import MutatedUserRequest +from sky.admin_policy import UserRequest from sky.clouds.service_catalog import list_accelerators from sky.core import autostop from sky.core import cancel @@ -112,6 +115,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]): from sky.optimizer import OptimizeTarget from sky.resources import Resources from sky.skylet.job_lib import JobStatus +from sky.skypilot_config import Config from sky.status_lib import ClusterStatus from sky.task import Task @@ -185,4 +189,9 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]): # core APIs Storage Management 'storage_ls', 'storage_delete', + # Admin Policy + 'UserRequest', + 'MutatedUserRequest', + 'AdminPolicy', + 'Config', ] diff --git a/sky/admin_policy.py b/sky/admin_policy.py new file mode 100644 index 00000000000..304285d04b7 --- /dev/null +++ b/sky/admin_policy.py @@ -0,0 +1,101 @@ +"""Interface for admin-defined policy for user requests.""" +import abc +import dataclasses +import typing +from typing import Optional + +if typing.TYPE_CHECKING: + import sky + + +@dataclasses.dataclass +class RequestOptions: + """Request options for admin policy. + + Args: + cluster_name: Name of the cluster to create/reuse. It is None if not + specified by the user. + idle_minutes_to_autostop: Autostop setting requested by a user. The + cluster will be set to autostop after this many minutes of idleness. + down: If true, use autodown rather than autostop. + dryrun: Is the request a dryrun? + """ + cluster_name: Optional[str] + idle_minutes_to_autostop: Optional[int] + down: bool + dryrun: bool + + +@dataclasses.dataclass +class UserRequest: + """A user request. 
+ + A "user request" is defined as a `sky launch / exec` command or its API + equivalent. + + `sky jobs launch / serve up` involves multiple launch requests, including + the launch of controller and clusters for a job (which can have multiple + tasks if it is a pipeline) or service replicas. Each launch is a separate + request. + + This class wraps the underlying task, the global skypilot config used to run + a task, and the request options. + + Args: + task: User specified task. + skypilot_config: Global skypilot config to be used in this request. + request_options: Request options. It is None for jobs and services. + """ + task: 'sky.Task' + skypilot_config: 'sky.Config' + request_options: Optional['RequestOptions'] = None + + +@dataclasses.dataclass +class MutatedUserRequest: + task: 'sky.Task' + skypilot_config: 'sky.Config' + + +# pylint: disable=line-too-long +class AdminPolicy: + """Abstract interface of an admin-defined policy for all user requests. + + Admins can implement a subclass of AdminPolicy with the following signature: + + import sky + + class SkyPilotPolicyV1(sky.AdminPolicy): + def validate_and_mutate(user_request: UserRequest) -> MutatedUserRequest: + ... + return MutatedUserRequest(task=..., skypilot_config=...) + + The policy can mutate both task and skypilot_config. Admins then distribute + a simple module that contains this implementation, installable in a way + that it can be imported by users from the same Python environment where + SkyPilot is running. + + Users can register a subclass of AdminPolicy in the SkyPilot config file + under the key 'admin_policy', e.g. + + admin_policy: my_package.SkyPilotPolicyV1 + """ + + @classmethod + @abc.abstractmethod + def validate_and_mutate(cls, + user_request: UserRequest) -> MutatedUserRequest: + """Validates and mutates the user request and returns mutated request. + + Args: + user_request: The user request to validate and mutate. + UserRequest contains (sky.Task, sky.Config) + + Returns: + MutatedUserRequest: The mutated user request. + + Raises: + Exception to throw if the user request failed the validation. + """ + raise NotImplementedError( + 'Your policy must implement validate_and_mutate') diff --git a/sky/dag.py b/sky/dag.py index d1904eb9fcc..4af5adc76b5 100644 --- a/sky/dag.py +++ b/sky/dag.py @@ -1,8 +1,12 @@ """DAGs: user applications to be run.""" import pprint import threading +import typing from typing import List, Optional +if typing.TYPE_CHECKING: + from sky import task + class Dag: """Dag: a user application, represented as a DAG of Tasks. @@ -13,37 +17,37 @@ class Dag: >>> task = sky.Task(...) 
""" - def __init__(self): - self.tasks = [] + def __init__(self) -> None: + self.tasks: List['task.Task'] = [] import networkx as nx # pylint: disable=import-outside-toplevel self.graph = nx.DiGraph() - self.name = None + self.name: Optional[str] = None - def add(self, task): + def add(self, task: 'task.Task') -> None: self.graph.add_node(task) self.tasks.append(task) - def remove(self, task): + def remove(self, task: 'task.Task') -> None: self.tasks.remove(task) self.graph.remove_node(task) - def add_edge(self, op1, op2): + def add_edge(self, op1: 'task.Task', op2: 'task.Task') -> None: assert op1 in self.graph.nodes assert op2 in self.graph.nodes self.graph.add_edge(op1, op2) - def __len__(self): + def __len__(self) -> int: return len(self.tasks) - def __enter__(self): + def __enter__(self) -> 'Dag': push_dag(self) return self - def __exit__(self, exc_type, exc_value, traceback): + def __exit__(self, exc_type, exc_value, traceback) -> None: pop_dag() - def __repr__(self): + def __repr__(self) -> str: pformat = pprint.pformat(self.tasks) return f'DAG:\n{pformat}' @@ -70,15 +74,15 @@ def is_chain(self) -> bool: class _DagContext(threading.local): """A thread-local stack of Dags.""" - _current_dag = None + _current_dag: Optional[Dag] = None _previous_dags: List[Dag] = [] - def push_dag(self, dag): + def push_dag(self, dag: Dag): if self._current_dag is not None: self._previous_dags.append(self._current_dag) self._current_dag = dag - def pop_dag(self): + def pop_dag(self) -> Optional[Dag]: old_dag = self._current_dag if self._previous_dags: self._current_dag = self._previous_dags.pop() diff --git a/sky/exceptions.py b/sky/exceptions.py index 15f3ea3f34e..04c50ad4e08 100644 --- a/sky/exceptions.py +++ b/sky/exceptions.py @@ -286,3 +286,8 @@ class ServeUserTerminatedError(Exception): class PortDoesNotExistError(Exception): """Raised when the port does not exist.""" + + +class UserRequestRejectedByPolicy(Exception): + """Raised when a user request is rejected by an admin policy.""" + pass diff --git a/sky/execution.py b/sky/execution.py index 1f6bd09f9c3..792ca5fffc0 100644 --- a/sky/execution.py +++ b/sky/execution.py @@ -9,6 +9,7 @@ import colorama import sky +from sky import admin_policy from sky import backends from sky import clouds from sky import global_user_state @@ -16,6 +17,7 @@ from sky import sky_logging from sky.backends import backend_utils from sky.usage import usage_lib +from sky.utils import admin_policy_utils from sky.utils import controller_utils from sky.utils import dag_utils from sky.utils import env_options @@ -158,7 +160,16 @@ def _execute( handle: Optional[backends.ResourceHandle]; the handle to the cluster. None if dryrun. """ + dag = dag_utils.convert_entrypoint_to_dag(entrypoint) + dag, _ = admin_policy_utils.apply( + dag, + request_options=admin_policy.RequestOptions( + cluster_name=cluster_name, + idle_minutes_to_autostop=idle_minutes_to_autostop, + down=down, + dryrun=dryrun, + )) assert len(dag) == 1, f'We support 1 task for now. {dag}' task = dag.tasks[0] @@ -170,9 +181,8 @@ def _execute( cluster_exists = False if cluster_name is not None: - existing_handle = global_user_state.get_handle_from_cluster_name( - cluster_name) - cluster_exists = existing_handle is not None + cluster_record = global_user_state.get_cluster_from_name(cluster_name) + cluster_exists = cluster_record is not None # TODO(woosuk): If the cluster exists, print a warning that # `cpus` and `memory` are not used as a job scheduling constraint, # unlike `gpus`. 
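The `RequestOptions` built in `_execute` above is exactly what a policy's `validate_and_mutate` receives. A minimal sketch of a policy consuming it (`RequireAutostopPolicy` is a hypothetical name; the `sky.AdminPolicy`, `UserRequest`, and `MutatedUserRequest` interfaces are the ones added in this patch):

```python
import sky

class RequireAutostopPolicy(sky.AdminPolicy):
    """Hypothetical policy: reject interactive launches without autostop."""

    @classmethod
    def validate_and_mutate(
            cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
        opts = user_request.request_options
        # Per the diff above, request_options is None for requests that come
        # from `sky jobs launch` or `sky serve up`.
        if opts is not None and opts.idle_minutes_to_autostop is None:
            raise RuntimeError('Pass -i/--idle-minutes-to-autostop.')
        return sky.MutatedUserRequest(user_request.task,
                                      user_request.skypilot_config)
```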
diff --git a/sky/jobs/controller.py b/sky/jobs/controller.py index 39c89d2784b..f3cd81576e2 100644 --- a/sky/jobs/controller.py +++ b/sky/jobs/controller.py @@ -64,6 +64,7 @@ def __init__(self, job_id: int, dag_yaml: str, if len(self._dag.tasks) <= 1: task_name = self._dag_name else: + assert task.name is not None, task task_name = task.name # This is guaranteed by the spot_launch API, where we fill in # the task.name with @@ -447,6 +448,7 @@ def _cleanup(job_id: int, dag_yaml: str): # controller, we should keep it in sync with JobsController.__init__() dag, _ = _get_dag_and_name(dag_yaml) for task in dag.tasks: + assert task.name is not None, task cluster_name = managed_job_utils.generate_managed_job_cluster_name( task.name, job_id) recovery_strategy.terminate_cluster(cluster_name) diff --git a/sky/jobs/core.py b/sky/jobs/core.py index 561d47f4b25..c4f59f65eca 100644 --- a/sky/jobs/core.py +++ b/sky/jobs/core.py @@ -18,6 +18,7 @@ from sky.jobs import utils as managed_job_utils from sky.skylet import constants as skylet_constants from sky.usage import usage_lib +from sky.utils import admin_policy_utils from sky.utils import common_utils from sky.utils import controller_utils from sky.utils import dag_utils @@ -54,6 +55,8 @@ def launch( dag_uuid = str(uuid.uuid4().hex[:4]) dag = dag_utils.convert_entrypoint_to_dag(entrypoint) + dag, mutated_user_config = admin_policy_utils.apply( + dag, use_mutated_config_in_current_request=False) if not dag.is_chain(): with ux_utils.print_exception_no_traceback(): raise ValueError('Only single-task or chain DAG is ' @@ -103,6 +106,7 @@ def launch( **controller_utils.shared_controller_vars_to_fill( controller_utils.Controllers.JOBS_CONTROLLER, remote_user_config_path=remote_user_config_path, + local_user_config=mutated_user_config, ), } diff --git a/sky/serve/core.py b/sky/serve/core.py index 4f15413cf7f..2bb6e1384ee 100644 --- a/sky/serve/core.py +++ b/sky/serve/core.py @@ -17,6 +17,7 @@ from sky.serve import serve_utils from sky.skylet import constants from sky.usage import usage_lib +from sky.utils import admin_policy_utils from sky.utils import common_utils from sky.utils import controller_utils from sky.utils import resources_utils @@ -124,6 +125,10 @@ def up( _validate_service_task(task) + dag, mutated_user_config = admin_policy_utils.apply( + task, use_mutated_config_in_current_request=False) + task = dag.tasks[0] + controller_utils.maybe_translate_local_file_mounts_and_sync_up(task, path='serve') @@ -158,6 +163,7 @@ def up( **controller_utils.shared_controller_vars_to_fill( controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, remote_user_config_path=remote_config_yaml_path, + local_user_config=mutated_user_config, ), } common_utils.fill_template(serve_constants.CONTROLLER_TEMPLATE, diff --git a/sky/skypilot_config.py b/sky/skypilot_config.py index 52e1d0ae3d9..aae62afc616 100644 --- a/sky/skypilot_config.py +++ b/sky/skypilot_config.py @@ -61,6 +61,8 @@ from sky.utils import schemas from sky.utils import ux_utils +logger = sky_logging.init_logger(__name__) + # The config path is discovered in this order: # # (1) (Used internally) If env var {ENV_VAR_SKYPILOT_CONFIG} exists, use its @@ -78,11 +80,57 @@ # Path to the local config file. 
CONFIG_PATH = '~/.sky/config.yaml' -logger = sky_logging.init_logger(__name__) + +class Config(Dict[str, Any]): + """SkyPilot config that supports setting/getting values with nested keys.""" + + def get_nested(self, + keys: Tuple[str, ...], + default_value: Any, + override_configs: Optional[Dict[str, Any]] = None) -> Any: + """Gets a nested key. + + If any key is not found, or any intermediate key does not point to a + dict value, returns 'default_value'. + + Args: + keys: A tuple of strings representing the nested keys. + default_value: The default value to return if the key is not found. + override_configs: A dict of override configs with the same schema as + the config file, but only containing the keys to override. + + Returns: + The value of the nested key, or 'default_value' if not found. + """ + config = copy.deepcopy(self) + if override_configs is not None: + config = _recursive_update(config, override_configs) + return _get_nested(config, keys, default_value) + + def set_nested(self, keys: Tuple[str, ...], value: Any) -> None: + """In-place sets a nested key to value. + + Like get_nested(), if any key is not found, this will not raise an + error. + """ + override = {} + for i, key in enumerate(reversed(keys)): + if i == 0: + override = {key: value} + else: + override = {key: override} + _recursive_update(self, override) + + @classmethod + def from_dict(cls, config: Optional[Dict[str, Any]]) -> 'Config': + if config is None: + return cls() + return cls(**config) + # The loaded config. -_dict: Optional[Dict[str, Any]] = None -_loaded_config_path = None +_dict = Config() +_loaded_config_path: Optional[str] = None def _get_nested(configs: Optional[Dict[str, Any]], keys: Iterable[str], @@ -131,17 +179,11 @@ def get_nested(keys: Tuple[str, ...], ), (f'Override configs must not be provided when keys {keys} is not within ' 'constants.OVERRIDEABLE_CONFIG_KEYS: ' f'{constants.OVERRIDEABLE_CONFIG_KEYS}') - config: Dict[str, Any] = {} - if _dict is not None: - config = copy.deepcopy(_dict) - if override_configs is None: - override_configs = {} - config = _recursive_update(config, override_configs) - return _get_nested(config, keys, default_value) + return _dict.get_nested(keys, default_value, override_configs) -def _recursive_update(base_config: Dict[str, Any], - override_config: Dict[str, Any]) -> Dict[str, Any]: +def _recursive_update(base_config: Config, + override_config: Dict[str, Any]) -> Config: """Recursively updates base configuration with override configuration""" for key, value in override_config.items(): if (isinstance(value, dict) and key in base_config and @@ -157,22 +199,14 @@ def set_nested(keys: Tuple[str, ...], value: Any) -> Dict[str, Any]: Like get_nested(), if any key is not found, this will not raise an error. 
""" - _check_loaded_or_die() - assert _dict is not None - override = {} - for i, key in enumerate(reversed(keys)): - if i == 0: - override = {key: value} - else: - override = {key: override} - return _recursive_update(copy.deepcopy(_dict), override) + copied_dict = copy.deepcopy(_dict) + copied_dict.set_nested(keys, value) + return dict(**copied_dict) -def to_dict() -> Dict[str, Any]: +def to_dict() -> Config: """Returns a deep-copied version of the current config.""" - if _dict is not None: - return copy.deepcopy(_dict) - return {} + return copy.deepcopy(_dict) def _try_load_config() -> None: @@ -192,13 +226,14 @@ def _try_load_config() -> None: config_path = os.path.expanduser(config_path) if os.path.exists(config_path): logger.debug(f'Using config path: {config_path}') - _loaded_config_path = config_path try: - _dict = common_utils.read_yaml(config_path) + config = common_utils.read_yaml(config_path) + _dict = Config.from_dict(config) + _loaded_config_path = config_path logger.debug(f'Config loaded:\n{pprint.pformat(_dict)}') except yaml.YAMLError as e: logger.error(f'Error in loading config file ({config_path}):', e) - if _dict is not None: + if _dict: common_utils.validate_schema( _dict, schemas.get_config_schema(), @@ -219,14 +254,6 @@ def loaded_config_path() -> Optional[str]: _try_load_config() -def _check_loaded_or_die(): - """Checks loaded() is true; otherwise raises RuntimeError.""" - if _dict is None: - raise RuntimeError( - f'No user configs loaded. Check {CONFIG_PATH} exists and ' - 'can be loaded.') - - def loaded() -> bool: """Returns if the user configurations are loaded.""" - return _dict is not None + return bool(_dict) diff --git a/sky/templates/jobs-controller.yaml.j2 b/sky/templates/jobs-controller.yaml.j2 index 51083e84a59..45cdb5141d4 100644 --- a/sky/templates/jobs-controller.yaml.j2 +++ b/sky/templates/jobs-controller.yaml.j2 @@ -4,7 +4,9 @@ name: {{dag_name}} file_mounts: {{remote_user_yaml_path}}: {{user_yaml_path}} - {{remote_user_config_path}}: skypilot:local_skypilot_config_path + {%- if local_user_config_path is not none %} + {{remote_user_config_path}}: {{local_user_config_path}} + {%- endif %} {%- for remote_catalog_path, local_catalog_path in modified_catalogs.items() %} {{remote_catalog_path}}: {{local_catalog_path}} {%- endfor %} diff --git a/sky/templates/sky-serve-controller.yaml.j2 b/sky/templates/sky-serve-controller.yaml.j2 index a20c2d680aa..507a6e3a325 100644 --- a/sky/templates/sky-serve-controller.yaml.j2 +++ b/sky/templates/sky-serve-controller.yaml.j2 @@ -23,7 +23,9 @@ setup: | file_mounts: {{remote_task_yaml_path}}: {{local_task_yaml_path}} - {{remote_user_config_path}}: skypilot:local_skypilot_config_path + {%- if local_user_config_path is not none %} + {{remote_user_config_path}}: {{local_user_config_path}} + {%- endif %} {%- for remote_catalog_path, local_catalog_path in modified_catalogs.items() %} {{remote_catalog_path}}: {{local_catalog_path}} {%- endfor %} diff --git a/sky/utils/admin_policy_utils.py b/sky/utils/admin_policy_utils.py new file mode 100644 index 00000000000..09db2fc4be8 --- /dev/null +++ b/sky/utils/admin_policy_utils.py @@ -0,0 +1,145 @@ +"""Admin policy utils.""" +import copy +import importlib +import os +import tempfile +from typing import Optional, Tuple, Union + +import colorama + +from sky import admin_policy +from sky import dag as dag_lib +from sky import exceptions +from sky import sky_logging +from sky import skypilot_config +from sky import task as task_lib +from sky.utils import common_utils +from sky.utils 
import ux_utils

+logger = sky_logging.init_logger(__name__)
+
+
+def _get_policy_cls(
+        policy: Optional[str]) -> Optional[admin_policy.AdminPolicy]:
+    """Gets the admin-defined policy class."""
+    if policy is None:
+        return None
+    try:
+        module_path, class_name = policy.rsplit('.', 1)
+        module = importlib.import_module(module_path)
+    except ImportError as e:
+        with ux_utils.print_exception_no_traceback():
+            raise ImportError(
+                f'Failed to import policy module: {policy}. '
+                'Please check if the module is installed in your Python '
+                'environment.') from e
+
+    try:
+        policy_cls = getattr(module, class_name)
+    except AttributeError as e:
+        with ux_utils.print_exception_no_traceback():
+            raise AttributeError(
+                f'Could not find {class_name} class in module {module_path}. '
+                'Please check with your policy admin for details.') from e
+
+    # Check if the class implements the AdminPolicy interface.
+    if not issubclass(policy_cls, admin_policy.AdminPolicy):
+        with ux_utils.print_exception_no_traceback():
+            raise ValueError(
+                f'Policy class {policy!r} does not implement the AdminPolicy '
+                'interface. Please check with your policy admin for details.')
+    return policy_cls
+
+
+def apply(
+    entrypoint: Union['dag_lib.Dag', 'task_lib.Task'],
+    use_mutated_config_in_current_request: bool = True,
+    request_options: Optional[admin_policy.RequestOptions] = None,
+) -> Tuple['dag_lib.Dag', skypilot_config.Config]:
+    """Applies an admin policy (if registered) to a DAG or a task.
+
+    It mutates a Dag by applying any registered admin policy and also
+    potentially updates (controlled by `use_mutated_config_in_current_request`)
+    the global SkyPilot config if there are any changes made by the policy.
+
+    Args:
+        entrypoint: The DAG or task to be mutated by the policy.
+        use_mutated_config_in_current_request: Whether to use the mutated
+            config in the current request.
+        request_options: Additional options the user passed for the current
+            request.
+
+    Returns:
+        - The new copy of the dag after applying the policy.
+        - The new copy of the skypilot config after applying the policy.
+    """
+    if isinstance(entrypoint, task_lib.Task):
+        dag = dag_lib.Dag()
+        dag.add(entrypoint)
+    else:
+        dag = entrypoint
+
+    policy = skypilot_config.get_nested(('admin_policy',), None)
+    policy_cls = _get_policy_cls(policy)
+    if policy_cls is None:
+        return dag, skypilot_config.to_dict()
+
+    logger.info(f'Applying policy: {policy}')
+    original_config = skypilot_config.to_dict()
+    config = copy.deepcopy(original_config)
+    mutated_dag = dag_lib.Dag()
+    mutated_dag.name = dag.name
+
+    mutated_config = None
+    for task in dag.tasks:
+        user_request = admin_policy.UserRequest(task, config, request_options)
+        try:
+            mutated_user_request = policy_cls.validate_and_mutate(user_request)
+        except Exception as e:  # pylint: disable=broad-except
+            with ux_utils.print_exception_no_traceback():
+                raise exceptions.UserRequestRejectedByPolicy(
+                    f'{colorama.Fore.RED}User request rejected by policy '
+                    f'{policy!r}{colorama.Fore.RESET}: '
+                    f'{common_utils.format_exception(e, use_bracket=True)}'
+                ) from e
+        if mutated_config is None:
+            mutated_config = mutated_user_request.skypilot_config
+        else:
+            if mutated_config != mutated_user_request.skypilot_config:
+                # In the case of a pipeline of tasks, the mutated config
+                # should remain the same across all tasks for now, for
+                # simplicity.
+                # TODO(zhwu): We should support per-task mutated config or
+                # allowing overriding required global config in task YAML.
+                with ux_utils.print_exception_no_traceback():
+                    raise exceptions.UserRequestRejectedByPolicy(
+                        'All tasks must have the same SkyPilot config after '
+                        'applying the policy. Please check with your policy '
+                        'admin for details.')
+        mutated_dag.add(mutated_user_request.task)
+    assert mutated_config is not None, dag
+
+    # Copy the task graph edges from the old dag into the mutated dag.
+    for u, v in dag.graph.edges:
+        u_idx = dag.tasks.index(u)
+        v_idx = dag.tasks.index(v)
+        mutated_dag.graph.add_edge(mutated_dag.tasks[u_idx],
+                                   mutated_dag.tasks[v_idx])
+
+    if (use_mutated_config_in_current_request and
+            original_config != mutated_config):
+        with tempfile.NamedTemporaryFile(
+                delete=False,
+                mode='w',
+                prefix='policy-mutated-skypilot-config-',
+                suffix='.yaml') as temp_file:
+
+            common_utils.dump_yaml(temp_file.name, dict(**mutated_config))
+        os.environ[skypilot_config.ENV_VAR_SKYPILOT_CONFIG] = temp_file.name
+        logger.debug(f'Updated SkyPilot config: {temp_file.name}')
+        # TODO(zhwu): This is not a clean way to update the SkyPilot config,
+        # because we are resetting the global context for a single DAG,
+        # which is conceptually weird.
+        importlib.reload(skypilot_config)
+
+    logger.debug(f'Mutated user request: {mutated_user_request}')
+    return mutated_dag, mutated_config
diff --git a/sky/utils/common_utils.py b/sky/utils/common_utils.py
index a9227fb4c20..dffe784cc33 100644
--- a/sky/utils/common_utils.py
+++ b/sky/utils/common_utils.py
@@ -300,7 +300,7 @@ def user_and_hostname_hash() -> str:
     return f'{getpass.getuser()}-{hostname_hash}'
 
 
-def read_yaml(path) -> Dict[str, Any]:
+def read_yaml(path: str) -> Dict[str, Any]:
     with open(path, 'r', encoding='utf-8') as f:
         config = yaml.safe_load(f)
     return config
@@ -316,12 +316,13 @@ def read_yaml_all(path: str) -> List[Dict[str, Any]]:
     return configs
 
 
-def dump_yaml(path, config) -> None:
+def dump_yaml(path: str, config: Union[List[Dict[str, Any]],
+                                       Dict[str, Any]]) -> None:
     with open(path, 'w', encoding='utf-8') as f:
         f.write(dump_yaml_str(config))
 
 
-def dump_yaml_str(config):
+def dump_yaml_str(config: Union[List[Dict[str, Any]], Dict[str, Any]]) -> str:
     # https://github.com/yaml/pyyaml/issues/127
     class LineBreakDumper(yaml.SafeDumper):
 
@@ -331,9 +332,9 @@ def write_line_break(self, data=None):
             super().write_line_break()
 
     if isinstance(config, list):
-        dump_func = yaml.dump_all
+        dump_func = yaml.dump_all  # type: ignore
     else:
-        dump_func = yaml.dump
+        dump_func = yaml.dump  # type: ignore
     return dump_func(config,
                      Dumper=LineBreakDumper,
                      sort_keys=False,
diff --git a/sky/utils/controller_utils.py b/sky/utils/controller_utils.py
index 866aaf1ee1a..118f9a2b718 100644
--- a/sky/utils/controller_utils.py
+++ b/sky/utils/controller_utils.py
@@ -44,8 +44,12 @@
     '{controller_type}.controller.resources is a valid resources spec. '
     'Details:\n  {err}')
 
-# The placeholder for the local skypilot config path in file mounts.
-LOCAL_SKYPILOT_CONFIG_PATH_PLACEHOLDER = 'skypilot:local_skypilot_config_path'
+# The suffix for the local skypilot config path of a job/service in file
+# mounts. It tells the controller logic to update the config with specific
+# settings, e.g., removing the ssh_proxy_command when a job/service is
+# launched in the same cloud as the controller.
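+# The suffix is matched by replace_skypilot_config_path_in_file_mounts()
+# below, which reads the dumped user config and swaps in an updated copy
+# as the mount source for the launched job/service.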
+_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX = ( + '__skypilot:local_skypilot_config_path.yaml') @dataclasses.dataclass @@ -350,8 +354,21 @@ def download_and_stream_latest_job_log( def shared_controller_vars_to_fill( - controller: Controllers, - remote_user_config_path: str) -> Dict[str, str]: + controller: Controllers, remote_user_config_path: str, + local_user_config: Dict[str, Any]) -> Dict[str, str]: + if not local_user_config: + local_user_config_path = None + else: + # Remove admin_policy from local_user_config so that it is not applied + # again on the controller. This is required since admin_policy is not + # installed on the controller. + local_user_config.pop('admin_policy', None) + with tempfile.NamedTemporaryFile( + delete=False, + suffix=_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX) as temp_file: + common_utils.dump_yaml(temp_file.name, dict(**local_user_config)) + local_user_config_path = temp_file.name + vars_to_fill: Dict[str, Any] = { 'cloud_dependencies_installation_commands': _get_cloud_dependencies_installation_commands(controller), @@ -360,6 +377,7 @@ def shared_controller_vars_to_fill( # accessed. 'sky_activate_python_env': constants.ACTIVATE_SKY_REMOTE_PYTHON_ENV, 'sky_python_cmd': constants.SKY_PYTHON_CMD, + 'local_user_config_path': local_user_config_path, } env_vars: Dict[str, str] = { env.value: '1' for env in env_options.Options if env.get() @@ -481,7 +499,8 @@ def get_controller_resources( def _setup_proxy_command_on_controller( - controller_launched_cloud: 'clouds.Cloud') -> Dict[str, Any]: + controller_launched_cloud: 'clouds.Cloud', + user_config: Dict[str, Any]) -> skypilot_config.Config: """Sets up proxy command on the controller. This function should be called on the controller (remote cluster), which @@ -515,21 +534,20 @@ def _setup_proxy_command_on_controller( # (or name). It may not be a sufficient check (as it's always # possible that peering is not set up), but it may catch some # obvious errors. + config = skypilot_config.Config.from_dict(user_config) proxy_command_key = (str(controller_launched_cloud).lower(), 'ssh_proxy_command') - ssh_proxy_command = skypilot_config.get_nested(proxy_command_key, None) - config_dict = skypilot_config.to_dict() + ssh_proxy_command = config.get_nested(proxy_command_key, None) if isinstance(ssh_proxy_command, str): - config_dict = skypilot_config.set_nested(proxy_command_key, None) + config.set_nested(proxy_command_key, None) elif isinstance(ssh_proxy_command, dict): # Instead of removing the key, we set the value to empty string # so that the controller will only try the regions specified by # the keys. ssh_proxy_command = {k: None for k in ssh_proxy_command} - config_dict = skypilot_config.set_nested(proxy_command_key, - ssh_proxy_command) + config.set_nested(proxy_command_key, ssh_proxy_command) - return config_dict + return config def replace_skypilot_config_path_in_file_mounts( @@ -543,25 +561,20 @@ def replace_skypilot_config_path_in_file_mounts( if file_mounts is None: return replaced = False - to_replace = True - with tempfile.NamedTemporaryFile('w', delete=False) as f: - if skypilot_config.loaded(): - new_skypilot_config = _setup_proxy_command_on_controller(cloud) - common_utils.dump_yaml(f.name, new_skypilot_config) - to_replace = True - else: - # Empty config. Remove the placeholder below. 
- to_replace = False - for remote_path, local_path in list(file_mounts.items()): - if local_path == LOCAL_SKYPILOT_CONFIG_PATH_PLACEHOLDER: - if to_replace: - file_mounts[remote_path] = f.name - replaced = True - else: - del file_mounts[remote_path] + for remote_path, local_path in list(file_mounts.items()): + if local_path is None: + del file_mounts[remote_path] + continue + if local_path.endswith(_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX): + with tempfile.NamedTemporaryFile('w', delete=False) as f: + user_config = common_utils.read_yaml(local_path) + config = _setup_proxy_command_on_controller(cloud, user_config) + common_utils.dump_yaml(f.name, dict(**config)) + file_mounts[remote_path] = f.name + replaced = True if replaced: - logger.debug(f'Replaced {LOCAL_SKYPILOT_CONFIG_PATH_PLACEHOLDER} with ' - f'the real path in file mounts: {file_mounts}') + logger.debug(f'Replaced {_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX} ' + f'with the real path in file mounts: {file_mounts}') def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', diff --git a/sky/utils/dag_utils.py b/sky/utils/dag_utils.py index 7a4fe90e7fb..e6b491c3168 100644 --- a/sky/utils/dag_utils.py +++ b/sky/utils/dag_utils.py @@ -36,30 +36,33 @@ def convert_entrypoint_to_dag(entrypoint: Any) -> 'dag_lib.Dag': - """Convert the entrypoint to a sky.Dag. + """Converts the entrypoint to a sky.Dag and applies the policy. Raises TypeError if 'entrypoint' is not a 'sky.Task' or 'sky.Dag'. """ # Not suppressing stacktrace: when calling this via API user may want to # see their own program in the stacktrace. Our CLI impl would not trigger # these errors. + converted_dag: 'dag_lib.Dag' if isinstance(entrypoint, str): with ux_utils.print_exception_no_traceback(): raise TypeError(_ENTRYPOINT_STRING_AS_DAG_MESSAGE) elif isinstance(entrypoint, dag_lib.Dag): - return copy.deepcopy(entrypoint) + converted_dag = copy.deepcopy(entrypoint) elif isinstance(entrypoint, task_lib.Task): entrypoint = copy.deepcopy(entrypoint) with dag_lib.Dag() as dag: dag.add(entrypoint) dag.name = entrypoint.name - return dag + converted_dag = dag else: with ux_utils.print_exception_no_traceback(): raise TypeError( 'Expected a sky.Task or sky.Dag but received argument of type: ' f'{type(entrypoint)}') + return converted_dag + def load_chain_dag_from_yaml( path: str, diff --git a/sky/utils/schemas.py b/sky/utils/schemas.py index 01dc14f617c..a50c400b805 100644 --- a/sky/utils/schemas.py +++ b/sky/utils/schemas.py @@ -848,6 +848,13 @@ def get_config_schema(): }, } + admin_policy_schema = { + 'type': 'string', + # Check regex to be a valid python module path + 'pattern': (r'^[a-zA-Z_][a-zA-Z0-9_]*' + r'(\.[a-zA-Z_][a-zA-Z0-9_]*)+$'), + } + allowed_clouds = { # A list of cloud names that are allowed to be used 'type': 'array', @@ -905,6 +912,7 @@ def get_config_schema(): 'spot': controller_resources_schema, 'serve': controller_resources_schema, 'allowed_clouds': allowed_clouds, + 'admin_policy': admin_policy_schema, 'docker': docker_configs, 'nvidia_gpus': gpu_configs, **cloud_configs, diff --git a/tests/test_config.py b/tests/test_config.py index 0cae5f9befb..5789214dc61 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -1,4 +1,5 @@ import copy +import importlib import pathlib import textwrap @@ -21,19 +22,19 @@ def _reload_config() -> None: - skypilot_config._dict = None + skypilot_config._dict = skypilot_config.Config() + skypilot_config._loaded_config_path = None skypilot_config._try_load_config() def _check_empty_config() -> None: """Check that the 
config is empty.""" - assert not skypilot_config.loaded() + assert not skypilot_config.loaded(), (skypilot_config._dict, + skypilot_config._loaded_config_path) assert skypilot_config.get_nested( ('aws', 'ssh_proxy_command'), None) is None assert skypilot_config.get_nested(('aws', 'ssh_proxy_command'), 'default') == 'default' - with pytest.raises(RuntimeError): - skypilot_config.set_nested(('aws', 'ssh_proxy_command'), 'value') def _create_config_file(config_file_path: pathlib.Path) -> None: @@ -98,6 +99,22 @@ def _create_task_yaml_file(task_file_path: pathlib.Path) -> None: """)) +def test_nested_config(monkeypatch) -> None: + """Test that the nested config works.""" + config = skypilot_config.Config() + config.set_nested(('aws', 'ssh_proxy_command'), 'value') + assert config == {'aws': {'ssh_proxy_command': 'value'}} + + assert config.get_nested(('admin_policy',), 'default') == 'default' + config.set_nested(('aws', 'use_internal_ips'), True) + assert config == { + 'aws': { + 'ssh_proxy_command': 'value', + 'use_internal_ips': True + } + } + + def test_no_config(monkeypatch) -> None: """Test that the config is not loaded if the config file does not exist.""" monkeypatch.setattr(skypilot_config, 'CONFIG_PATH', '/tmp/does_not_exist') diff --git a/tests/unit_tests/test_admin_policy.py b/tests/unit_tests/test_admin_policy.py new file mode 100644 index 00000000000..96b666493d3 --- /dev/null +++ b/tests/unit_tests/test_admin_policy.py @@ -0,0 +1,172 @@ +import importlib +import os +import sys +from typing import Optional, Tuple +from unittest import mock + +import pytest + +import sky +from sky import exceptions +from sky import sky_logging +from sky import skypilot_config +from sky.utils import admin_policy_utils + +logger = sky_logging.init_logger(__name__) + +POLICY_PATH = os.path.join(os.path.dirname(os.path.dirname(sky.__file__)), + 'examples', 'admin_policy') + + +@pytest.fixture +def add_example_policy_paths(): + # Add to path to be able to import + sys.path.append(os.path.join(POLICY_PATH, 'example_policy')) + + +@pytest.fixture +def task(): + return sky.Task.from_yaml(os.path.join(POLICY_PATH, 'task.yaml')) + + +def _load_task_and_apply_policy( + task: sky.Task, + config_path: str, + idle_minutes_to_autostop: Optional[int] = None, +) -> Tuple[sky.Dag, skypilot_config.Config]: + os.environ['SKYPILOT_CONFIG'] = config_path + importlib.reload(skypilot_config) + return admin_policy_utils.apply( + task, + request_options=sky.admin_policy.RequestOptions( + cluster_name='test', + idle_minutes_to_autostop=idle_minutes_to_autostop, + down=False, + dryrun=False, + )) + + +def test_use_spot_for_all_gpus_policy(add_example_policy_paths, task): + dag, _ = _load_task_and_apply_policy( + task, os.path.join(POLICY_PATH, 'use_spot_for_gpu.yaml')) + assert not any(r.use_spot for r in dag.tasks[0].resources), ( + 'use_spot should be False as GPU is not specified') + + task.set_resources([ + sky.Resources(cloud='gcp', accelerators={'A100': 1}), + sky.Resources(accelerators={'L4': 1}) + ]) + dag, _ = _load_task_and_apply_policy( + task, os.path.join(POLICY_PATH, 'use_spot_for_gpu.yaml')) + assert all( + r.use_spot for r in dag.tasks[0].resources), 'use_spot should be True' + + task.set_resources([ + sky.Resources(accelerators={'A100': 1}), + sky.Resources(accelerators={'L4': 1}, use_spot=True), + sky.Resources(cpus='2+'), + ]) + dag, _ = _load_task_and_apply_policy( + task, os.path.join(POLICY_PATH, 'use_spot_for_gpu.yaml')) + for r in dag.tasks[0].resources: + if r.accelerators: + assert r.use_spot, 
'use_spot should be True' + else: + assert not r.use_spot, 'use_spot should be False' + + +def test_add_labels_policy(add_example_policy_paths, task): + dag, _ = _load_task_and_apply_policy( + task, os.path.join(POLICY_PATH, 'add_labels.yaml')) + assert 'app' in skypilot_config.get_nested( + ('kubernetes', 'custom_metadata', 'labels'), + {}), ('label should be set') + + +def test_reject_all_policy(add_example_policy_paths, task): + with pytest.raises(exceptions.UserRequestRejectedByPolicy, + match='Reject all policy'): + _load_task_and_apply_policy( + task, os.path.join(POLICY_PATH, 'reject_all.yaml')) + + +def test_enforce_autostop_policy(add_example_policy_paths, task): + + def _gen_cluster_record(status: sky.ClusterStatus, autostop: int) -> dict: + return { + 'name': 'test', + 'status': status, + 'autostop': autostop, + } + + # Cluster does not exist + with mock.patch('sky.status', return_value=[]): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=10) + + with pytest.raises(exceptions.UserRequestRejectedByPolicy, + match='Autostop/down must be set'): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=None) + + # Cluster is stopped + with mock.patch( + 'sky.status', + return_value=[_gen_cluster_record(sky.ClusterStatus.STOPPED, 10)]): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=10) + with pytest.raises(exceptions.UserRequestRejectedByPolicy, + match='Autostop/down must be set'): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=None) + + # Cluster is running but autostop is not set + with mock.patch( + 'sky.status', + return_value=[_gen_cluster_record(sky.ClusterStatus.UP, -1)]): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=10) + with pytest.raises(exceptions.UserRequestRejectedByPolicy, + match='Autostop/down must be set'): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=None) + + # Cluster is init but autostop is not set + with mock.patch( + 'sky.status', + return_value=[_gen_cluster_record(sky.ClusterStatus.INIT, -1)]): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=10) + with pytest.raises(exceptions.UserRequestRejectedByPolicy, + match='Autostop/down must be set'): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=None) + + # Cluster is running and autostop is set + with mock.patch( + 'sky.status', + return_value=[_gen_cluster_record(sky.ClusterStatus.UP, 10)]): + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=10) + _load_task_and_apply_policy(task, + os.path.join(POLICY_PATH, + 'enforce_autostop.yaml'), + idle_minutes_to_autostop=None) diff --git a/tests/unit_tests/test_backend_utils.py b/tests/unit_tests/test_backend_utils.py index cb1b83f1999..5da4410abb9 100644 --- a/tests/unit_tests/test_backend_utils.py +++ b/tests/unit_tests/test_backend_utils.py @@ -1,34 +1,31 @@ +import os import pathlib -from typing import Dict -from unittest.mock import Mock -from unittest.mock import patch - -import pytest +from unittest import mock from sky import clouds from 
sky import skypilot_config from sky.backends import backend_utils from sky.resources import Resources -from sky.resources import resources_utils -@patch.object(skypilot_config, 'CONFIG_PATH', - './tests/test_yamls/test_aws_config.yaml') -@patch.object(skypilot_config, '_dict', None) -@patch.object(skypilot_config, '_loaded_config_path', None) -@patch('sky.clouds.service_catalog.instance_type_exists', return_value=True) -@patch('sky.clouds.service_catalog.get_accelerators_from_instance_type', - return_value={'fake-acc': 2}) -@patch('sky.clouds.service_catalog.get_image_id_from_tag', - return_value='fake-image') -@patch.object(clouds.aws, 'DEFAULT_SECURITY_GROUP_NAME', 'fake-default-sg') -@patch('sky.check.get_cloud_credential_file_mounts', - return_value='~/.aws/credentials') -@patch('sky.backends.backend_utils._get_yaml_path_from_cluster_name', - return_value='/tmp/fake/path') -@patch('sky.utils.common_utils.fill_template') +# Set env var to test config file. +@mock.patch.object(skypilot_config, '_dict', None) +@mock.patch.object(skypilot_config, '_loaded_config_path', None) +@mock.patch('sky.clouds.service_catalog.instance_type_exists', + return_value=True) +@mock.patch('sky.clouds.service_catalog.get_accelerators_from_instance_type', + return_value={'fake-acc': 2}) +@mock.patch('sky.clouds.service_catalog.get_image_id_from_tag', + return_value='fake-image') +@mock.patch.object(clouds.aws, 'DEFAULT_SECURITY_GROUP_NAME', 'fake-default-sg') +@mock.patch('sky.check.get_cloud_credential_file_mounts', + return_value='~/.aws/credentials') +@mock.patch('sky.backends.backend_utils._get_yaml_path_from_cluster_name', + return_value='/tmp/fake/path') +@mock.patch('sky.utils.common_utils.fill_template') def test_write_cluster_config_w_remote_identity(mock_fill_template, *mocks) -> None: + os.environ['SKYPILOT_CONFIG'] = './tests/test_yamls/test_aws_config.yaml' skypilot_config._try_load_config() cloud = clouds.AWS() diff --git a/tests/unit_tests/test_common_utils.py b/tests/unit_tests/test_common_utils.py index f38e14069e5..38c31263baa 100644 --- a/tests/unit_tests/test_common_utils.py +++ b/tests/unit_tests/test_common_utils.py @@ -1,4 +1,4 @@ -from unittest.mock import patch +from unittest import mock import pytest @@ -33,18 +33,18 @@ def test_check_when_none(self): class TestMakeClusterNameOnCloud: - @patch('sky.utils.common_utils.get_user_hash') + @mock.patch('sky.utils.common_utils.get_user_hash') def test_make(self, mock_get_user_hash): mock_get_user_hash.return_value = MOCKED_USER_HASH assert "lora-ab12" == common_utils.make_cluster_name_on_cloud("lora") - @patch('sky.utils.common_utils.get_user_hash') + @mock.patch('sky.utils.common_utils.get_user_hash') def test_make_with_hyphen(self, mock_get_user_hash): mock_get_user_hash.return_value = MOCKED_USER_HASH assert "seed-1-ab12" == common_utils.make_cluster_name_on_cloud( "seed-1") - @patch('sky.utils.common_utils.get_user_hash') + @mock.patch('sky.utils.common_utils.get_user_hash') def test_make_with_characters_to_transform(self, mock_get_user_hash): mock_get_user_hash.return_value = MOCKED_USER_HASH assert "cuda-11-8-ab12" == common_utils.make_cluster_name_on_cloud( diff --git a/tests/unit_tests/test_resources.py b/tests/unit_tests/test_resources.py index 70da0532e9b..01b83132a1b 100644 --- a/tests/unit_tests/test_resources.py +++ b/tests/unit_tests/test_resources.py @@ -1,6 +1,7 @@ +import importlib +import os from typing import Dict -from unittest.mock import Mock -from unittest.mock import patch +from unittest import mock import pytest @@ 
-23,12 +24,12 @@ def test_get_reservations_available_resources(): - mock = Mock() - r = Resources(cloud=mock, instance_type="instance_type") + mock_cloud = mock.Mock() + r = Resources(cloud=mock_cloud, instance_type="instance_type") r._region = "region" r._zone = "zone" r.get_reservations_available_resources() - mock.get_reservations_available_resources.assert_called_once_with( + mock_cloud.get_reservations_available_resources.assert_called_once_with( "instance_type", "region", "zone", set()) @@ -91,18 +92,16 @@ def test_kubernetes_labels_resources(): _run_label_test(allowed_labels, invalid_labels, cloud) -@patch.object(skypilot_config, 'CONFIG_PATH', - './tests/test_yamls/test_aws_config.yaml') -@patch.object(skypilot_config, '_dict', None) -@patch.object(skypilot_config, '_loaded_config_path', None) -@patch('sky.clouds.service_catalog.instance_type_exists', return_value=True) -@patch('sky.clouds.service_catalog.get_accelerators_from_instance_type', - return_value={'fake-acc': 2}) -@patch('sky.clouds.service_catalog.get_image_id_from_tag', - return_value='fake-image') -@patch.object(clouds.aws, 'DEFAULT_SECURITY_GROUP_NAME', 'fake-default-sg') +@mock.patch('sky.clouds.service_catalog.instance_type_exists', + return_value=True) +@mock.patch('sky.clouds.service_catalog.get_accelerators_from_instance_type', + return_value={'fake-acc': 2}) +@mock.patch('sky.clouds.service_catalog.get_image_id_from_tag', + return_value='fake-image') +@mock.patch.object(clouds.aws, 'DEFAULT_SECURITY_GROUP_NAME', 'fake-default-sg') def test_aws_make_deploy_variables(*mocks) -> None: - skypilot_config._try_load_config() + os.environ['SKYPILOT_CONFIG'] = './tests/test_yamls/test_aws_config.yaml' + importlib.reload(skypilot_config) cloud = clouds.AWS() cluster_name = resources_utils.ClusterName(display_name='display', From e13c39104cc3fca974e2afa207bcca24817f4e17 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Tue, 24 Sep 2024 23:20:27 -0700 Subject: [PATCH 10/93] [k8s] Autodown Serve controller on Kubernetes (#3984) * Add autodown for skyserve on k8s * lint --- sky/backends/cloud_vm_ray_backend.py | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 191a09438aa..0831bad65fb 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -4147,11 +4147,21 @@ def set_autostop(self, idle_minutes_to_autostop >= 0): # We should hit this code path only for the controllers on # Kubernetes and RunPod clusters. - assert (controller_utils.Controllers.from_name( - handle.cluster_name) is not None), handle.cluster_name - logger.info('Auto-stop is not supported for Kubernetes ' - 'and RunPod clusters. Skipping.') - return + controller = controller_utils.Controllers.from_name( + handle.cluster_name) + assert (controller is not None), handle.cluster_name + if (controller + == controller_utils.Controllers.SKY_SERVE_CONTROLLER and + isinstance(handle.launched_resources.cloud, + clouds.Kubernetes)): + # For SkyServe controllers on Kubernetes: override autostop + # behavior to force autodown (instead of no-op) + # to avoid dangling controllers. + down = True + else: + logger.info('Auto-stop is not supported for Kubernetes ' + 'and RunPod clusters. 
Skipping.') + return # Check if we're stopping spot assert (handle.launched_resources is not None and From be92944e77a6be8d91602afa10b4705bfaded2ca Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 25 Sep 2024 09:25:41 -0700 Subject: [PATCH 11/93] [Tests] Add missing changes from #3966 for fast service update test (#3976) Use wget instead of git clone for faster downloading --- tests/skyserve/update/bump_version_after.yaml | 3 +-- tests/skyserve/update/bump_version_before.yaml | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/tests/skyserve/update/bump_version_after.yaml b/tests/skyserve/update/bump_version_after.yaml index 8709c8a9a90..6e845f54b9e 100644 --- a/tests/skyserve/update/bump_version_after.yaml +++ b/tests/skyserve/update/bump_version_after.yaml @@ -20,9 +20,8 @@ resources: cpus: 2+ setup: | - git clone https://github.com/skypilot-org/skypilot.git + wget https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/serve/http_server/server.py run: | - cd skypilot/examples/serve/http_server python3 server.py --port 8081 \ No newline at end of file diff --git a/tests/skyserve/update/bump_version_before.yaml b/tests/skyserve/update/bump_version_before.yaml index c38c4288538..c9fd957e41a 100644 --- a/tests/skyserve/update/bump_version_before.yaml +++ b/tests/skyserve/update/bump_version_before.yaml @@ -20,9 +20,8 @@ resources: cpus: 2+ setup: | - git clone https://github.com/skypilot-org/skypilot.git + wget https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/serve/http_server/server.py run: | - cd skypilot/examples/serve/http_server python3 server.py --port 8081 \ No newline at end of file From 026886d73b9d34a1d3db70a54a5691471e742120 Mon Sep 17 00:00:00 2001 From: Andrew Aikawa Date: Wed, 25 Sep 2024 14:36:59 -0700 Subject: [PATCH 12/93] [Paperspace] add A4000, P4000, GPU+ (#3991) add A4000, P4000, GPU+ --- sky/provision/paperspace/constants.py | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/sky/provision/paperspace/constants.py b/sky/provision/paperspace/constants.py index 8c6084d80b7..0acd659d663 100644 --- a/sky/provision/paperspace/constants.py +++ b/sky/provision/paperspace/constants.py @@ -19,6 +19,12 @@ 'V100-32Gx2': 'twnlo3zj', 'V100-32G': 'twnlo3zj', 'V100': 'twnlo3zj', + 'GPU+': 'twnlo3zj', + 'P4000': 'twnlo3zj', + 'P4000x2': 'twnlo3zj', + 'A4000': 'twnlo3zj', + 'A4000x2': 'twnlo3zj', + 'A4000x4': 'twnlo3zj', **CPU_INSTANCES_TEMPLATEID } NVLINK_INSTANCES = { From 82aa0c3d9cc8cbf6039c18d9e5b01a9256040a33 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Wed, 25 Sep 2024 15:51:25 -0700 Subject: [PATCH 13/93] [Docs] Fix highlighting in code block (#3994) Fix highlighting in code block Fixes #3993 --- docs/source/serving/update.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/serving/update.rst b/docs/source/serving/update.rst index 80fdd480c5e..2e34036dc69 100644 --- a/docs/source/serving/update.rst +++ b/docs/source/serving/update.rst @@ -69,7 +69,7 @@ the task yaml ``examples/serve/http_server/task.yaml``, by changing the ``replic field: .. 
code-block:: yaml
-    :emphasize-lines: 10
+    :emphasize-lines: 6
 
     # examples/serve/http_server/task.yaml
     service:

From 396e0fc7fba68681e729a4c64c21f904a465e07d Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Wed, 25 Sep 2024 17:38:16 -0700
Subject: [PATCH 14/93] [LLM] Llama 3.2 guide (#3990)

* Add llama 3.2 example

* update

* length

* fix

* update

* update cpus limit

* Use 11B instead for better performance

* update

* update

* Add link

* Fix reference

* Fix vllm version

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang

* Fix title

* news

* no need to pin transformers

* remove cover photo for now

---------

Co-authored-by: Zongheng Yang
---
 README.md                                     |   5 +-
 docs/source/_gallery_original/index.rst       |   1 +
 .../_gallery_original/llms/llama-3_2.md       |   1 +
 docs/source/_static/custom.js                 |   1 +
 llm/llama-3/README.md                         |   2 -
 llm/llama-3_2/README.md                       | 354 ++++++++++++++++++
 llm/llama-3_2/llama3_2-vision-11b.yaml        |  95 +++++
 llm/llama-3_2/llama3_2.yaml                   |  94 +++++
 8 files changed, 549 insertions(+), 4 deletions(-)
 create mode 120000 docs/source/_gallery_original/llms/llama-3_2.md
 create mode 100644 llm/llama-3_2/README.md
 create mode 100644 llm/llama-3_2/llama3_2-vision-11b.yaml
 create mode 100644 llm/llama-3_2/llama3_2.yaml

diff --git a/README.md b/README.md
index e3d53f3d5ec..f887c6d690f 100644
--- a/README.md
+++ b/README.md
@@ -26,10 +26,10 @@
 ----
 :fire: *News* :fire:
+- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/)
 - [Sep, 2024] Run and deploy [Pixtral](./llm/pixtral), the first open-source multimodal model from Mistral AI.
 - [Jul, 2024] [Finetune](./llm/llama-3_1-finetuning/) and [serve](./llm/llama-3_1/) **Llama 3.1** on your infra
 - [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/)
-- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
 - [Apr, 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/)
 - [Apr, 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/)
 - [Feb, 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/)
@@ -41,7 +41,8 @@
 <details>
<summary>Archived</summary>
-
+
+- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
 - [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
 - [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
 - [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
diff --git a/docs/source/_gallery_original/index.rst b/docs/source/_gallery_original/index.rst
index 0f465446b98..8613bfb649d 100644
--- a/docs/source/_gallery_original/index.rst
+++ b/docs/source/_gallery_original/index.rst
@@ -41,6 +41,7 @@ Contents
    Llama-2 (Meta)
    Llama-3 (Meta)
    Llama-3.1 (Meta)
+   Vision Llama-3.2 (Meta)
    Qwen (Alibaba)
    CodeLlama (Meta)
    Gemma (Google)
diff --git a/docs/source/_gallery_original/llms/llama-3_2.md b/docs/source/_gallery_original/llms/llama-3_2.md
new file mode 120000
index 00000000000..2ec005dcc0d
--- /dev/null
+++ b/docs/source/_gallery_original/llms/llama-3_2.md
@@ -0,0 +1 @@
+../../../../llm/llama-3_2/README.md
\ No newline at end of file
diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js
index 0a12994f5bd..b10d157ed00 100644
--- a/docs/source/_static/custom.js
+++ b/docs/source/_static/custom.js
@@ -31,6 +31,7 @@ document.addEventListener('DOMContentLoaded', () => {
         { selector: '.toctree-l1 > a', text: 'Pixtral (Mistral AI)' },
         { selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
         { selector: '.toctree-l1 > a', text: 'Reserved, Capacity Blocks, DWS' },
+        { selector: '.toctree-l1 > a', text: 'Llama-3.2 (Meta)' },
     ];
     newItems.forEach(({ selector, text }) => {
         document.querySelectorAll(selector).forEach((el) => {
diff --git a/llm/llama-3/README.md b/llm/llama-3/README.md
index ef19d94b5c0..ae5c10dc62b 100644
--- a/llm/llama-3/README.md
+++ b/llm/llama-3/README.md
@@ -15,8 +15,6 @@
 
-
-
 ## Why use SkyPilot vs. commercial hosted solutions?
 
 * No lock-in: run on any supported cloud - AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI
diff --git a/llm/llama-3_2/README.md b/llm/llama-3_2/README.md
new file mode 100644
index 00000000000..8e4b9820a88
--- /dev/null
+++ b/llm/llama-3_2/README.md
@@ -0,0 +1,354 @@
+
+
+# Point, Launch, and Serve Vision Llama 3.2 on Kubernetes or Any Cloud
+
+
+
+
+The [Llama 3.2](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) family was released by Meta on Sep 25, 2024. It includes not only the latest improved (and smaller) LLMs for chat, but also multimodal vision-language models. Let's _point and launch_ it with SkyPilot.
+
+* [Llama 3.2 release](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)
+
+
+
+## Why use SkyPilot?
+
+* **Point, launch, and serve**: simply point to the cloud/Kubernetes cluster you have access to, and launch the model there with a single command.
+* No lock-in: run on any supported cloud - AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI
+* Everything stays in your cloud account (your VMs & buckets)
+* No one else sees your chat history
+* Pay absolute minimum - no managed solution markups
+* Freely choose your own model size, GPU type, number of GPUs, etc, based on scale and budget.
+
+…and you get all of this with 1 click - let SkyPilot automate the infra.
+
+
+## Prerequisites
+
+- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/) and request access to the models [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) and [meta-llama/Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision).
+- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
+- Check that `sky check` shows clouds or Kubernetes are enabled.
+
+## SkyPilot YAML
+
+<details>
+<summary>Click to see the full recipe YAML</summary>
+
+```yaml
+envs:
+  MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
+  # MODEL_NAME: meta-llama/Llama-3.2-11B-Vision-Instruct
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+
+service:
+  replicas: 2
+  # An actual request for readiness probe.
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: $MODEL_NAME
+      messages:
+        - role: user
+          content: Hello! What is your name?
+      max_tokens: 1
+
+resources:
+  accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
+  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for the smaller models.
+  cpus: 8+
+  disk_size: 512  # Ensure model checkpoints can fit.
+  disk_tier: best
+  ports: 8081  # Expose to internet traffic.
+
+setup: |
+  # Install huggingface transformers for Llama 3.2 support
+  pip install git+https://github.com/huggingface/transformers.git@f0eabf6c7da2afbe8425546c092fa3722f9f219e
+  pip install vllm==0.6.2
+
+run: |
+  echo 'Starting vllm api server...'
+
+  vllm serve $MODEL_NAME \
+    --port 8081 \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 4096 \
+    2>&1
+
+```
+
+</details>
+
+You can also get the full YAML file [here](https://github.com/skypilot-org/skypilot/blob/master/llm/llama-3_2/llama3_2.yaml).
+
+## Point and Launch Llama 3.2
+
+Launch a single instance to serve Llama 3.2 on your infra:
+```console
+$ HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
+```
+
+```console
+...
+------------------------------------------------------------------------------------------------------------------
+ CLOUD        INSTANCE                       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
+------------------------------------------------------------------------------------------------------------------
+ Kubernetes   4CPU--16GB--1L4                4       16        L4:1           kubernetes      0.00       ✔
+ RunPod       1x_L4_SECURE                   4       24        L4:1           CA              0.44
+ GCP          g2-standard-4                  4       16        L4:1           us-east4-a      0.70
+ AWS          g6.xlarge                      4       16        L4:1           us-east-1       0.80
+ AWS          g5.xlarge                      4       16        A10G:1         us-east-1       1.01
+ RunPod       1x_L40_SECURE                  16      48        L40:1          CA              1.14
+ Fluidstack   L40_48GB::1                    32      60        L40:1          CANADA          1.15
+ AWS          g6e.xlarge                     4       32        L40S:1         us-east-1       1.86
+ Cudo         sapphire-rapids-h100_1x4v8gb   4       8         H100:1         ca-montreal-3   2.86
+ Fluidstack   H100_PCIE_80GB::1              28      180       H100:1         CANADA          2.89
+ Azure        Standard_NV36ads_A10_v5        36      440       A10:1          eastus          3.20
+ GCP          a2-highgpu-1g                  12      85        A100:1         us-central1-a   3.67
+ RunPod       1x_H100_SECURE                 16      80        H100:1         CA              4.49
+ Azure        Standard_NC40ads_H100_v5       40      320       H100:1         eastus          6.98
+------------------------------------------------------------------------------------------------------------------
+```
+
+
+Wait until the model is ready (this can take 10+ minutes).
+
+🎉 **Congratulations!** 🎉 You have now launched the Llama 3.2 Instruct LLM on your infra.
+
+### Chat with Llama 3.2 with the OpenAI API
+
+To curl `/v1/chat/completions`:
+```console
+ENDPOINT=$(sky status --endpoint 8081 llama3_2)
+
+curl http://$ENDPOINT/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Llama-3.2-3B-Instruct",
+    "messages": [
+      {
+        "role": "system",
+        "content": "You are a helpful assistant."
+      },
+      {
+        "role": "user",
+        "content": "Who are you?"
+      }
+    ]
+  }' | jq .
+```
+Example outputs:
+```console
+{
+  "id": "chat-e7b6d2a2d2934bcab169f82812601baf",
+  "object": "chat.completion",
+  "created": 1727291780,
+  "model": "meta-llama/Llama-3.2-3B-Instruct",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "I'm an artificial intelligence model known as Llama. Llama stands for \"Large Language Model Meta AI.\"",
+        "tool_calls": []
+      },
+      "logprobs": null,
+      "finish_reason": "stop",
+      "stop_reason": null
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 45,
+    "total_tokens": 68,
+    "completion_tokens": 23
+  },
+  "prompt_logprobs": null
+}
+```
+
+To stop the instance:
+```console
+sky stop llama3_2
+```
+
+To shut down all resources:
+```console
+sky down llama3_2
+```
+
+## Point and Launch Vision Llama 3.2
+
+Let's launch the vision model now! The multimodal capability of Llama 3.2 opens up many new use cases. We will use the 11B model here.
+
+```console
+$ HF_TOKEN=xxx sky launch llama3_2-vision-11b.yaml -c llama3_2-vision --env HF_TOKEN
+```
+
+```console
+------------------------------------------------------------------------------------------------------------------
+ CLOUD        INSTANCE                       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
+------------------------------------------------------------------------------------------------------------------
+ Kubernetes   2CPU--8GB--1H100               2       8         H100:1         kubernetes      0.00       ✔
+ RunPod       1x_L40_SECURE                  16      48        L40:1          CA              1.14
+ Fluidstack   L40_48GB::1                    32      60        L40:1          CANADA          1.15
+ AWS          g6e.xlarge                     4       32        L40S:1         us-east-1       1.86
+ RunPod       1x_A100-80GB_SECURE            8       80        A100-80GB:1    CA              1.99
+ Cudo         sapphire-rapids-h100_1x2v4gb   2       4         H100:1         ca-montreal-3   2.83
+ Fluidstack   H100_PCIE_80GB::1              28      180       H100:1         CANADA          2.89
+ GCP          a2-highgpu-1g                  12      85        A100:1         us-central1-a   3.67
+ Azure        Standard_NC24ads_A100_v4       24      220       A100-80GB:1    eastus          3.67
+ RunPod       1x_H100_SECURE                 16      80        H100:1         CA              4.49
+ GCP          a2-ultragpu-1g                 12      170       A100-80GB:1    us-central1-a   5.03
+ Azure        Standard_NC40ads_H100_v5       40      320       H100:1         eastus          6.98
+------------------------------------------------------------------------------------------------------------------
+```
+
+
+### Chat with Vision Llama 3.2
+
+```console
+ENDPOINT=$(sky status --endpoint 8081 llama3_2-vision)
+
+curl http://$ENDPOINT/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer token' \
+  --data '{
+    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
+    "messages": [
+    {
+        "role": "user",
+        "content": [
+            {"type" : "text", "text": "Turn this logo into ASCII art."},
+            {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
+        ]
+    }],
+    "max_tokens": 1024
+  }' | jq .
+```
+
+Example output (parsed):
+
+1. Output 1
+```console
+-------------
+-           -
+-  -        -
+-        -  -
+-           -
+-------------
+```
+
+2. Output 2
+```
+    ^_________
+   /          \\
+  /            \\
+ /______________\\
+ |              |
+ |              |
+ |_______________|
+  \\            /
+   \\          /
+    \\________/
+```
+Raw output + +```console +{ + "id": "chat-c341b8a0b40543918f3bb2fef68b0952", + "object": "chat.completion", + "created": 1727295337, + "model": "meta-llama/Llama-3.2-11B-Vision-Instruct", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure, here is the logo in ASCII art:\n\n------------- \n- - \n- - - \n- - - \n- - \n------------- \n\nNote that this is a very simple representation and does not capture all the details of the original logo.", + "tool_calls": [] + }, + "logprobs": null, + "finish_reason": "stop", + "stop_reason": null + } + ], + "usage": { + "prompt_tokens": 18, + "total_tokens": 73, + "completion_tokens": 55 + }, + "prompt_logprobs": null +} +``` + +
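+For reference, the same request can be sent from Python via the OpenAI client. This is a minimal sketch: it assumes `pip install openai`, and the `base_url` below is a placeholder for your own `http://$ENDPOINT/v1`:
+
+```python
+# Sketch: multimodal chat completion against the vLLM OpenAI-compatible server.
+import openai
+
+# base_url is illustrative; use the address from
+# `sky status --endpoint 8081 llama3_2-vision`.
+client = openai.OpenAI(base_url='http://1.2.3.4:8081/v1', api_key='token')
+resp = client.chat.completions.create(
+    model='meta-llama/Llama-3.2-11B-Vision-Instruct',
+    messages=[{
+        'role': 'user',
+        'content': [
+            {'type': 'text', 'text': 'Turn this logo into ASCII art.'},
+            # Depending on the vLLM version, a local image can often be sent
+            # as a base64 `data:` URL instead of a public http(s) URL.
+            {'type': 'image_url', 'image_url': {'url': 'https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg'}},
+        ],
+    }],
+    max_tokens=1024)
+print(resp.choices[0].message.content)
+```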
+
+
+## Serving Llama 3.2: scaling up with SkyServe
+
+After playing with the model, you can deploy it with autoscaling and load-balancing using SkyServe.
+
+With no change to the YAML, launch a fully managed service on your infra:
+```console
+HF_TOKEN=xxx sky serve up llama3_2-vision-11b.yaml -n llama3_2 --env HF_TOKEN
+```
+
+Wait until the service is ready:
+```console
+watch -n10 sky serve status llama3_2
+```
+
+<details>
+Example outputs: + +```console +Services +NAME VERSION UPTIME STATUS REPLICAS ENDPOINT +llama3_2 1 35s READY 2/2 xx.yy.zz.100:30001 + +Service Replicas +SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION +llama3_2 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 8}) READY us-east4 +llama3_2 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 8}) READY us-east4 +``` +
+
+
+Get a single endpoint that load-balances across replicas:
+```console
+ENDPOINT=$(sky serve status --endpoint llama3_2)
+```
+
+> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
+
+To curl the endpoint:
+```console
+curl http://$ENDPOINT/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer token' \
+  --data '{
+    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
+    "messages": [
+    {
+        "role": "user",
+        "content": [
+        {"type" : "text", "text": "Convert this logo to ASCII art"},
+        {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
+    ]
+    }],
+    "max_tokens": 2048
+  }' | jq .
+```
+
+To shut down all resources:
+```console
+sky serve down llama3_2
+```
+
+See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html).
+
+
+## Developing and Finetuning Llama 3 series
+
+SkyPilot also simplifies the development and finetuning of the Llama 3 series. Check out the development and finetuning guides: [Develop](https://github.com/skypilot-org/skypilot/blob/master/llm/llama-3_1/README.md) and [Finetune](https://github.com/skypilot-org/skypilot/blob/master/llm/llama-3_1-finetuning/README.md).
diff --git a/llm/llama-3_2/llama3_2-vision-11b.yaml b/llm/llama-3_2/llama3_2-vision-11b.yaml
new file mode 100644
index 00000000000..59f823ac875
--- /dev/null
+++ b/llm/llama-3_2/llama3_2-vision-11b.yaml
@@ -0,0 +1,95 @@
+# Serving Meta Llama 3.2 on your own infra.
+#
+# Usage:
+#
+#  HF_TOKEN=xxx sky launch llama3_2-vision-11b.yaml -c llama3_2-vision --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#   ENDPOINT=$(sky status --endpoint 8081 llama3_2-vision)
+#
+#   # We need to manually specify the stop_token_ids to make sure the model
+#   # finishes on <|eot_id|>.
+#   curl http://$ENDPOINT/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -d '{
+#       "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
+#       "messages": [
+#         {
+#           "role": "system",
+#           "content": "You are a helpful assistant."
+#         },
+#         {
+#           "role": "user",
+#           "content": "Who are you?"
+#         }
+#       ],
+#       "stop_token_ids": [128009, 128001]
+#     }'
+#
+# Chat with the model with Gradio UI:
+#
+#   Running on local URL:  http://127.0.0.1:8811
+#   Running on public URL: https://.gradio.live
+#
+# Scale up with SkyServe:
+#  HF_TOKEN=xxx sky serve up llama3_2-vision-11b.yaml -n llama3_2 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#   ENDPOINT=$(sky serve status --endpoint llama3_2)
+#   curl -L $ENDPOINT/v1/models
+#   curl -L http://$ENDPOINT/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -d '{
+#       "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
+#       "messages": [
+#         {
+#           "role": "system",
+#           "content": "You are a helpful assistant."
+#         },
+#         {
+#           "role": "user",
+#           "content": "Who are you?"
+#         }
+#       ]
+#     }'
+
+
+envs:
+  MODEL_NAME: meta-llama/Llama-3.2-11B-Vision-Instruct
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+
+service:
+  replicas: 2
+  # An actual request for readiness probe.
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: $MODEL_NAME
+      messages:
+        - role: user
+          content: Hello! What is your name?
+      max_tokens: 1
+
+resources:
+  accelerators: {L40, L40S, A100, A100-80GB, H100}
+  disk_size: 1000  # Ensure model checkpoints can fit.
+  disk_tier: best
+  ports: 8081  # Expose to internet traffic.
+
+setup: |
+  pip install vllm==0.6.2
+
+
+run: |
+  echo 'Starting vllm api server...'
+
+  vllm serve $MODEL_NAME \
+    --enforce-eager \
+    --limit-mm-per-prompt "image=1" \
+    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --max-model-len 4096 \
+    --max-num-seqs 40 \
+    --port 8081 \
+    --disable-log-requests
diff --git a/llm/llama-3_2/llama3_2.yaml b/llm/llama-3_2/llama3_2.yaml
new file mode 100644
index 00000000000..60fe36cce29
--- /dev/null
+++ b/llm/llama-3_2/llama3_2.yaml
@@ -0,0 +1,94 @@
+# Serving Meta Llama 3.2 on your own infra.
+#
+# Usage:
+#
+#  HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#   ENDPOINT=$(sky status --endpoint 8081 llama3_2)
+#
+#   # We need to manually specify the stop_token_ids to make sure the model
+#   # finishes on <|eot_id|>.
+#   curl http://$ENDPOINT/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -d '{
+#       "model": "meta-llama/Llama-3.2-3B-Instruct",
+#       "messages": [
+#         {
+#           "role": "system",
+#           "content": "You are a helpful assistant."
+#         },
+#         {
+#           "role": "user",
+#           "content": "Who are you?"
+#         }
+#       ],
+#       "stop_token_ids": [128009, 128001]
+#     }'
+#
+# Chat with the model with Gradio UI:
+#
+#   Running on local URL:  http://127.0.0.1:8811
+#   Running on public URL: https://.gradio.live
+#
+# Scale up with SkyServe:
+#  HF_TOKEN=xxx sky serve up llama3_2.yaml -n llama3_2 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#   ENDPOINT=$(sky serve status --endpoint llama3_2)
+#   curl -L $ENDPOINT/v1/models
+#   curl -L http://$ENDPOINT/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -d '{
+#       "model": "meta-llama/Llama-3.2-3B-Instruct",
+#       "messages": [
+#         {
+#           "role": "system",
+#           "content": "You are a helpful assistant."
+#         },
+#         {
+#           "role": "user",
+#           "content": "Who are you?"
+#         }
+#       ]
+#     }'
+
+
+envs:
+  MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
+  # MODEL_NAME: meta-llama/Llama-3.2-11B-Vision-Instruct
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+
+service:
+  replicas: 2
+  # An actual request for readiness probe.
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: $MODEL_NAME
+      messages:
+        - role: user
+          content: Hello! What is your name?
+      max_tokens: 1
+
+resources:
+  accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
+  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for the smaller models.
+  cpus: 8+
+  disk_size: 512  # Ensure model checkpoints can fit.
+  disk_tier: best
+  ports: 8081  # Expose to internet traffic.
+
+setup: |
+  pip install vllm==0.6.2
+
+
+run: |
+  echo 'Starting vllm api server...'
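+
+  # Note: --tensor-parallel-size below follows the number of GPUs SkyPilot
+  # allocated to this node, exposed via $SKYPILOT_NUM_GPUS_PER_NODE.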
+ + vllm serve $MODEL_NAME \ + --port 8081 \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-model-len 4096 From d4f96e6d7931c37d6d9c7cd4b05565223063130c Mon Sep 17 00:00:00 2001 From: landscapepainter <34902420+landscapepainter@users.noreply.github.com> Date: Wed, 25 Sep 2024 19:00:36 -0700 Subject: [PATCH 15/93] [k8s] Add cluster attributes(autodown, idle-minutes-to-autostop) as annotations to the pod (#3870) * add autodown annotations to the k8s pod * revert kubernetes ray template * revert backend_utils from invasive approach * nit * revert from invasive approaches * revert * updated approach * nit * nit * Use constant to represent idle_minutes_to_autostop for cancellation * revert using constants for cancel * nit * nit * add smoke tests * Update sky/provision/kubernetes/utils.py Co-authored-by: Romil Bhardwaj * fix comments * nit * remove loops and annotate one by one * format * update with autodown annotation with context * format --------- Co-authored-by: Romil Bhardwaj --- sky/backends/cloud_vm_ray_backend.py | 8 ++ sky/provision/kubernetes/instance.py | 61 +++--------- sky/provision/kubernetes/utils.py | 143 +++++++++++++++++++++++++++ tests/test_smoke.py | 58 +++++++++++ 4 files changed, 224 insertions(+), 46 deletions(-) diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 0831bad65fb..e580b9ba550 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -48,6 +48,7 @@ from sky.provision import instance_setup from sky.provision import metadata_utils from sky.provision import provisioner +from sky.provision.kubernetes import utils as kubernetes_utils from sky.skylet import autostop_lib from sky.skylet import constants from sky.skylet import job_lib @@ -4180,6 +4181,13 @@ def set_autostop(self, global_user_state.set_cluster_autostop_value( handle.cluster_name, idle_minutes_to_autostop, down) + # Add/Remove autodown annotations to/from Kubernetes pods. 
+ if isinstance(handle.launched_resources.cloud, clouds.Kubernetes): + kubernetes_utils.set_autodown_annotations( + handle=handle, + idle_minutes_to_autostop=idle_minutes_to_autostop, + down=down) + def is_definitely_autostopping(self, handle: CloudVmRayResourceHandle, stream_logs: bool = True) -> bool: diff --git a/sky/provision/kubernetes/instance.py b/sky/provision/kubernetes/instance.py index 83f9c34592e..f9ee75e466b 100644 --- a/sky/provision/kubernetes/instance.py +++ b/sky/provision/kubernetes/instance.py @@ -28,42 +28,6 @@ TAG_SKYPILOT_CLUSTER_NAME = 'skypilot-cluster-name' TAG_POD_INITIALIZED = 'skypilot-initialized' -POD_STATUSES = { - 'Pending', 'Running', 'Succeeded', 'Failed', 'Unknown', 'Terminating' -} - - -def to_label_selector(tags): - label_selector = '' - for k, v in tags.items(): - if label_selector != '': - label_selector += ',' - label_selector += '{}={}'.format(k, v) - return label_selector - - -def _filter_pods(namespace: str, context: str, tag_filters: Dict[str, str], - status_filters: Optional[List[str]]) -> Dict[str, Any]: - """Filters pods by tags and status.""" - non_included_pod_statuses = POD_STATUSES.copy() - - field_selector = '' - if status_filters is not None: - non_included_pod_statuses -= set(status_filters) - field_selector = ','.join( - [f'status.phase!={status}' for status in non_included_pod_statuses]) - - label_selector = to_label_selector(tag_filters) - pod_list = kubernetes.core_api(context).list_namespaced_pod( - namespace, field_selector=field_selector, label_selector=label_selector) - - # Don't return pods marked for deletion, - # i.e. pods with non-null metadata.DeletionTimestamp. - pods = [ - pod for pod in pod_list.items if pod.metadata.deletion_timestamp is None - ] - return {pod.metadata.name: pod for pod in pods} - def _get_head_pod_name(pods: Dict[str, Any]) -> Optional[str]: head_pod_name = None @@ -475,7 +439,8 @@ def _create_pods(region: str, cluster_name_on_cloud: str, pod_spec['metadata']['labels'].update( {TAG_SKYPILOT_CLUSTER_NAME: cluster_name_on_cloud}) - terminating_pods = _filter_pods(namespace, context, tags, ['Terminating']) + terminating_pods = kubernetes_utils.filter_pods(namespace, context, tags, + ['Terminating']) start_time = time.time() while (len(terminating_pods) > 0 and time.time() - start_time < _TIMEOUT_FOR_POD_TERMINATION): @@ -483,8 +448,8 @@ def _create_pods(region: str, cluster_name_on_cloud: str, 'terminating pods. Waiting them to finish: ' f'{list(terminating_pods.keys())}') time.sleep(POLL_INTERVAL) - terminating_pods = _filter_pods(namespace, context, tags, - ['Terminating']) + terminating_pods = kubernetes_utils.filter_pods(namespace, context, + tags, ['Terminating']) if len(terminating_pods) > 0: # If there are still terminating pods, we force delete them. 
@@ -501,8 +466,8 @@ def _create_pods(region: str, cluster_name_on_cloud: str, _request_timeout=config_lib.DELETION_TIMEOUT, grace_period_seconds=0) - running_pods = _filter_pods(namespace, context, tags, - ['Pending', 'Running']) + running_pods = kubernetes_utils.filter_pods(namespace, context, tags, + ['Pending', 'Running']) head_pod_name = _get_head_pod_name(running_pods) logger.debug(f'Found {len(running_pods)} existing pods: ' f'{list(running_pods.keys())}') @@ -583,7 +548,8 @@ def _create_pods(region: str, cluster_name_on_cloud: str, if head_pod_name is None: head_pod_name = pod.metadata.name - wait_pods_dict = _filter_pods(namespace, context, tags, ['Pending']) + wait_pods_dict = kubernetes_utils.filter_pods(namespace, context, tags, + ['Pending']) wait_pods = list(wait_pods_dict.values()) networking_mode = network_utils.get_networking_mode( @@ -613,8 +579,9 @@ def _create_pods(region: str, cluster_name_on_cloud: str, logger.debug(f'run_instances: all pods are scheduled and running: ' f'{list(wait_pods_dict.keys())}') - running_pods = _filter_pods(namespace, context, tags, ['Running']) - initialized_pods = _filter_pods(namespace, context, { + running_pods = kubernetes_utils.filter_pods(namespace, context, tags, + ['Running']) + initialized_pods = kubernetes_utils.filter_pods(namespace, context, { TAG_POD_INITIALIZED: 'true', **tags }, ['Running']) @@ -722,7 +689,7 @@ def terminate_instances( tag_filters = { TAG_RAY_CLUSTER_NAME: cluster_name_on_cloud, } - pods = _filter_pods(namespace, context, tag_filters, None) + pods = kubernetes_utils.filter_pods(namespace, context, tag_filters, None) def _is_head(pod) -> bool: return pod.metadata.labels[constants.TAG_RAY_NODE_KIND] == 'head' @@ -746,7 +713,9 @@ def get_cluster_info( TAG_RAY_CLUSTER_NAME: cluster_name_on_cloud, } - running_pods = _filter_pods(namespace, context, tag_filters, ['Running']) + running_pods = kubernetes_utils.filter_pods(namespace, context, tag_filters, + ['Running']) + pods: Dict[str, List[common.InstanceInfo]] = {} head_pod_name = None diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index 6aa6400dfa1..a8abb24b917 100644 --- a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -6,6 +6,7 @@ import re import shutil import subprocess +import typing from typing import Any, Dict, List, Optional, Set, Tuple, Union from urllib.parse import urlparse @@ -17,6 +18,7 @@ from sky import sky_logging from sky import skypilot_config from sky.adaptors import kubernetes +from sky.provision import constants as provision_constants from sky.provision.kubernetes import network_utils from sky.skylet import constants from sky.utils import common_utils @@ -25,6 +27,9 @@ from sky.utils import schemas from sky.utils import ux_utils +if typing.TYPE_CHECKING: + from sky import backends + # TODO(romilb): Move constants to constants.py DEFAULT_NAMESPACE = 'default' @@ -64,6 +69,16 @@ PORT_FORWARD_PROXY_CMD_PATH = ('~/.sky/kubernetes-port-forward-proxy-command-' f'v{PORT_FORWARD_PROXY_CMD_VERSION}.sh') +POD_STATUSES = { + 'Pending', 'Running', 'Succeeded', 'Failed', 'Unknown', 'Terminating' +} +AUTODOWN_ANNOTATION_KEY = 'skypilot.co/autodown' +IDLE_MINUTES_TO_AUTOSTOP_ANNOTATION_KEY = ( + 'skypilot.co/idle_minutes_to_autostop') +ANNOTATIONS_POD_NOT_FOUND_ERROR_MSG = ('Pod {pod_name} not found in namespace ' + '{namespace} while trying to {action} ' + 'an annotation {annotation}.') + logger = sky_logging.init_logger(__name__) @@ -1748,11 +1763,139 @@ def get_kubernetes_node_info() -> 
Dict[str, KubernetesNodeInfo]: return node_info_dict +def to_label_selector(tags): + label_selector = '' + for k, v in tags.items(): + if label_selector != '': + label_selector += ',' + label_selector += '{}={}'.format(k, v) + return label_selector + + def get_namespace_from_config(provider_config: Dict[str, Any]) -> str: return provider_config.get('namespace', get_current_kube_config_context_namespace()) +def filter_pods(namespace: str, + context: str, + tag_filters: Dict[str, str], + status_filters: Optional[List[str]] = None) -> Dict[str, Any]: + """Filters pods by tags and status.""" + non_included_pod_statuses = POD_STATUSES.copy() + + field_selector = '' + if status_filters is not None: + non_included_pod_statuses -= set(status_filters) + field_selector = ','.join( + [f'status.phase!={status}' for status in non_included_pod_statuses]) + + label_selector = to_label_selector(tag_filters) + pod_list = kubernetes.core_api(context).list_namespaced_pod( + namespace, field_selector=field_selector, label_selector=label_selector) + + # Don't return pods marked for deletion, + # i.e. pods with non-null metadata.DeletionTimestamp. + pods = [ + pod for pod in pod_list.items if pod.metadata.deletion_timestamp is None + ] + return {pod.metadata.name: pod for pod in pods} + + +def _remove_pod_annotation(pod: Any, annotation_key: str, + namespace: str) -> None: + """Removes specified Annotations from a Kubernetes pod.""" + try: + # Remove the specified annotation + if pod.metadata.annotations: + if annotation_key in pod.metadata.annotations: + # Patch the pod with the updated metadata. + body = {'metadata': {'annotations': {annotation_key: None}}} + kubernetes.core_api().patch_namespaced_pod( + name=pod.metadata.name, + namespace=namespace, + body=body, + _request_timeout=kubernetes.API_TIMEOUT) + + except kubernetes.api_exception() as e: + if e.status == 404: + logger.warning( + ANNOTATIONS_POD_NOT_FOUND_ERROR_MSG.format( + pod_name=pod.metadata.name, + namespace=namespace, + action='remove', + annotation=annotation_key)) + else: + with ux_utils.print_exception_no_traceback(): + raise + + +def _add_pod_annotation(pod: Any, annotation: Dict[str, str], + namespace: str) -> None: + """Adds specified Annotations on a Kubernetes pod.""" + try: + # Patch the pod with the updated metadata + body = {'metadata': {'annotations': annotation}} + kubernetes.core_api().patch_namespaced_pod( + name=pod.metadata.name, + namespace=namespace, + body=body, + _request_timeout=kubernetes.API_TIMEOUT) + + except kubernetes.api_exception() as e: + if e.status == 404: + logger.warning( + ANNOTATIONS_POD_NOT_FOUND_ERROR_MSG.format( + pod_name=pod.metadata.name, + namespace=namespace, + action='add', + annotation=annotation)) + else: + with ux_utils.print_exception_no_traceback(): + raise + + +def set_autodown_annotations(handle: 'backends.CloudVmRayResourceHandle', + idle_minutes_to_autostop: Optional[int], + down: bool = False) -> None: + """Adds or removes Annotations of autodown on Kubernetes pods.""" + tags = { + provision_constants.TAG_RAY_CLUSTER_NAME: handle.cluster_name_on_cloud, + } + ray_config = common_utils.read_yaml(handle.cluster_yaml) + provider_config = ray_config['provider'] + namespace = get_namespace_from_config(provider_config) + context = get_context_from_config(provider_config) + running_pods = filter_pods(namespace, context, tags) + + for _, pod in running_pods.items(): + if down: + idle_minutes_to_autostop_annotation = { + IDLE_MINUTES_TO_AUTOSTOP_ANNOTATION_KEY: + str(idle_minutes_to_autostop) + } 
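+            # Stamp both the idle timeout and the autodown flag onto the pod,
+            # so they can later be read back from the pod metadata.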
+ autodown_annotation = {AUTODOWN_ANNOTATION_KEY: 'true'} + _add_pod_annotation(pod=pod, + annotation=idle_minutes_to_autostop_annotation, + namespace=namespace) + _add_pod_annotation(pod=pod, + annotation=autodown_annotation, + namespace=namespace) + + # If idle_minutes_to_autostop is negative, it indicates a request to + # cancel autostop using the --cancel flag with the `sky autostop` + # command. + elif (idle_minutes_to_autostop is not None and + idle_minutes_to_autostop < 0): + _remove_pod_annotation( + pod=pod, + annotation_key=IDLE_MINUTES_TO_AUTOSTOP_ANNOTATION_KEY, + namespace=namespace) + _remove_pod_annotation(pod=pod, + annotation_key=AUTODOWN_ANNOTATION_KEY, + namespace=namespace) + + def get_context_from_config(provider_config: Dict[str, Any]) -> str: return provider_config.get('context', get_current_kube_config_context_name()) diff --git a/tests/test_smoke.py b/tests/test_smoke.py index 3b2bba72e8a..c616d9a8b30 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -2121,6 +2121,64 @@ def test_task_labels_kubernetes(): run_one_test(test) +# ---------- Pod Annotations on Kubernetes ---------- +@pytest.mark.kubernetes +def test_add_pod_annotations_for_autodown_with_launch(): + name = _get_cluster_name() + test = Test( + 'add_pod_annotations_for_autodown_with_launch', + [ + # Launch Kubernetes cluster with two nodes, each being head node and worker node. + # Autodown is set. + f'sky launch -y -c {name} -i 10 --down --num-nodes 2 --cpus=1 --cloud kubernetes', + # Get names of the pods containing cluster name. + f'pod_1=$(kubectl get pods -o name | grep {name} | sed -n 1p)', + f'pod_2=$(kubectl get pods -o name | grep {name} | sed -n 2p)', + # Describe the first pod and check for annotations. + 'kubectl describe pod $pod_1 | grep -q skypilot.co/autodown', + 'kubectl describe pod $pod_1 | grep -q skypilot.co/idle_minutes_to_autostop', + # Describe the second pod and check for annotations. + 'kubectl describe pod $pod_2 | grep -q skypilot.co/autodown', + 'kubectl describe pod $pod_2 | grep -q skypilot.co/idle_minutes_to_autostop' + ], + f'sky down -y {name}', + ) + run_one_test(test) + + +@pytest.mark.kubernetes +def test_add_and_remove_pod_annotations_with_autostop(): + name = _get_cluster_name() + test = Test( + 'add_and_remove_pod_annotations_with_autostop', + [ + # Launch Kubernetes cluster with two nodes, each being head node and worker node. + f'sky launch -y -c {name} --num-nodes 2 --cpus=1 --cloud kubernetes', + # Set autodown on the cluster with 'autostop' command. + f'sky autostop -y {name} -i 20 --down', + # Get names of the pods containing cluster name. + f'pod_1=$(kubectl get pods -o name | grep {name} | sed -n 1p)', + f'pod_2=$(kubectl get pods -o name | grep {name} | sed -n 2p)', + # Describe the first pod and check for annotations. + 'kubectl describe pod $pod_1 | grep -q skypilot.co/autodown', + 'kubectl describe pod $pod_1 | grep -q skypilot.co/idle_minutes_to_autostop', + # Describe the second pod and check for annotations. + 'kubectl describe pod $pod_2 | grep -q skypilot.co/autodown', + 'kubectl describe pod $pod_2 | grep -q skypilot.co/idle_minutes_to_autostop', + # Cancel the set autodown to remove the annotations from the pods. + f'sky autostop -y {name} --cancel', + # Describe the first pod and check if annotations are removed. + '! kubectl describe pod $pod_1 | grep -q skypilot.co/autodown', + '! kubectl describe pod $pod_1 | grep -q skypilot.co/idle_minutes_to_autostop', + # Describe the second pod and check if annotations are removed. + '! 
kubectl describe pod $pod_2 | grep -q skypilot.co/autodown', + '! kubectl describe pod $pod_2 | grep -q skypilot.co/idle_minutes_to_autostop', + ], + f'sky down -y {name}', + ) + run_one_test(test) + + # ---------- Container logs from task on Kubernetes ---------- @pytest.mark.kubernetes def test_container_logs_multinode_kubernetes(): From e95332b9eb8de4cdcac464ff704bf64f3285e776 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Wed, 25 Sep 2024 20:49:00 -0700 Subject: [PATCH 16/93] [Examples] Add airflow example (#3982) * Airflow example * Airflow example * Airflow example * Airflow example * wip * Update airflow examples * Update airflow examples * Update airflow examples * Add to readme * Add to readme * Add to readme * lint * updates * less salesy * comments * comments * comments --- README.md | 2 +- docs/source/docs/index.rst | 2 +- examples/airflow/README.md | 9 + examples/airflow/shared_state/README.md | 174 ++++++++++++++++++ examples/airflow/shared_state/sky-pv.yaml | 11 ++ examples/airflow/shared_state/sky-sa.yaml | 18 ++ .../airflow/shared_state/sky_k8s_example.py | 64 +++++++ .../shared_state/sky_k8s_example_xcoms.py | 87 +++++++++ examples/airflow/training_workflow/README.md | 166 +++++++++++++++++ .../training_workflow/create_gcloud_secret.sh | 30 +++ .../airflow/training_workflow/sky-sa.yaml | 18 ++ .../sky_k8s_train_pipeline.py | 87 +++++++++ 12 files changed, 666 insertions(+), 2 deletions(-) create mode 100644 examples/airflow/README.md create mode 100644 examples/airflow/shared_state/README.md create mode 100644 examples/airflow/shared_state/sky-pv.yaml create mode 100644 examples/airflow/shared_state/sky-sa.yaml create mode 100644 examples/airflow/shared_state/sky_k8s_example.py create mode 100644 examples/airflow/shared_state/sky_k8s_example_xcoms.py create mode 100644 examples/airflow/training_workflow/README.md create mode 100755 examples/airflow/training_workflow/create_gcloud_secret.sh create mode 100644 examples/airflow/training_workflow/sky-sa.yaml create mode 100644 examples/airflow/training_workflow/sky_k8s_train_pipeline.py diff --git a/README.md b/README.md index f887c6d690f..1f646b0e995 100644 --- a/README.md +++ b/README.md @@ -180,7 +180,7 @@ Runnable examples: - [LocalGPT](./llm/localgpt) - [Falcon](./llm/falcon) - Add yours here & see more in [`llm/`](./llm)! 
-- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2) and [many more (`examples/`)](./examples). +- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2), [Airflow](./examples/airflow/training_workflow) and [many more (`examples/`)](./examples). Case Studies and Integrations: [Community Spotlights](https://blog.skypilot.co/community/) diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index 00a645a3834..c219fcd5c85 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -103,7 +103,7 @@ Runnable examples: * `Falcon `_ * Add yours here & see more in `llm/ `_! -* Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_, `llm.c `__ and `many more `_. 
+* Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_, `llm.c `__, `Airflow `_ and `many more `_. Case Studies and Integrations: `Community Spotlights `_ diff --git a/examples/airflow/README.md b/examples/airflow/README.md new file mode 100644 index 00000000000..80d86f22b97 --- /dev/null +++ b/examples/airflow/README.md @@ -0,0 +1,9 @@ +# SkyPilot Airflow integration examples + +This directory contains two examples of integrating SkyPilot with Apache Airflow: +1. [training_workflow](training_workflow) + * A simple training workflow that preprocesses data, trains a model, and evaluates it. + * Showcases how SkyPilot can help easily transition from dev to production in Airflow. +2. [shared_state](shared_state) + * An example showing how SkyPilot state can be persisted across Airflow tasks. + * Useful for operating on the same shared SkyPilot clusters from different Airflow tasks. \ No newline at end of file diff --git a/examples/airflow/shared_state/README.md b/examples/airflow/shared_state/README.md new file mode 100644 index 00000000000..5f39471351a --- /dev/null +++ b/examples/airflow/shared_state/README.md @@ -0,0 +1,174 @@ +# Running SkyPilot tasks in an Airflow DAG + +SkyPilot can be used in an orchestration framework like Airflow to launch tasks as a part of a DAG. + +In this guide, we demonstrate how some simple SkyPilot operations, such as launching a cluster, getting its logs and tearing it down, can be orchestrated using Airflow. + +
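+At a glance, the four Airflow tasks in this example each wrap one SkyPilot CLI call, shown here as you would run them by hand (the DAG below passes the same strings to a `get_skypilot_task` helper):
+
+```bash
+sky launch -y -c train --cloud kubernetes skypilot/examples/minimal.yaml
+sky logs train > task_logs.txt
+sky status
+sky down train
+```
+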
+ +## Prerequisites + +* Airflow installed on a [Kubernetes cluster](https://airflow.apache.org/docs/helm-chart/stable/index.html) or [locally](https://airflow.apache.org/docs/apache-airflow/stable/start.html) (`SequentialExecutor`) +* A Kubernetes cluster to run tasks on. We'll use GKE in this example. + * You can use our guide on [setting up a Kubernetes cluster](https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html). + * A persistent volume storage class should be available that supports at least `ReadWriteOnce` access mode. GKE has this supported by default. + +## Preparing the Kubernetes Cluster + +1. Provision a service account on your Kubernetes cluster for SkyPilot to use to launch tasks. + ```bash + kubectl apply -f sky-sa.yaml + ``` + For reference, here are the contents of `sky-sa.yaml`: + ```yaml + # sky-sa.yaml + apiVersion: v1 + kind: ServiceAccount + metadata: + name: sky-airflow-sa + namespace: default + --- + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRoleBinding + metadata: + name: sky-airflow-sa-binding + subjects: + - kind: ServiceAccount + name: sky-airflow-sa + namespace: default + roleRef: + # For minimal permissions, refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html + kind: ClusterRole + name: cluster-admin + apiGroup: rbac.authorization.k8s.io + ``` + +2. Provision a persistent volume for SkyPilot to store state across runs. + ```bash + kubectl apply -f sky-pv.yaml + ``` + For reference, here are the contents of `sky-pv.yaml`: + ```yaml + # sky-pv.yaml + apiVersion: v1 + kind: PersistentVolumeClaim + metadata: + name: sky-pvc + spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi # 10Gi is minimum for GKE pd-balanced + storageClassName: standard-rwo + ``` + Note: The `storageClassName` should be set to the appropriate storage class that's supported on your cluster. If you have many concurrent tasks, you may want to use a storage class that supports `ReadWriteMany` access mode. + +## Writing the Airflow DAG + +We provide an example DAG in `sky_k8s_example.py` that: +1. Launches a SkyPilot cluster. +2. Writes logs from the cluster to a local file +3. Checks the status of the cluster and prints to Airflow logs +4. Tears down the cluster. 
+
+The DAG is defined in `sky_k8s_example.py`:
+
+```python
+# sky_k8s_example.py
+from airflow import DAG
+from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
+from airflow.utils.dates import days_ago
+
+from kubernetes.client import models as k8s
+
+default_args = {
+    'owner': 'airflow',
+    'start_date': days_ago(1),
+}
+
+def get_skypilot_task(task_id: str, sky_command: str):
+    skypilot_task = KubernetesPodOperator(
+        task_id=task_id,
+        name="skypilot-pod",
+        namespace="default",
+        image="us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:20240613",
+        cmds=["/bin/bash", "-i", "-c"],
+        arguments=[
+            "chown -R 1000:1000 /home/sky/.sky /home/sky/.ssh && "
+            "pip install skypilot-nightly[kubernetes] && "
+            f"{sky_command}"],
+        service_account_name="sky-airflow-sa",
+        env_vars={"HOME": "/home/sky"},
+        volumes=[
+            k8s.V1Volume(
+                name="sky-pvc",
+                persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
+                    claim_name="sky-pvc"
+                ),
+            ),
+        ],
+        volume_mounts=[
+            k8s.V1VolumeMount(name="sky-pvc", mount_path="/home/sky/.sky",
+                              sub_path="sky"),
+            k8s.V1VolumeMount(name="sky-pvc", mount_path="/home/sky/.ssh",
+                              sub_path="ssh"),
+        ],
+        is_delete_operator_pod=True,
+        get_logs=True,
+    )
+    return skypilot_task
+
+
+with DAG(dag_id='sky_k8s_example',
+         default_args=default_args,
+         schedule_interval=None,
+         catchup=False) as dag:
+    # Task to launch a SkyPilot cluster
+    cmds = ("git clone https://github.com/skypilot-org/skypilot.git && "
+            "sky launch -y -c train --cloud kubernetes skypilot/examples/minimal.yaml")
+    sky_launch = get_skypilot_task("sky_launch", cmds)
+    # Task to get the logs of the SkyPilot cluster
+    sky_logs = get_skypilot_task("sky_logs", "sky logs train > task_logs.txt")
+    # Task to get the list of SkyPilot clusters
+    sky_status = get_skypilot_task("sky_status", "sky status")
+    # Task to delete the SkyPilot cluster
+    sky_down = get_skypilot_task("sky_down", "sky down train")
+
+    sky_launch >> sky_logs >> sky_status >> sky_down
+```
+
+## Running the DAG
+
+1. Copy the DAG file to the Airflow DAGs directory.
+   ```bash
+   cp sky_k8s_example.py /path/to/airflow/dags
+   # If your Airflow is running on Kubernetes, you may use kubectl cp to copy the file to the pod
+   # kubectl cp sky_k8s_example.py <airflow-pod-name>:/opt/airflow/dags
+   ```
+2. Run `airflow dags list` to confirm that the DAG is loaded.
+3. Find the DAG in the Airflow UI (typically http://localhost:8080) and enable it. The UI may take a couple of minutes to reflect the changes.
+4. Trigger the DAG from the Airflow UI using the `Trigger DAG` button.
+5. Navigate to the run in the Airflow UI to see the DAG progress and logs of each task.
+
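+Because the SkyPilot state database and SSH keys live on the shared PVC, every task in the DAG sees the same clusters. For example, a follow-up task could reuse the `train` cluster instead of launching a new one. A minimal sketch of such commands (wrap them with the same `get_skypilot_task` helper as above):
+
+```bash
+# Run an extra command on the existing 'train' cluster:
+sky exec train -- nvidia-smi
+# Inspect its job queue:
+sky queue train
+```
+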
+
+## Tips
+
+1. **Persistent Volume**: If you have many concurrent tasks, you may want to use a storage class that supports [`ReadWriteMany`](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) access mode.
+2. **Cloud credentials**: If you wish to run tasks on different clouds, you can configure cloud credentials in Kubernetes secrets and mount them in the Sky pod defined in the DAG. See [SkyPilot docs on setting up cloud credentials](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloud-account-setup) for more on how to configure credentials in the pod.
+3. **Logging**: All SkyPilot logs are written to container stdout, which is captured as task logs in Airflow and displayed in the UI. You can also write logs to a file and read them in subsequent tasks.
+4. **XComs for shared state**: Airflow also provides [XComs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) for cross-task communication. [`sky_k8s_example_xcoms.py`](sky_k8s_example_xcoms.py) demonstrates how to use XComs to share state between tasks.
+
+## Future work: a native Airflow Executor built on SkyPilot
+
+SkyPilot can in the future provide a native Airflow Executor that offers an operator similar to the `KubernetesPodOperator` but runs the task as a native SkyPilot task.
+
+In such a setup, SkyPilot state management would no longer be required, as the executor will handle SkyPilot cluster launching and termination.
\ No newline at end of file
diff --git a/examples/airflow/shared_state/sky-pv.yaml b/examples/airflow/shared_state/sky-pv.yaml
new file mode 100644
index 00000000000..c17198515c4
--- /dev/null
+++ b/examples/airflow/shared_state/sky-pv.yaml
@@ -0,0 +1,11 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: sky-pvc
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+  storageClassName: standard-rwo
diff --git a/examples/airflow/shared_state/sky-sa.yaml b/examples/airflow/shared_state/sky-sa.yaml
new file mode 100644
index 00000000000..b791bafdec1
--- /dev/null
+++ b/examples/airflow/shared_state/sky-sa.yaml
@@ -0,0 +1,18 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: sky-airflow-sa
+  namespace: default
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: sky-airflow-sa-binding
+subjects:
+- kind: ServiceAccount
+  name: sky-airflow-sa
+  namespace: default
+roleRef:
+  kind: ClusterRole
+  name: cluster-admin
+  apiGroup: rbac.authorization.k8s.io
diff --git a/examples/airflow/shared_state/sky_k8s_example.py b/examples/airflow/shared_state/sky_k8s_example.py
new file mode 100644
index 00000000000..e61b4e92e5c
--- /dev/null
+++ b/examples/airflow/shared_state/sky_k8s_example.py
@@ -0,0 +1,64 @@
+from airflow import DAG
+from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
+    KubernetesPodOperator)
+from airflow.utils.dates import days_ago
+from kubernetes.client import models as k8s
+
+default_args = {
+    'owner': 'airflow',
+    'start_date': days_ago(1),
+}
+
+
+def get_skypilot_task(task_id: str, sky_command: str):
+    skypilot_task = KubernetesPodOperator(
+        task_id=task_id,
+        name="skypilot-pod",
+        namespace="default",
+        image=
+        "us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:20240613",
+        cmds=["/bin/bash", "-i", "-c"],
+        arguments=[
+            "chown -R 1000:1000 /home/sky/.sky /home/sky/.ssh && "
+            "pip install skypilot-nightly[kubernetes] && "
+            f"{sky_command}"
+        ],
+        service_account_name="sky-airflow-sa",
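+        # HOME points at the PVC-backed paths mounted below, so `sky` reads
+        # and writes its state (~/.sky) and SSH keys (~/.ssh) across tasks.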
+        env_vars={"HOME": "/home/sky"},
+        volumes=[
+            k8s.V1Volume(
+                name="sky-pvc",
+                persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
+                    claim_name="sky-pvc"),
+            ),
+        ],
+        volume_mounts=[
+            k8s.V1VolumeMount(name="sky-pvc",
+                              mount_path="/home/sky/.sky",
+                              sub_path="sky"),
+            k8s.V1VolumeMount(name="sky-pvc",
+                              mount_path="/home/sky/.ssh",
+                              sub_path="ssh"),
+        ],
+        is_delete_operator_pod=True,
+        get_logs=True,
+    )
+    return skypilot_task
+
+
+with DAG(dag_id='sky_k8s_example',
+         default_args=default_args,
+         schedule_interval=None,
+         catchup=False) as dag:
+    # Task to launch a SkyPilot cluster
+    sky_launch = get_skypilot_task(
+        "sky_launch",
+        "sky launch -y -c train --cloud kubernetes -- echo training the model")
+    # Task to get the logs of the SkyPilot cluster
+    sky_logs = get_skypilot_task("sky_logs", "sky logs train > task_logs.txt")
+    # Task to get the list of SkyPilot clusters
+    sky_status = get_skypilot_task("sky_status", "sky status")
+    # Task to delete the SkyPilot cluster
+    sky_down = get_skypilot_task("sky_down", "sky down train")
+
+    sky_launch >> sky_logs >> sky_status >> sky_down
diff --git a/examples/airflow/shared_state/sky_k8s_example_xcoms.py b/examples/airflow/shared_state/sky_k8s_example_xcoms.py
new file mode 100644
index 00000000000..3bbac3299b3
--- /dev/null
+++ b/examples/airflow/shared_state/sky_k8s_example_xcoms.py
@@ -0,0 +1,87 @@
+# This is a WIP example that uses xcom serialization to pass state.db and sky keys between tasks.
+# This should not require PVCs to be mounted to the pod, and should be able to run on any Kubernetes cluster.
+from typing import Optional
+
+from airflow import DAG
+from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
+    KubernetesPodOperator)
+from airflow.utils.dates import days_ago
+
+default_args = {
+    'owner': 'airflow',
+    'start_date': days_ago(1),
+}
+
+
+def get_skypilot_task(task_id: str,
+                      sky_command: str,
+                      previous_task_id: Optional[str] = None,
+                      serialize_xcom: bool = False):
+    # Each piece appended below chains onto the previous one with ' && ' so
+    # that "".join(cmds) yields a single shell command for the pod.
+    cmds = [
+        "chown -R 1000:1000 /home/sky/.sky /home/sky/.ssh && "
+    ]
+
+    if previous_task_id is not None:
+        # Deserialize state.db and sky keys from xcom (if needed)
+        # TODO(romilb): Implement this using {{ ti.xcom_pull() }} templating
+        cmds.append(' echo \'{{ ti.xcom_pull(task_ids="' + previous_task_id +
+                    '")["state_db"] }}\' > /home/sky/.sky/state.db &&'
+                    ' echo \'{{ ti.xcom_pull(task_ids="' + previous_task_id +
+                    '")["sky_key"] }}\' > /home/sky/.ssh/sky-key &&'
+                    ' echo \'{{ ti.xcom_pull(task_ids="' + previous_task_id +
+                    '")["sky_key_pub"] }}\' > /home/sky/.ssh/sky-key.pub && ')
+
+    cmds.append(f"pip install skypilot-nightly[kubernetes] && {sky_command}")
+
+    if serialize_xcom:
+        # Serialize state.db and sky keys into xcom
+        cmds.append(
+            ' && echo \'{"state_db": "$(cat /home/sky/.sky/state.db)", '
+            '"sky_key": "$(cat /home/sky/.ssh/sky-key)", '
+            '"sky_key_pub": "$(cat /home/sky/.ssh/sky-key.pub)"}\' > /airflow/xcom/return.json'
+        )
+
+    task = KubernetesPodOperator(
+        task_id=task_id,
+        name="skypilot-pod",
+        namespace="default",
+        image=
+        "us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:20240613",
+        cmds=["/bin/bash", "-i", "-c"],
+        arguments=["".join(cmds)],
+        service_account_name="sky-airflow-sa",
+        env_vars={"HOME": "/home/sky"},
+        is_delete_operator_pod=True,
+        get_logs=True,
+        do_xcom_push=serialize_xcom  # Only push XCom if we're serializing data
+    )
+    return task
+
+
+with DAG(dag_id='sky_k8s_example_xcoms',
+         default_args=default_args,
+         schedule_interval=None,
+         catchup=False) as dag:
+    # Task to launch a SkyPilot cluster
+    sky_launch = get_skypilot_task(
+        "sky_launch",
+        "sky launch -y -c train --cloud kubernetes -- echo training the model",
+        previous_task_id=None,
+        serialize_xcom=True)
+    # Task to get the logs of the SkyPilot cluster
+    sky_logs = get_skypilot_task("sky_logs",
+                                 "sky logs train > task_logs.txt",
+                                 previous_task_id='sky_launch',
+                                 serialize_xcom=True)
+    # Task to get the list of SkyPilot clusters
+    sky_status = get_skypilot_task("sky_status",
+                                   "sky status",
+                                   previous_task_id='sky_logs',
+                                   serialize_xcom=True)
+    # Task to delete the SkyPilot cluster
+    sky_down = get_skypilot_task("sky_down",
+                                 "sky down train",
+                                 previous_task_id='sky_status',
+                                 serialize_xcom=False)
+
+    sky_launch >> sky_logs >> sky_status >> sky_down
diff --git a/examples/airflow/training_workflow/README.md b/examples/airflow/training_workflow/README.md
new file mode 100644
index 00000000000..dad08d8d3b0
--- /dev/null
+++ b/examples/airflow/training_workflow/README.md
@@ -0,0 +1,166 @@
+# Running SkyPilot tasks in Airflow
+
+
+In this guide, we show how a training workflow involving data preprocessing, training and evaluation can first be developed with SkyPilot, and then orchestrated in Airflow.
+
+
+**πŸ’‘ Tip:** SkyPilot also supports defining and running pipelines without Airflow. Check out [Jobs Pipelines](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html#job-pipelines) for more information.
+
+## Why use SkyPilot with Airflow?
+In AI workflows, **the transition from development to production is hard**.
+
+Workflow development happens ad-hoc, with a lot of interaction required
+with the code and data. When moving this to an Airflow DAG in production, managing dependencies, environments and the
+infra requirements of the workflow gets complex. Porting the code to Airflow requires significant time to test and
+validate any changes, often requiring re-writing the code as Airflow operators.
+
+**SkyPilot seamlessly bridges the dev -> production gap**.
+
+SkyPilot can operate on any of your infra, allowing you to package and run the same code that you ran during development on a
+production Airflow cluster. Behind the scenes, SkyPilot handles environment setup, dependency management, and infra orchestration, allowing you to focus on your code.
+
+Here's how you can use SkyPilot to take your dev workflows to production in Airflow:
+1. **Define and test your workflow as SkyPilot tasks**.
+    - Use `sky launch` and [Sky VSCode integration](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html#dev-vscode) to run, debug and iterate on your code.
+2. **Orchestrate SkyPilot tasks in Airflow** by invoking `sky launch` on their YAMLs as a task in the Airflow DAG.
+    - Airflow does the scheduling, logging, and monitoring, while SkyPilot handles the infra setup and task execution.
+
+
+## Prerequisites
+
+* Airflow installed on a [Kubernetes cluster](https://airflow.apache.org/docs/helm-chart/stable/index.html) or [locally](https://airflow.apache.org/docs/apache-airflow/stable/start.html) (`SequentialExecutor`)
+* A Kubernetes cluster to run tasks on. We'll use GKE in this example.
+* A Google Cloud account with GCS access to store the data for the tasks.
+  * Follow [SkyPilot instructions](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp) to set up Google Cloud credentials.
+
+## Preparing the Kubernetes Cluster
+
+1. Provision a service account on your Kubernetes cluster for SkyPilot to use to launch tasks.
+   ```bash
+   kubectl apply -f sky-sa.yaml
+   ```
+   For reference, here are the contents of `sky-sa.yaml`:
+   ```yaml
+   # sky-sa.yaml
+   apiVersion: v1
+   kind: ServiceAccount
+   metadata:
+     name: sky-airflow-sa
+     namespace: default
+   ---
+   apiVersion: rbac.authorization.k8s.io/v1
+   kind: ClusterRoleBinding
+   metadata:
+     name: sky-airflow-sa-binding
+   subjects:
+   - kind: ServiceAccount
+     name: sky-airflow-sa
+     namespace: default
+   roleRef:
+     # For minimal permissions, refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html
+     kind: ClusterRole
+     name: cluster-admin
+     apiGroup: rbac.authorization.k8s.io
+   ```
+
+2. We will store intermediate task outputs in a Google Cloud bucket. Use the following command to create a unique bucket:
+   ```bash
+   gsutil mb gs://<your-bucket-name>
+   ```
+   Take note of the bucket name, as it will be used in the task YAMLs.
+
+3. To provide SkyPilot GCP access, we will create GCP credentials as secrets that will be mounted in SkyPilot's pods. We provide a helper script `create_gcloud_secret.sh` to create the secret:
+   ```bash
+   ./create_gcloud_secret.sh
+   ```
+   You can also use other methods, such as GKE workload identity federation, to provide SkyPilot pods access to GCP credentials.
+
+## Defining the tasks
+
+We will define the following tasks to mock a training workflow:
+1. `data_preprocessing.yaml`: Generates data and writes it to a bucket.
+2. `train.yaml`: Trains a model on the data in the bucket.
+3. `eval.yaml`: Evaluates the model and writes evaluation results to the bucket.
+
+We have defined these tasks in the [mock_training_workflow](https://github.com/romilbhardwaj/mock_train_workflow) repository. Clone the repository and follow the instructions in the README to run the tasks.
+
+When developing the workflow, you can run the tasks independently using `sky launch`:
+
+```bash
+git clone https://github.com/romilbhardwaj/mock_train_workflow.git
+cd mock_train_workflow
+# Run the data preprocessing task, replacing <your-bucket-name> with the bucket you created above
+sky launch -c data --env DATA_BUCKET_URL=gs://<your-bucket-name> data_preprocessing.yaml
+```
+
+The train and eval steps can be run in a similar way:
+
+```bash
+# Run the train task
+sky launch -c train --env DATA_BUCKET_URL=gs://<your-bucket-name> train.yaml
+```
+
+Hint: You can use `ssh` and VSCode to [interactively develop](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html) and debug the tasks.
+
+Note: `eval` can be optionally run on the same cluster as `train` with `sky exec`. Refer to the `shared_state` Airflow example on how to do this.
+
+## Writing the Airflow DAG
+
+Once we have developed the tasks, we can seamlessly port them to Airflow.
+
+1. **No changes required to our tasks -** we use the same YAMLs we wrote in the previous step to create an Airflow DAG in `sky_k8s_train_pipeline.py`.
+2. **Airflow native logging** - SkyPilot logs are written to container stdout, which is captured as task logs in Airflow and displayed in the UI.
+3. **Easy debugging** - If a task fails, you can independently run the task using `sky launch` to debug the issue. SkyPilot will recreate the environment in which the task failed.
+
+Here's a snippet of the DAG declaration in `sky_k8s_train_pipeline.py`:
+```python
+with DAG(dag_id='sky_k8s_train_pipeline', ...) as dag:
+    # Make sure bucket exists with gsutil mb -l us-central1 gs://<your-bucket-name>
+    bucket_url = "gs://sky-data-demo"
+
+    # Launch data preprocessing task. We use --down to clean up the SkyPilot cluster after the task is done.
+    data_preprocess = get_skypilot_task("sky_data_preprocess",
+                                        f"sky launch -y -c data --down --cloud kubernetes --env DATA_BUCKET_URL={bucket_url} mock_train_workflow/data_preprocessing.yaml")
+
+    # Task to train the model
+    train = get_skypilot_task("sky_train",
+                              f"sky launch -y -c train --down --cloud kubernetes --env DATA_BUCKET_URL={bucket_url} mock_train_workflow/train.yaml")
+
+    # Task to evaluate the trained model. This can optionally be run on the same cluster as the training task using `sky exec`
+    eval = get_skypilot_task("sky_eval",
+                             f"sky launch -y -c eval --down --cloud kubernetes --env DATA_BUCKET_URL={bucket_url} mock_train_workflow/eval.yaml")
+
+    data_preprocess >> train >> eval
+```
+
+Behind the scenes, the `get_skypilot_task` helper uses the `KubernetesPodOperator` to run the `sky` CLI in an ephemeral pod. All clusters are set to auto-down after the task is done, so no dangling clusters are left behind.
+
+## Running the DAG
+
+1. Copy the DAG file to the Airflow DAGs directory.
+   ```bash
+   cp sky_k8s_train_pipeline.py /path/to/airflow/dags
+   # If your Airflow is running on Kubernetes, you may use kubectl cp to copy the file to the pod
+   # kubectl cp sky_k8s_train_pipeline.py <airflow-pod-name>:/opt/airflow/dags
+   ```
+2. Run `airflow dags list` to confirm that the DAG is loaded.
+3. Find the DAG in the Airflow UI (typically http://localhost:8080) and enable it. The UI may take a couple of minutes to reflect the changes.
+4. Trigger the DAG from the Airflow UI using the `Trigger DAG` button.
+5. Navigate to the run in the Airflow UI to see the DAG progress and logs of each task.
+
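+If you prefer the command line over the UI, steps 3-5 can also be driven with the standard Airflow 2.x CLI, for example:
+
+```bash
+airflow dags unpause sky_k8s_train_pipeline      # Enable the DAG.
+airflow dags trigger sky_k8s_train_pipeline      # Start a run.
+airflow dags list-runs -d sky_k8s_train_pipeline # Check run status.
+```
+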
+
+## Future work: a native Airflow Executor built on SkyPilot
+
+Currently this example relies on a helper `get_skypilot_task` method to wrap SkyPilot invocation in a `KubernetesPodOperator`, but in the future SkyPilot can
+provide a native Airflow Executor.
+
+In such a setup, SkyPilot state management would also not be required, as the executor will handle SkyPilot cluster launching and termination.
\ No newline at end of file
diff --git a/examples/airflow/training_workflow/create_gcloud_secret.sh b/examples/airflow/training_workflow/create_gcloud_secret.sh
new file mode 100755
index 00000000000..fa9e7d902a9
--- /dev/null
+++ b/examples/airflow/training_workflow/create_gcloud_secret.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+# Define variables
+GCLOUD_DIR="$HOME/.config/gcloud"
+TAR_FILE="gcloud-config.tar.gz"
+SECRET_NAME="gcloud-secret"
+
+# List of files and directories to include in the tarball
+FILES_TO_TAR=(
+  "credentials.db"
+  "access_tokens.db"
+  "configurations"
+  "legacy_credentials"
+  "active_config"
+  "application_default_credentials.json"
+)
+
+# Create a tarball with the specified files and directories
+echo "Creating tarball..."
+tar -czvf $TAR_FILE -C $GCLOUD_DIR "${FILES_TO_TAR[@]}"
+
+# Create the Kubernetes Secret using the tarball
+echo "Creating Kubernetes secret..."
+kubectl create secret generic $SECRET_NAME --from-file=gcloud-config.tar.gz=$TAR_FILE
+
+# Remove the tarball after the secret is created
+echo "Cleaning up tarball..."
+rm -f $TAR_FILE
+
+echo "Secret '$SECRET_NAME' created successfully and temporary tarball removed."
diff --git a/examples/airflow/training_workflow/sky-sa.yaml b/examples/airflow/training_workflow/sky-sa.yaml
new file mode 100644
index 00000000000..b791bafdec1
--- /dev/null
+++ b/examples/airflow/training_workflow/sky-sa.yaml
@@ -0,0 +1,18 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: sky-airflow-sa
+  namespace: default
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: sky-airflow-sa-binding
+subjects:
+- kind: ServiceAccount
+  name: sky-airflow-sa
+  namespace: default
+roleRef:
+  kind: ClusterRole
+  name: cluster-admin
+  apiGroup: rbac.authorization.k8s.io
diff --git a/examples/airflow/training_workflow/sky_k8s_train_pipeline.py b/examples/airflow/training_workflow/sky_k8s_train_pipeline.py
new file mode 100644
index 00000000000..ca00926aed9
--- /dev/null
+++ b/examples/airflow/training_workflow/sky_k8s_train_pipeline.py
@@ -0,0 +1,87 @@
+from airflow import DAG
+from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
+    KubernetesPodOperator)
+from airflow.utils.dates import days_ago
+from kubernetes.client import models as k8s
+
+default_args = {
+    'owner': 'airflow',
+    'start_date': days_ago(1),
+}
+
+
+def get_skypilot_task(task_id: str, sky_command: str):
+    INIT_COMMANDS = (
+        # Install the gcloud CLI for accessing buckets in tasks
+        'sudo conda install -y -c conda-forge google-cloud-sdk ')
+
+    # Install SkyPilot and clone the mock train workflow repo
+    # In your workflow, you can have skypilot and the code baked into the image
+    SETUP_COMMAND = (
+        "pip install skypilot-nightly[kubernetes,gcp] &&"
+        "git clone https://github.com/romilbhardwaj/mock_train_workflow.git /home/sky/mock_train_workflow"
+    )
+
+    # Command to extract the gcloud secrets tarball
+    EXTRACT_GCLOUD = (
+        "mkdir -p /home/sky/.config/gcloud && "
+        "tar -xzf /tmp/gcloud-secrets/gcloud-config.tar.gz -C /home/sky/.config/gcloud "
+    )
+
+    skypilot_task = KubernetesPodOperator(
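+        # Runs the `sky` CLI in an ephemeral pod; the gcloud secret volume
+        # mounted below provides the GCP credentials SkyPilot needs to reach
+        # the data bucket.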
task_id=task_id, + name="skypilot-pod", + namespace="default", + image= + "us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:20240613", + cmds=["/bin/bash", "-i", "-c"], + arguments=[ + f"{INIT_COMMANDS} && " + f"{EXTRACT_GCLOUD} && " + f"{SETUP_COMMAND} && " + f"{sky_command}" + ], + service_account_name="sky-airflow-sa", + env_vars={"HOME": "/home/sky"}, + volumes=[ + k8s.V1Volume( + name="gcloud-secret-volume", + secret=k8s.V1SecretVolumeSource(secret_name="gcloud-secret"), + ), + ], + volume_mounts=[ + k8s.V1VolumeMount(name="gcloud-secret-volume", + mount_path="/tmp/gcloud-secrets"), + ], + is_delete_operator_pod=True, + get_logs=True, + ) + return skypilot_task + + +with DAG(dag_id='sky_k8s_train_pipeline', + default_args=default_args, + schedule_interval=None, + catchup=False) as dag: + # Make sure bucket exists with gsutil mb -l us-central1 gs:// + bucket_url = "gs://sky-data-demo" + + # Launch data preprocessing task. We use --down to clean up the SkyPilot cluster after the task is done. + data_preprocess = get_skypilot_task( + "sky_data_preprocess", + f"sky launch -y -c data --down --cloud kubernetes --env DATA_BUCKET_URL={bucket_url} mock_train_workflow/data_preprocessing.yaml" + ) + + # Task to train the model + train = get_skypilot_task( + "sky_train", + f"sky launch -y -c train --down --cloud kubernetes --env DATA_BUCKET_URL={bucket_url} mock_train_workflow/train.yaml" + ) + + # Task to evaluate the trained model. This can optionally be run on the same cluster as the training task using `sky exec` + eval = get_skypilot_task( + "sky_eval", + f"sky launch -y -c eval --down --cloud kubernetes --env DATA_BUCKET_URL={bucket_url} mock_train_workflow/eval.yaml" + ) + + data_preprocess >> train >> eval From b96a5b42b65a7f08d41fd57e508052e9b20a2041 Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Thu, 26 Sep 2024 07:50:26 -0700 Subject: [PATCH 17/93] [UX] default to minimal logging (no module/line number/timestamp). (#3980) * [UX] default to minimal logging (no module/line number/timestamp). * Fix mypy. * Fix typing * Update sky/utils/env_options.py Co-authored-by: Tian Xia * Update sky/utils/env_options.py Co-authored-by: Tian Xia * Account for debug flag. * Remove prefixes from docs. --------- Co-authored-by: Tian Xia --- docs/source/examples/auto-failover.rst | 91 ++++++++++++++------------ sky/execution.py | 2 +- sky/optimizer.py | 7 +- sky/sky_logging.py | 9 +-- sky/utils/controller_utils.py | 4 +- sky/utils/env_options.py | 24 +++++-- 6 files changed, 77 insertions(+), 60 deletions(-) diff --git a/docs/source/examples/auto-failover.rst b/docs/source/examples/auto-failover.rst index 99ee5703738..c23f6273697 100644 --- a/docs/source/examples/auto-failover.rst +++ b/docs/source/examples/auto-failover.rst @@ -60,18 +60,22 @@ provisioner handles such a request: .. code-block:: console $ sky launch -c gpu --gpus V100 - ... # optimizer output - I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})]. - I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. 
- I 02-11 21:17:43 cloud_vm_ray_backend.py:614] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log - I 02-11 21:17:43 cloud_vm_ray_backend.py:624] - I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a) - W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + + ... + Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})]. + Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. + To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log + + Launching on GCP us-central1 (us-central1-a) + Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + ... + + Launching on GCP us-central1 (us-central1-f) + Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + ... + + Launching on GCP us-west1 (us-west1-a) ... - I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f) - W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-11 21:18:38 cloud_vm_ray_backend.py:624] - I 02-11 21:18:38 cloud_vm_ray_backend.py:624] Launching on GCP us-west1 (us-west1-a) Successfully connected to 35.230.120.87. GCP was chosen as the best cloud to run the task. There was no capacity in any of the regions in US Central, so the auto-failover provisioner moved to US West instead, allowing for our instance to be successfully provisioned. @@ -88,21 +92,24 @@ AWS, where it succeeded after two regions: .. code-block:: console $ sky launch -c v100-8 --gpus V100:8 - ... # optimizer output - I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})]. - I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. - I 02-23 16:39:59 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log - I 02-23 16:39:59 cloud_vm_ray_backend.py:668] - I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a) - W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + ... 
- I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c) - W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2: - W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. + Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})]. + Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. + To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log + + Launching on GCP us-central1 (us-central1-a) + Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + ... + + Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c) + Got error(s) in all zones of us-east-2: + create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. ... - I 02-23 16:42:26 cloud_vm_ray_backend.py:668] - I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d) - I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed. + + Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d) + ... + Successfully provisioned or found existing VM. Setup completed. Multiple Candidate GPUs @@ -125,13 +132,13 @@ A10, L4, and A10g GPUs, using :code:`sky launch task.yaml`. $ sky launch task.yaml ... 
- I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- - I 11-19 08:07:45 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- - I 11-19 08:07:45 optimizer.py:910] Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 βœ” - I 11-19 08:07:45 optimizer.py:910] GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 - I 11-19 08:07:45 optimizer.py:910] AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 - I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + ----------------------------------------------------------------------------------------------------- + Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 βœ” + GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 + AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 + ----------------------------------------------------------------------------------------------------- @@ -212,15 +219,15 @@ This will generate the following output: $ sky launch -c mycluster task.yaml ... - I 12-20 23:55:56 optimizer.py:717] - I 12-20 23:55:56 optimizer.py:840] Considered resources (1 node): - I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- - I 12-20 23:55:56 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- - I 12-20 23:55:56 optimizer.py:910] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 βœ” - I 12-20 23:55:56 optimizer.py:910] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 - I 12-20 23:55:56 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39 - I 12-20 23:55:56 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 - I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- - I 12-20 23:55:56 optimizer.py:910] + + Considered resources (1 node): + --------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + --------------------------------------------------------------------------------------------- + GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 βœ” + AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 + GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39 + AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 + --------------------------------------------------------------------------------------------- + Launching a new cluster 'mycluster'. Proceed? [Y/n]: diff --git a/sky/execution.py b/sky/execution.py index 792ca5fffc0..a2419c9ed2f 100644 --- a/sky/execution.py +++ b/sky/execution.py @@ -344,7 +344,7 @@ def _execute( # # Disable the usage collection for this status command. 
env = dict(os.environ, - **{env_options.Options.DISABLE_LOGGING.value: '1'}) + **{str(env_options.Options.DISABLE_LOGGING): '1'}) subprocess_utils.run( 'sky status --no-show-managed-jobs --no-show-services', env=env) print() diff --git a/sky/optimizer.py b/sky/optimizer.py index 4326329579d..a4ce4f39f83 100644 --- a/sky/optimizer.py +++ b/sky/optimizer.py @@ -965,10 +965,10 @@ def _print_candidates(node_to_candidate_map: _TaskToPerCloudCandidates): f'Multiple {cloud} instances satisfy ' f'{acc_name}:{int(acc_count)}. ' f'The cheapest {candidate_list[0]!r} is considered ' - f'among:\n{instance_list}.\n') + f'among:\n{instance_list}.') if is_multi_instances: logger.info( - f'To list more details, run \'sky show-gpus {acc_name}\'.') + f'To list more details, run: sky show-gpus {acc_name}\n') @staticmethod def _optimize_dag( @@ -1101,8 +1101,7 @@ def ordinal_number(n): Optimizer.print_optimized_plan(graph, topo_order, best_plan, total_time, total_cost, node_to_cost_map, minimize_cost) - if not env_options.Options.MINIMIZE_LOGGING.get(): - Optimizer._print_candidates(local_node_to_candidate_map) + Optimizer._print_candidates(local_node_to_candidate_map) return best_plan diff --git a/sky/sky_logging.py b/sky/sky_logging.py index c8a243c72cf..232fc6dd9d5 100644 --- a/sky/sky_logging.py +++ b/sky/sky_logging.py @@ -10,10 +10,11 @@ from sky.utils import env_options from sky.utils import rich_utils -# If the SKYPILOT_MINIMIZE_LOGGING environment variable is set to True, -# remove logging prefixes and unnecessary information in optimizer -_FORMAT = (None if env_options.Options.MINIMIZE_LOGGING.get() else - '%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s') +# UX: Should we show logging prefixes and some extra information in optimizer? +_show_logging_prefix = (env_options.Options.SHOW_DEBUG_INFO.get() or + not env_options.Options.MINIMIZE_LOGGING.get()) +_FORMAT = ('%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s' + if _show_logging_prefix else None) _DATE_FORMAT = '%m-%d %H:%M:%S' diff --git a/sky/utils/controller_utils.py b/sky/utils/controller_utils.py index 118f9a2b718..9bf12752174 100644 --- a/sky/utils/controller_utils.py +++ b/sky/utils/controller_utils.py @@ -380,7 +380,7 @@ def shared_controller_vars_to_fill( 'local_user_config_path': local_user_config_path, } env_vars: Dict[str, str] = { - env.value: '1' for env in env_options.Options if env.get() + str(env): str(int(env.get())) for env in env_options.Options } env_vars.update({ # Should not use $USER here, as that env var can be empty when @@ -388,7 +388,7 @@ def shared_controller_vars_to_fill( constants.USER_ENV_VAR: getpass.getuser(), constants.USER_ID_ENV_VAR: common_utils.get_user_hash(), # Skip cloud identity check to avoid the overhead. - env_options.Options.SKIP_CLOUD_IDENTITY_CHECK.value: '1', + str(env_options.Options.SKIP_CLOUD_IDENTITY_CHECK): '1', }) if skypilot_config.loaded(): # Only set the SKYPILOT_CONFIG env var if the user has a config file. 
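
The env_options.py diff that follows turns each Options member into an
(env var name, default value) pair. For readers unfamiliar with the idiom,
here is a minimal, self-contained sketch of that Enum-with-tuple-values
pattern; the names are illustrative and this is a sketch, not SkyPilot's
actual module:

    import enum
    import os


    class Options(enum.Enum):
        """Boolean environment variables with per-member defaults."""
        # Each member's tuple value is unpacked into __init__(env_var, default).
        IS_DEVELOPER = ('SKYPILOT_DEV', False)
        MINIMIZE_LOGGING = ('SKYPILOT_MINIMIZE_LOGGING', True)

        def __init__(self, env_var: str, default: bool) -> None:
            self.env_var = env_var
            self.default = default

        def __str__(self) -> str:
            # Define __str__ directly so str(Options.IS_DEVELOPER) can be used
            # as the environment variable name when building a subprocess env.
            return self.env_var

        def get(self) -> bool:
            """Whether the variable is set to a truthy value, else the default."""
            return os.getenv(self.env_var,
                             str(self.default)).lower() in ('true', '1')

With this sketch, str(Options.IS_DEVELOPER) yields 'SKYPILOT_DEV', and
Options.MINIMIZE_LOGGING.get() returns True when the variable is unset,
mirroring the per-member defaults the diff encodes.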
diff --git a/sky/utils/env_options.py b/sky/utils/env_options.py index 166bf42ce80..48855e6cbf6 100644 --- a/sky/utils/env_options.py +++ b/sky/utils/env_options.py @@ -5,17 +5,27 @@ class Options(enum.Enum): """Environment variables for SkyPilot.""" - IS_DEVELOPER = 'SKYPILOT_DEV' - SHOW_DEBUG_INFO = 'SKYPILOT_DEBUG' - DISABLE_LOGGING = 'SKYPILOT_DISABLE_USAGE_COLLECTION' - MINIMIZE_LOGGING = 'SKYPILOT_MINIMIZE_LOGGING' + + # (env var name, default value) + IS_DEVELOPER = ('SKYPILOT_DEV', False) + SHOW_DEBUG_INFO = ('SKYPILOT_DEBUG', False) + DISABLE_LOGGING = ('SKYPILOT_DISABLE_USAGE_COLLECTION', False) + MINIMIZE_LOGGING = ('SKYPILOT_MINIMIZE_LOGGING', True) # Internal: this is used to skip the cloud user identity check, which is # used to protect cluster operations in a multi-identity scenario. # Currently, this is only used in the job and serve controller, as there # will not be multiple identities, and skipping the check can increase # robustness. - SKIP_CLOUD_IDENTITY_CHECK = 'SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK' + SKIP_CLOUD_IDENTITY_CHECK = ('SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK', False) + + def __init__(self, env_var: str, default: bool) -> None: + self.env_var = env_var + self.default = default + + def __repr__(self) -> str: + return self.env_var - def get(self): + def get(self) -> bool: """Check if an environment variable is set to True.""" - return os.getenv(self.value, 'False').lower() in ('true', '1') + return os.getenv(self.env_var, + str(self.default)).lower() in ('true', '1') From a63893b0811becf9212953554c712caf9826edba Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Thu, 26 Sep 2024 08:40:12 -0700 Subject: [PATCH 18/93] Revert "[UX] default to minimal logging (no module/line number/timestamp)." (#4003) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Revert "[UX] default to minimal logging (no module/line number/timestamp). (#…" This reverts commit b96a5b42b65a7f08d41fd57e508052e9b20a2041. --- docs/source/examples/auto-failover.rst | 91 ++++++++++++-------------- sky/execution.py | 2 +- sky/optimizer.py | 7 +- sky/sky_logging.py | 9 ++- sky/utils/controller_utils.py | 4 +- sky/utils/env_options.py | 24 ++----- 6 files changed, 60 insertions(+), 77 deletions(-) diff --git a/docs/source/examples/auto-failover.rst b/docs/source/examples/auto-failover.rst index c23f6273697..99ee5703738 100644 --- a/docs/source/examples/auto-failover.rst +++ b/docs/source/examples/auto-failover.rst @@ -60,22 +60,18 @@ provisioner handles such a request: .. code-block:: console $ sky launch -c gpu --gpus V100 - - ... - Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})]. - Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. - To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log - - Launching on GCP us-central1 (us-central1-a) - Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - ... - - Launching on GCP us-central1 (us-central1-f) - Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - ... 
- - Launching on GCP us-west1 (us-west1-a) + ... # optimizer output + I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})]. + I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. + I 02-11 21:17:43 cloud_vm_ray_backend.py:614] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log + I 02-11 21:17:43 cloud_vm_ray_backend.py:624] + I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a) + W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) ... + I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f) + W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + I 02-11 21:18:38 cloud_vm_ray_backend.py:624] + I 02-11 21:18:38 cloud_vm_ray_backend.py:624] Launching on GCP us-west1 (us-west1-a) Successfully connected to 35.230.120.87. GCP was chosen as the best cloud to run the task. There was no capacity in any of the regions in US Central, so the auto-failover provisioner moved to US West instead, allowing for our instance to be successfully provisioned. @@ -92,24 +88,21 @@ AWS, where it succeeded after two regions: .. code-block:: console $ sky launch -c v100-8 --gpus V100:8 - + ... # optimizer output + I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})]. + I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. + I 02-23 16:39:59 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log + I 02-23 16:39:59 cloud_vm_ray_backend.py:668] + I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a) + W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) ... - Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})]. - Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. - To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log - - Launching on GCP us-central1 (us-central1-a) - Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - ... 
- - Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c) - Got error(s) in all zones of us-east-2: - create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. + I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c) + W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2: + W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. ... - - Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d) - ... - Successfully provisioned or found existing VM. Setup completed. + I 02-23 16:42:26 cloud_vm_ray_backend.py:668] + I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d) + I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed. Multiple Candidate GPUs @@ -132,13 +125,13 @@ A10, L4, and A10g GPUs, using :code:`sky launch task.yaml`. $ sky launch task.yaml ... - ----------------------------------------------------------------------------------------------------- - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - ----------------------------------------------------------------------------------------------------- - Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 βœ” - GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 - AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 - ----------------------------------------------------------------------------------------------------- + I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + I 11-19 08:07:45 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + I 11-19 08:07:45 optimizer.py:910] Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 βœ” + I 11-19 08:07:45 optimizer.py:910] GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 + I 11-19 08:07:45 optimizer.py:910] AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 + I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- @@ -219,15 +212,15 @@ This will generate the following output: $ sky launch -c mycluster task.yaml ... 
- - Considered resources (1 node): - --------------------------------------------------------------------------------------------- - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - --------------------------------------------------------------------------------------------- - GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 βœ” - AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 - GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39 - AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 - --------------------------------------------------------------------------------------------- - + I 12-20 23:55:56 optimizer.py:717] + I 12-20 23:55:56 optimizer.py:840] Considered resources (1 node): + I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- + I 12-20 23:55:56 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- + I 12-20 23:55:56 optimizer.py:910] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 βœ” + I 12-20 23:55:56 optimizer.py:910] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 + I 12-20 23:55:56 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39 + I 12-20 23:55:56 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 + I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- + I 12-20 23:55:56 optimizer.py:910] Launching a new cluster 'mycluster'. Proceed? [Y/n]: diff --git a/sky/execution.py b/sky/execution.py index a2419c9ed2f..792ca5fffc0 100644 --- a/sky/execution.py +++ b/sky/execution.py @@ -344,7 +344,7 @@ def _execute( # # Disable the usage collection for this status command. env = dict(os.environ, - **{str(env_options.Options.DISABLE_LOGGING): '1'}) + **{env_options.Options.DISABLE_LOGGING.value: '1'}) subprocess_utils.run( 'sky status --no-show-managed-jobs --no-show-services', env=env) print() diff --git a/sky/optimizer.py b/sky/optimizer.py index a4ce4f39f83..4326329579d 100644 --- a/sky/optimizer.py +++ b/sky/optimizer.py @@ -965,10 +965,10 @@ def _print_candidates(node_to_candidate_map: _TaskToPerCloudCandidates): f'Multiple {cloud} instances satisfy ' f'{acc_name}:{int(acc_count)}. ' f'The cheapest {candidate_list[0]!r} is considered ' - f'among:\n{instance_list}.') + f'among:\n{instance_list}.\n') if is_multi_instances: logger.info( - f'To list more details, run: sky show-gpus {acc_name}\n') + f'To list more details, run \'sky show-gpus {acc_name}\'.') @staticmethod def _optimize_dag( @@ -1101,7 +1101,8 @@ def ordinal_number(n): Optimizer.print_optimized_plan(graph, topo_order, best_plan, total_time, total_cost, node_to_cost_map, minimize_cost) - Optimizer._print_candidates(local_node_to_candidate_map) + if not env_options.Options.MINIMIZE_LOGGING.get(): + Optimizer._print_candidates(local_node_to_candidate_map) return best_plan diff --git a/sky/sky_logging.py b/sky/sky_logging.py index 232fc6dd9d5..c8a243c72cf 100644 --- a/sky/sky_logging.py +++ b/sky/sky_logging.py @@ -10,11 +10,10 @@ from sky.utils import env_options from sky.utils import rich_utils -# UX: Should we show logging prefixes and some extra information in optimizer? 
-_show_logging_prefix = (env_options.Options.SHOW_DEBUG_INFO.get() or - not env_options.Options.MINIMIZE_LOGGING.get()) -_FORMAT = ('%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s' - if _show_logging_prefix else None) +# If the SKYPILOT_MINIMIZE_LOGGING environment variable is set to True, +# remove logging prefixes and unnecessary information in optimizer +_FORMAT = (None if env_options.Options.MINIMIZE_LOGGING.get() else + '%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s') _DATE_FORMAT = '%m-%d %H:%M:%S' diff --git a/sky/utils/controller_utils.py b/sky/utils/controller_utils.py index 9bf12752174..118f9a2b718 100644 --- a/sky/utils/controller_utils.py +++ b/sky/utils/controller_utils.py @@ -380,7 +380,7 @@ def shared_controller_vars_to_fill( 'local_user_config_path': local_user_config_path, } env_vars: Dict[str, str] = { - str(env): str(int(env.get())) for env in env_options.Options + env.value: '1' for env in env_options.Options if env.get() } env_vars.update({ # Should not use $USER here, as that env var can be empty when @@ -388,7 +388,7 @@ def shared_controller_vars_to_fill( constants.USER_ENV_VAR: getpass.getuser(), constants.USER_ID_ENV_VAR: common_utils.get_user_hash(), # Skip cloud identity check to avoid the overhead. - str(env_options.Options.SKIP_CLOUD_IDENTITY_CHECK): '1', + env_options.Options.SKIP_CLOUD_IDENTITY_CHECK.value: '1', }) if skypilot_config.loaded(): # Only set the SKYPILOT_CONFIG env var if the user has a config file. diff --git a/sky/utils/env_options.py b/sky/utils/env_options.py index 48855e6cbf6..166bf42ce80 100644 --- a/sky/utils/env_options.py +++ b/sky/utils/env_options.py @@ -5,27 +5,17 @@ class Options(enum.Enum): """Environment variables for SkyPilot.""" - - # (env var name, default value) - IS_DEVELOPER = ('SKYPILOT_DEV', False) - SHOW_DEBUG_INFO = ('SKYPILOT_DEBUG', False) - DISABLE_LOGGING = ('SKYPILOT_DISABLE_USAGE_COLLECTION', False) - MINIMIZE_LOGGING = ('SKYPILOT_MINIMIZE_LOGGING', True) + IS_DEVELOPER = 'SKYPILOT_DEV' + SHOW_DEBUG_INFO = 'SKYPILOT_DEBUG' + DISABLE_LOGGING = 'SKYPILOT_DISABLE_USAGE_COLLECTION' + MINIMIZE_LOGGING = 'SKYPILOT_MINIMIZE_LOGGING' # Internal: this is used to skip the cloud user identity check, which is # used to protect cluster operations in a multi-identity scenario. # Currently, this is only used in the job and serve controller, as there # will not be multiple identities, and skipping the check can increase # robustness. - SKIP_CLOUD_IDENTITY_CHECK = ('SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK', False) - - def __init__(self, env_var: str, default: bool) -> None: - self.env_var = env_var - self.default = default - - def __repr__(self) -> str: - return self.env_var + SKIP_CLOUD_IDENTITY_CHECK = 'SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK' - def get(self) -> bool: + def get(self): """Check if an environment variable is set to True.""" - return os.getenv(self.env_var, - str(self.default)).lower() in ('true', '1') + return os.getenv(self.value, 'False').lower() in ('true', '1') From 6bb5b2a4b265012a07bb986463d11fe67fcef399 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Thu, 26 Sep 2024 09:03:44 -0700 Subject: [PATCH 19/93] [Docs] Clarify k8s private registry usage in docs (#3998) * Clarify k8s private registry auth in docs. 
* comments
---
 docs/source/examples/docker-containers.rst | 8 ++++++++
 .../reference/kubernetes/kubernetes-getting-started.rst | 4 ++++
 2 files changed, 12 insertions(+)

diff --git a/docs/source/examples/docker-containers.rst b/docs/source/examples/docker-containers.rst
index 408a53a6185..41a5e13a027 100644
--- a/docs/source/examples/docker-containers.rst
+++ b/docs/source/examples/docker-containers.rst
@@ -18,6 +18,10 @@ SkyPilot can run a container either as a task, or as the runtime environment of
 Running Containers as Tasks
 ---------------------------
 
+.. note::
+
+   On Kubernetes, running the Docker runtime in a pod is not recommended. Instead, :ref:`use your container as a runtime environment `.
+
 SkyPilot can run containerized applications directly as regular tasks. The default VM images provided by SkyPilot already have the Docker runtime pre-configured.
 
 To launch a containerized application, you can directly invoke :code:`docker run` in the :code:`run` section of your task.
@@ -173,6 +177,10 @@ Any GPUs assigned to the task will be automatically mapped to your Docker contai
 Private Registries
 ^^^^^^^^^^^^^^^^^^
 
+.. note::
+
+   These instructions do not apply if you use SkyPilot to launch on Kubernetes clusters. Instead, see :ref:`Using Images from Private Repositories in Kubernetes` for more.
+
 When using this mode, to access Docker images hosted on private registries,
 you can provide the registry authentication details using :ref:`task environment variables `:
 
diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst
index 51d8bf57565..4f87c8a6ee7 100644
--- a/docs/source/reference/kubernetes/kubernetes-getting-started.rst
+++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -119,6 +119,8 @@ Once your cluster administrator has :ref:`setup a Kubernetes cluster `_ for more.
+
+.. _kubernetes-custom-images-private-repos:
+
 Using Images from Private Repositories
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so:

From 4740ea8af05400222231b0cc3442c3a08be695b9 Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Thu, 26 Sep 2024 10:57:48 -0700
Subject: [PATCH 20/93] [Docs] Various polishing. (#4002)

* [Docs] Various polishing.

* update

* Reword.
---
 README.md                               | 29 +++++++------
 docs/source/_gallery_original/index.rst | 14 +++---
 docs/source/_static/custom.js           |  4 +-
 docs/source/docs/index.rst              |  6 +--
 llm/llama-2/README.md                   |  4 +-
 llm/llama-3/README.md                   |  4 +-
 llm/llama-3_2/README.md                 | 58 ++++++++++++-------------
 7 files changed, 59 insertions(+), 60 deletions(-)

diff --git a/README.md b/README.md
index 1f646b0e995..a5287dbb3cd 100644
--- a/README.md
+++ b/README.md
@@ -26,27 +26,27 @@
 ----
 :fire: *News* :fire:
-- [Sep, 2024] Point, Lanuch and Serve **Llama 3.2** on on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/)
-- [Sep, 2024] Run and deploy [Pixtral](./llm/pixtral), the first open-source multimodal model from Mistral AI.
-- [Jul, 2024] [Finetune](./llm/llama-3_1-finetuning/) and [serve](./llm/llama-3_1/) **Llama 3.1** on your infra
+- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/)
+- [Sep, 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI.
+- [Jul, 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra - [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) -- [Apr, 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/) -- [Apr, 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) -- [Feb, 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/) -- [Feb, 2024] Serving [**Code Llama 70B**](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/) -- [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) -- [Nov, 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) -- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) -- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) +- [Apr, 2024] Serve **Qwen-110B** on your infra: [**example**](./llm/qwen/) +- [Apr, 2024] Using **Ollama** to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) +- [Feb, 2024] Deploying and scaling **Gemma** with SkyServe: [**example**](./llm/gemma/) +- [Feb, 2024] Serving **Code Llama 70B** with vLLM and SkyServe: [**example**](./llm/codellama/) +- [Dec, 2023] **Mixtral 8x7B**, a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) +- [Nov, 2023] Using **Axolotl** to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
<details>
<summary>Archived</summary>

-
+
- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
+- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
+- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/)
- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/)
- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
- [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!
</details>

@@ -153,11 +153,12 @@ SkyPilot then performs the heavy-lifting for you, including:
 Refer to [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
 
 ## More Information
-To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest/) and [Tutorials](https://github.com/skypilot-org/skypilot-tutorial).
+To learn more, see our [documentation](https://skypilot.readthedocs.io/en/latest/), [blog](https://blog.skypilot.co/), and [community integrations](https://blog.skypilot.co/community/).
 
 Runnable examples:
 - LLMs on SkyPilot
+  - [Llama 3.2: lightweight and vision models](./llm/llama-3_2/)
   - [Pixtral](./llm/pixtral/)
   - [Llama 3.1 finetuning](./llm/llama-3_1-finetuning/) and [serving](./llm/llama-3_1/)
   - [GPT-2 via `llm.c`](./llm/gpt-2/)
@@ -203,4 +204,4 @@ We are excited to hear your feedback!
 For general discussions, join us on the [SkyPilot Slack](http://slack.skypilot.co).
 
 ## Contributing
-We welcome and value all contributions to the project! Please refer to [CONTRIBUTING](CONTRIBUTING.md) for how to get involved.
+We welcome all contributions to the project! See [CONTRIBUTING](CONTRIBUTING.md) for how to get involved.
diff --git a/docs/source/_gallery_original/index.rst b/docs/source/_gallery_original/index.rst index 8613bfb649d..e049a4ad322 100644 --- a/docs/source/_gallery_original/index.rst +++ b/docs/source/_gallery_original/index.rst @@ -34,17 +34,17 @@ Contents :maxdepth: 1 :caption: LLM Models + Vision Llama 3.2 (Meta) + Llama 3.1 (Meta) + Llama 3 (Meta) + Llama 2 (Meta) + CodeLlama (Meta) Pixtral (Mistral AI) Mixtral (Mistral AI) Mistral 7B (Mistral AI) - DBRX (Databricks) - Llama-2 (Meta) - Llama-3 (Meta) - Llama-3.1 (Meta) - Vision Llama-3.2 (Meta) - Qwen (Alibaba) - CodeLlama (Meta) + Qwen 2.5 (Alibaba) Gemma (Google) + DBRX (Databricks) .. toctree:: :maxdepth: 1 diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js index b10d157ed00..3e5653295e0 100644 --- a/docs/source/_static/custom.js +++ b/docs/source/_static/custom.js @@ -27,11 +27,11 @@ document.addEventListener('DOMContentLoaded', () => { const newItems = [ { selector: '.caption-text', text: 'SkyServe: Model Serving' }, { selector: '.toctree-l1 > a', text: 'Managed Jobs' }, - { selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' }, { selector: '.toctree-l1 > a', text: 'Pixtral (Mistral AI)' }, { selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' }, { selector: '.toctree-l1 > a', text: 'Reserved, Capacity Blocks, DWS' }, - { selector: '.toctree-l1 > a', text: 'Llama-3.2 (Meta)' }, + { selector: '.toctree-l1 > a', text: 'Llama 3.2 (Meta)' }, + { selector: '.toctree-l1 > a', text: 'Admin Policy Enforcement' }, ]; newItems.forEach(({ selector, text }) => { document.querySelectorAll(selector).forEach((el) => { diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index c219fcd5c85..6bf2d889582 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -80,12 +80,12 @@ Runnable examples: * **LLMs on SkyPilot** + * `Llama 3.2: lightweight and vision models `_ * `Pixtral `_ * `Llama 3.1 finetuning `_ and `serving `_ * `GPT-2 via llm.c `_ * `Llama 3 `_ * `Qwen `_ - * `Databricks DBRX `_ * `Gemma `_ * `Mixtral 8x7B `_; `Mistral 7B `_ (from official Mistral team) * `Code Llama `_ @@ -93,14 +93,12 @@ Runnable examples: * `SGLang: Fast and Expressive LLM Serving On the Cloud `_ (from official SGLang team) * `Vicuna chatbots: Training & Serving `_ (from official Vicuna team) * `Train your own Vicuna on Llama-2 `_ - * `Self-Hosted Llama-2 Chatbot `_ * `Ollama: Quantized LLMs on CPUs `_ * `LoRAX `_ * `QLoRA `_ * `LLaMA-LoRA-Tuner `_ * `Tabby: Self-hosted AI coding assistant `_ * `LocalGPT `_ - * `Falcon `_ * Add yours here & see more in `llm/ `_! * Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_, `llm.c `__, `Airflow `_ and `many more `_. @@ -202,7 +200,7 @@ Read the research: ../cloud-setup/cloud-auth ../cloud-setup/quota ../cloud-setup/policy - + .. toctree:: :hidden: :maxdepth: 1 diff --git a/llm/llama-2/README.md b/llm/llama-2/README.md index 4f1a8f60cae..53197c431ce 100644 --- a/llm/llama-2/README.md +++ b/llm/llama-2/README.md @@ -1,7 +1,7 @@ -# Self-Hosted Llama-2 Chatbot on Any Cloud +# Self-Hosted Llama 2 Chatbot on Any Cloud - + [Llama-2](https://github.com/facebookresearch/llama/tree/main) is the top open-source models on the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) today. It has been released with a license that authorizes commercial use. 
You can deploy a private Llama-2 chatbot with SkyPilot in your own cloud with just one simple command. diff --git a/llm/llama-3/README.md b/llm/llama-3/README.md index ae5c10dc62b..8ffcb3087a9 100644 --- a/llm/llama-3/README.md +++ b/llm/llama-3/README.md @@ -1,7 +1,7 @@ -# Scale Serving Llama-3 on Any Cloud or Kubernetes with SkyPilot +# Scale Serving Llama 3 on Any Cloud or Kubernetes with SkyPilot - +

diff --git a/llm/llama-3_2/README.md b/llm/llama-3_2/README.md index 8e4b9820a88..eb62071471d 100644 --- a/llm/llama-3_2/README.md +++ b/llm/llama-3_2/README.md @@ -2,7 +2,7 @@ # Point, Launch, and Serve Vision Llama 3.2 on Kubernetes or Any Cloud - + [Llama 3.2](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) family was released by Meta on Sep 25, 2024. It not only includes the latest improved (and smaller) LLM models for chat, but also includes multimodal vision-language models. Let's _point and launch_ it with SkyPilot. @@ -90,22 +90,22 @@ $ HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN ```console ... ------------------------------------------------------------------------------------------------------------------ - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN ------------------------------------------------------------------------------------------------------------------ - Kubernetes 4CPU--16GB--1L4 4 16 L4:1 kubernetes 0.00 βœ” - RunPod 1x_L4_SECURE 4 24 L4:1 CA 0.44 - GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 - AWS g6.xlarge 4 16 L4:1 us-east-1 0.80 - AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 - RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14 - Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15 - AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86 - Cudo sapphire-rapids-h100_1x4v8gb 4 8 H100:1 ca-montreal-3 2.86 - Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89 - Azure Standard_NV36ads_A10_v5 36 440 A10:1 eastus 3.20 - GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67 - RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49 - Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98 + Kubernetes 4CPU--16GB--1L4 4 16 L4:1 kubernetes 0.00 βœ” + RunPod 1x_L4_SECURE 4 24 L4:1 CA 0.44 + GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 + AWS g6.xlarge 4 16 L4:1 us-east-1 0.80 + AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 + RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14 + Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15 + AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86 + Cudo sapphire-rapids-h100_1x4v8gb 4 8 H100:1 ca-montreal-3 2.86 + Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89 + Azure Standard_NV36ads_A10_v5 36 440 A10:1 eastus 3.20 + GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67 + RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49 + Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98 ------------------------------------------------------------------------------------------------------------------ ``` @@ -185,20 +185,20 @@ $ HF_TOKEN=xxx sky launch llama3_2-vision-11b.yaml -c llama3_2-vision --env HF_T ```console ------------------------------------------------------------------------------------------------------------------ - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN ------------------------------------------------------------------------------------------------------------------ - Kubernetes 2CPU--8GB--1H100 2 8 H100:1 kubernetes 0.00 βœ” - RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14 - Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15 - AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86 - RunPod 1x_A100-80GB_SECURE 8 80 A100-80GB:1 CA 1.99 - Cudo sapphire-rapids-h100_1x2v4gb 2 4 H100:1 ca-montreal-3 2.83 - Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89 - GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67 - Azure Standard_NC24ads_A100_v4 24 220 A100-80GB:1 eastus 3.67 - RunPod 
1x_H100_SECURE 16 80 H100:1 CA 4.49 - GCP a2-ultragpu-1g 12 170 A100-80GB:1 us-central1-a 5.03 - Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98 + Kubernetes 2CPU--8GB--1H100 2 8 H100:1 kubernetes 0.00 βœ” + RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14 + Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15 + AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86 + RunPod 1x_A100-80GB_SECURE 8 80 A100-80GB:1 CA 1.99 + Cudo sapphire-rapids-h100_1x2v4gb 2 4 H100:1 ca-montreal-3 2.83 + Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89 + GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67 + Azure Standard_NC24ads_A100_v4 24 220 A100-80GB:1 eastus 3.67 + RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49 + GCP a2-ultragpu-1g 12 170 A100-80GB:1 us-central1-a 5.03 + Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98 ------------------------------------------------------------------------------------------------------------------ ``` From 4e46cf4e8e90a18ea2b71fbd94d7a69c689ffb9e Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 26 Sep 2024 16:11:15 -0700 Subject: [PATCH 21/93] [k8s] Enable multiple kubernetes contexts for failover (#3968) * wip * Fix * format * format * Fix context and namespace used * update * fix * Fix feasibility check * fix image for k8s * patch k8s tests * format * format * format * Fix tests * avoid -s * Fix acc detection * format * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj * refactor a little * Add docs for k8s context update * Use all pods in a context * Add policy * Fix unsupported features and other kube calls * Add policies * Fix backward compatbility * Add smoke test * set * fix typing * Add check for local k8s cluster in smoke test * Add skypilot config * Fix smoke * Make loging log once * format * format --------- Co-authored-by: Romil Bhardwaj --- .github/workflows/pytest.yml | 2 +- docs/source/cloud-setup/policy.rst | 16 ++ docs/source/reference/config.rst | 13 ++ .../dynamic_kubernetes_contexts_update.yaml | 1 + .../example_policy/example_policy/__init__.py | 1 + .../example_policy/skypilot_policy.py | 46 ++++ sky/adaptors/kubernetes.py | 18 +- sky/authentication.py | 16 +- sky/backends/cloud_vm_ray_backend.py | 15 +- sky/cli.py | 20 +- sky/clouds/kubernetes.py | 206 ++++++++++++++---- sky/clouds/oci.py | 19 +- .../service_catalog/kubernetes_catalog.py | 22 +- sky/provision/kubernetes/network.py | 48 ++-- sky/provision/kubernetes/network_utils.py | 12 +- sky/provision/kubernetes/utils.py | 184 +++++++++++----- sky/templates/kubernetes-ray.yml.j2 | 2 +- sky/utils/schemas.py | 6 + tests/common.py | 13 ++ tests/test_smoke.py | 86 ++++++++ tests/unit_tests/test_admin_policy.py | 18 ++ 21 files changed, 599 insertions(+), 165 deletions(-) create mode 100644 examples/admin_policy/dynamic_kubernetes_contexts_update.yaml diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml index 3faf75acf8d..757bfec36d2 100644 --- a/.github/workflows/pytest.yml +++ b/.github/workflows/pytest.yml @@ -57,4 +57,4 @@ jobs: pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0 - name: Run tests with pytest - run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 1 --dist no ${{ matrix.test-path }} + run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 0 --dist no ${{ matrix.test-path }} diff --git a/docs/source/cloud-setup/policy.rst b/docs/source/cloud-setup/policy.rst index 0d3e3444372..288eb79ed53 100644 --- a/docs/source/cloud-setup/policy.rst +++ 
b/docs/source/cloud-setup/policy.rst
@@ -13,6 +13,7 @@ Example usage:
 
 - :ref:`disable-public-ip-policy`
 - :ref:`use-spot-for-gpu-policy`
 - :ref:`enforce-autostop-policy`
+- :ref:`dynamic-kubernetes-contexts-update-policy`
 
 To implement and use an admin policy:
 
@@ -193,3 +194,18 @@ Enforce Autostop for all Tasks
 .. literalinclude:: ../../../examples/admin_policy/enforce_autostop.yaml
    :language: yaml
    :caption: `Config YAML for using EnforceAutostopPolicy `_
+
+
+.. _dynamic-kubernetes-contexts-update-policy:
+
+Dynamically Update Kubernetes Contexts to Use
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. literalinclude:: ../../../examples/admin_policy/example_policy/example_policy/skypilot_policy.py
+   :language: python
+   :pyobject: DynamicKubernetesContextsUpdatePolicy
+   :caption: `DynamicKubernetesContextsUpdatePolicy `_
+
+.. literalinclude:: ../../../examples/admin_policy/dynamic_kubernetes_contexts_update.yaml
+   :language: yaml
+   :caption: `Config YAML for using DynamicKubernetesContextsUpdatePolicy `_
diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst
index ebe8db6751f..5c52e7487b9 100644
--- a/docs/source/reference/config.rst
+++ b/docs/source/reference/config.rst
@@ -495,6 +495,19 @@ Available fields and semantics:
   # Default: 'SERVICE_ACCOUNT'.
   remote_identity: my-k8s-service-account
 
+  # Allowed context names to use for Kubernetes clusters (optional).
+  #
+  # SkyPilot will try provisioning and failing over Kubernetes contexts in the
+  # same order as they are specified here. E.g., SkyPilot will try using
+  # context1 first. If it is out of resources or unreachable, it will fail
+  # over and try context2.
+  #
+  # If not specified, only the current active context is used for launching
+  # new clusters.
+  allowed_contexts:
+    - context1
+    - context2
+
   # Attach custom metadata to Kubernetes objects created by SkyPilot
   #
   # Uses the same schema as Kubernetes metadata object: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#objectmeta-v1-meta
diff --git a/examples/admin_policy/dynamic_kubernetes_contexts_update.yaml b/examples/admin_policy/dynamic_kubernetes_contexts_update.yaml
new file mode 100644
index 00000000000..ac6b702d251
--- /dev/null
+++ b/examples/admin_policy/dynamic_kubernetes_contexts_update.yaml
@@ -0,0 +1 @@
+admin_policy: example_policy.DynamicKubernetesContextsUpdatePolicy
diff --git a/examples/admin_policy/example_policy/example_policy/__init__.py b/examples/admin_policy/example_policy/example_policy/__init__.py
index 12ca4e952e2..4a56f04e986 100644
--- a/examples/admin_policy/example_policy/example_policy/__init__.py
+++ b/examples/admin_policy/example_policy/example_policy/__init__.py
@@ -1,6 +1,7 @@
 """Example admin policy module and prebuilt policies."""
 from example_policy.skypilot_policy import AddLabelsPolicy
 from example_policy.skypilot_policy import DisablePublicIpPolicy
+from example_policy.skypilot_policy import DynamicKubernetesContextsUpdatePolicy
 from example_policy.skypilot_policy import EnforceAutostopPolicy
 from example_policy.skypilot_policy import RejectAllPolicy
 from example_policy.skypilot_policy import UseSpotForGpuPolicy
diff --git a/examples/admin_policy/example_policy/example_policy/skypilot_policy.py b/examples/admin_policy/example_policy/example_policy/skypilot_policy.py
index dc4e4b873fb..7addcffbe3c 100644
--- a/examples/admin_policy/example_policy/example_policy/skypilot_policy.py
+++ b/examples/admin_policy/example_policy/example_policy/skypilot_policy.py
@@ -1,4 +1,6 @@
 """Example prebuilt 
admin policies.""" +import subprocess + import sky @@ -119,3 +121,47 @@ def validate_and_mutate( return sky.MutatedUserRequest( task=user_request.task, skypilot_config=user_request.skypilot_config) + + +def update_current_kubernetes_clusters_from_registry(): + """Mock implementation of updating kubernetes clusters from registry.""" + # All cluster names can be fetched from an organization's internal API. + NEW_CLUSTER_NAMES = ['my-cluster'] + for cluster_name in NEW_CLUSTER_NAMES: + # Update the local kubeconfig with the new cluster credentials. + subprocess.run( + f'gcloud container clusters get-credentials {cluster_name} ' + '--region us-central1-c', + shell=True, + check=False) + + +def get_allowed_contexts(): + """Mock implementation of getting allowed kubernetes contexts.""" + from sky.provision.kubernetes import utils + contexts = utils.get_all_kube_config_context_names() + return contexts[:2] + + +class DynamicKubernetesContextsUpdatePolicy(sky.AdminPolicy): + """Example policy: update the kubernetes context to use.""" + + @classmethod + def validate_and_mutate( + cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest: + """Updates the kubernetes context to use.""" + # Append any new kubernetes clusters in local kubeconfig. An example + # implementation of this method can be: + # 1. Query an organization's internal Kubernetes cluster registry, + # which can be some internal API, or a secret vault. + # 2. Append the new credentials to the local kubeconfig. + update_current_kubernetes_clusters_from_registry() + # Get the allowed contexts for the user. Similarly, it can retrieve + # the latest allowed contexts from an organization's internal API. + allowed_contexts = get_allowed_contexts() + + # Update the kubernetes allowed contexts in skypilot config. + config = user_request.skypilot_config + config.set_nested(('kubernetes', 'allowed_contexts'), allowed_contexts) + return sky.MutatedUserRequest(task=user_request.task, + skypilot_config=config) diff --git a/sky/adaptors/kubernetes.py b/sky/adaptors/kubernetes.py index 489e62a5158..ea8fb194efa 100644 --- a/sky/adaptors/kubernetes.py +++ b/sky/adaptors/kubernetes.py @@ -75,15 +75,17 @@ def _load_config(context: Optional[str] = None): suffix += f' Error: {str(e)}' # Check if exception was due to no current-context if 'Expected key current-context' in str(e): - err_str = ('Failed to load Kubernetes configuration. ' - 'Kubeconfig does not contain any valid context(s).' - f'{suffix}\n' - ' If you were running a local Kubernetes ' - 'cluster, run `sky local up` to start the cluster.') + err_str = ( + f'Failed to load Kubernetes configuration for {context!r}. ' + 'Kubeconfig does not contain any valid context(s).' + f'{suffix}\n' + ' If you were running a local Kubernetes ' + 'cluster, run `sky local up` to start the cluster.') else: - err_str = ('Failed to load Kubernetes configuration. ' - 'Please check if your kubeconfig file exists at ' - f'~/.kube/config and is valid.{suffix}') + err_str = ( + f'Failed to load Kubernetes configuration for {context!r}. ' + 'Please check if your kubeconfig file exists at ' + f'~/.kube/config and is valid.{suffix}') err_str += '\nTo disable Kubernetes for SkyPilot: run `sky check`.' 
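        # Raising inside print_exception_no_traceback() surfaces only the
        # actionable message assembled above, without a full stack trace.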
with ux_utils.print_exception_no_traceback(): raise ValueError(err_str) from None diff --git a/sky/authentication.py b/sky/authentication.py index 4a37cbd2373..67b4bcd576f 100644 --- a/sky/authentication.py +++ b/sky/authentication.py @@ -378,11 +378,11 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]: public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH) secret_name = clouds.Kubernetes.SKY_SSH_KEY_SECRET_NAME secret_field_name = clouds.Kubernetes().ssh_key_secret_field_name - namespace = config['provider'].get( - 'namespace', - kubernetes_utils.get_current_kube_config_context_namespace()) context = config['provider'].get( 'context', kubernetes_utils.get_current_kube_config_context_name()) + namespace = config['provider'].get( + 'namespace', + kubernetes_utils.get_kube_config_context_namespace(context)) k8s = kubernetes.kubernetes with open(public_key_path, 'r', encoding='utf-8') as f: public_key = f.read() @@ -425,8 +425,8 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]: ssh_jump_name, nodeport_mode, private_key_path=private_key_path, - namespace=namespace, - context=context) + context=context, + namespace=namespace) elif network_mode == port_forward_mode: # Using `kubectl port-forward` creates a direct tunnel to the pod and # does not require a ssh jump pod. @@ -441,7 +441,11 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]: # on GKE. ssh_target = config['cluster_name'] + '-head' ssh_proxy_cmd = kubernetes_utils.get_ssh_proxy_command( - ssh_target, port_forward_mode, private_key_path=private_key_path) + ssh_target, + port_forward_mode, + private_key_path=private_key_path, + context=context, + namespace=namespace) else: # This should never happen because we check for this in from_str above. raise ValueError(f'Unsupported networking mode: {network_mode_str}') diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index e580b9ba550..4d6e0eb4fb7 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -2082,7 +2082,7 @@ class CloudVmRayResourceHandle(backends.backend.ResourceHandle): """ # Bump if any fields get added/removed/changed, and add backward # compaitibility logic in __setstate__. - _VERSION = 8 + _VERSION = 9 def __init__( self, @@ -2516,6 +2516,19 @@ def __setstate__(self, state): if version < 8: self.cached_cluster_info = None + if version < 9: + # For backward compatibility, we should update the region of a + # SkyPilot cluster on Kubernetes to the actual context it is using. 
+ # pylint: disable=import-outside-toplevel + launched_resources = state['launched_resources'] + if isinstance(launched_resources.cloud, clouds.Kubernetes): + yaml_config = common_utils.read_yaml( + os.path.expanduser(state['_cluster_yaml'])) + context = kubernetes_utils.get_context_from_config( + yaml_config['provider']) + state['launched_resources'] = launched_resources.copy( + region=context) + self.__dict__.update(state) # Because the update_cluster_ips and update_ssh_ports diff --git a/sky/cli.py b/sky/cli.py index eb0267f7ced..f334a4181b8 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -3026,14 +3026,11 @@ def show_gpus( kubernetes_is_enabled = sky_clouds.cloud_in_iterable( sky_clouds.Kubernetes(), global_user_state.get_cached_enabled_clouds()) - if cloud_is_kubernetes and region is not None: - raise click.UsageError( - 'The --region flag cannot be set with --cloud kubernetes.') - def _list_to_str(lst): return ', '.join([str(e) for e in lst]) def _get_kubernetes_realtime_gpu_table( + context: Optional[str] = None, name_filter: Optional[str] = None, quantity_filter: Optional[int] = None): if quantity_filter: @@ -3048,7 +3045,7 @@ def _get_kubernetes_realtime_gpu_table( gpus_only=True, clouds='kubernetes', name_filter=name_filter, - region_filter=region, + region_filter=context, quantity_filter=quantity_filter, case_sensitive=False) assert (set(counts.keys()) == set(capacity.keys()) == set( @@ -3078,11 +3075,11 @@ def _get_kubernetes_realtime_gpu_table( ]) return realtime_gpu_table - def _get_kubernetes_node_info_table(): + def _get_kubernetes_node_info_table(context: Optional[str]): node_table = log_utils.create_table( ['NODE_NAME', 'GPU_NAME', 'TOTAL_GPUS', 'FREE_GPUS']) - node_info_dict = kubernetes_utils.get_kubernetes_node_info() + node_info_dict = kubernetes_utils.get_kubernetes_node_info(context) for node_name, node_info in node_info_dict.items(): node_table.add_row([ node_name, node_info.gpu_type, @@ -3116,11 +3113,13 @@ def _output(): print_section_titles = False # If cloud is kubernetes, we want to show real-time capacity if kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes): + context = region try: # If --cloud kubernetes is not specified, we want to catch # the case where no GPUs are available on the cluster and # print the warning at the end. 
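                # (`context` comes from the --region flag; None queries the
                # currently active Kubernetes context.)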
- k8s_realtime_table = _get_kubernetes_realtime_gpu_table() + k8s_realtime_table = _get_kubernetes_realtime_gpu_table( + context) except ValueError as e: if not cloud_is_kubernetes: # Make it a note if cloud is not kubernetes @@ -3129,9 +3128,10 @@ def _output(): else: print_section_titles = True yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + f'Kubernetes GPUs (Context: {context})' + f'{colorama.Style.RESET_ALL}\n') yield from k8s_realtime_table.get_string() - k8s_node_table = _get_kubernetes_node_info_table() + k8s_node_table = _get_kubernetes_node_info_table(context) yield '\n\n' yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' f'Kubernetes per node GPU availability' diff --git a/sky/clouds/kubernetes.py b/sky/clouds/kubernetes.py index 47f8a435ebb..2c1e753bccf 100644 --- a/sky/clouds/kubernetes.py +++ b/sky/clouds/kubernetes.py @@ -1,4 +1,5 @@ """Kubernetes.""" +import functools import json import os import re @@ -52,8 +53,7 @@ class Kubernetes(clouds.Cloud): _DEFAULT_MEMORY_CPU_RATIO = 1 _DEFAULT_MEMORY_CPU_RATIO_WITH_GPU = 4 # Allocate more memory for GPU tasks _REPR = 'Kubernetes' - _SINGLETON_REGION = 'kubernetes' - _regions: List[clouds.Region] = [clouds.Region(_SINGLETON_REGION)] + _LEGACY_SINGLETON_REGION = 'kubernetes' _CLOUD_UNSUPPORTED_FEATURES = { # TODO(romilb): Stopping might be possible to implement with # container checkpointing introduced in Kubernetes v1.25. See: @@ -88,8 +88,12 @@ def _unsupported_features_for_resources( cls, resources: 'resources_lib.Resources' ) -> Dict[clouds.CloudImplementationFeatures, str]: unsupported_features = cls._CLOUD_UNSUPPORTED_FEATURES.copy() + context = resources.region + if context is None: + context = kubernetes_utils.get_current_kube_config_context_name() # Features to be disabled for exec auth - is_exec_auth, message = kubernetes_utils.is_kubeconfig_exec_auth() + is_exec_auth, message = kubernetes_utils.is_kubeconfig_exec_auth( + context) if is_exec_auth: assert isinstance(message, str), message # Controllers cannot spin up new pods with exec auth. @@ -99,7 +103,7 @@ def _unsupported_features_for_resources( unsupported_features[ clouds.CloudImplementationFeatures.AUTO_TERMINATE] = message # Allow spot instances if supported by the cluster - spot_label_key, _ = kubernetes_utils.get_spot_label() + spot_label_key, _ = kubernetes_utils.get_spot_label(context) if spot_label_key is not None: unsupported_features.pop( clouds.CloudImplementationFeatures.SPOT_INSTANCE, None) @@ -110,16 +114,87 @@ def max_cluster_name_length(cls) -> Optional[int]: return cls._MAX_CLUSTER_NAME_LEN_LIMIT @classmethod - def regions(cls) -> List[clouds.Region]: - return cls._regions + @functools.lru_cache(maxsize=1) + def _log_skipped_contexts_once(cls, skipped_contexts: Tuple[str, + ...]) -> None: + """Log skipped contexts for only once. + + We don't directly cache the result of _filter_existing_allowed_contexts + as the admin policy may update the allowed contexts. + """ + if skipped_contexts: + logger.warning( + f'Kubernetes contexts {set(skipped_contexts)!r} specified in ' + '"allowed_contexts" not found in kubeconfig. 
' + 'Ignoring these contexts.') + + @classmethod + def _existing_allowed_contexts(cls) -> List[str]: + """Get existing allowed contexts.""" + all_contexts = kubernetes_utils.get_all_kube_config_context_names() + if all_contexts is None: + return [] + all_contexts = set(all_contexts) + + allowed_contexts = skypilot_config.get_nested( + ('kubernetes', 'allowed_contexts'), None) + + if allowed_contexts is None: + current_context = ( + kubernetes_utils.get_current_kube_config_context_name()) + allowed_contexts = [] + if current_context is not None: + allowed_contexts = [current_context] + + existing_contexts = [] + skipped_contexts = [] + for context in allowed_contexts: + if context in all_contexts: + existing_contexts.append(context) + else: + skipped_contexts.append(context) + cls._log_skipped_contexts_once(tuple(skipped_contexts)) + return existing_contexts @classmethod def regions_with_offering(cls, instance_type: Optional[str], accelerators: Optional[Dict[str, int]], use_spot: bool, region: Optional[str], zone: Optional[str]) -> List[clouds.Region]: - # No notion of regions in Kubernetes - return a single region. - return cls.regions() + del accelerators, zone, use_spot # unused + existing_contexts = cls._existing_allowed_contexts() + + regions = [clouds.Region(context) for context in existing_contexts] + + if region is not None: + regions = [r for r in regions if r.name == region] + + # Check if requested instance type will fit in the cluster. + # TODO(zhwu,romilb): autoscaler type needs to be regional (per + # kubernetes cluster/context). + regions_to_return = [] + autoscaler_type = kubernetes_utils.get_autoscaler_type() + if autoscaler_type is None and instance_type is not None: + # If autoscaler is not set, check if the instance type fits in the + # cluster. Else, rely on the autoscaler to provision the right + # instance type without running checks. Worst case, if autoscaling + # fails, the pod will be stuck in pending state until + # provision_timeout, after which failover will be triggered. + for r in regions: + context = r.name + fits, reason = kubernetes_utils.check_instance_fits( + context, instance_type) + if fits: + regions_to_return.append(r) + else: + logger.debug( + f'Instance type {instance_type} does ' + 'not fit in the Kubernetes cluster with context: ' + f'{context}. Reason: {reason}') + else: + regions_to_return = regions + + return regions_to_return def instance_type_to_hourly_cost(self, instance_type: str, @@ -201,9 +276,9 @@ def zones_provision_loop( accelerators: Optional[Dict[str, int]] = None, use_spot: bool = False, ) -> Iterator[Optional[List[clouds.Zone]]]: - del num_nodes, region, instance_type, accelerators, use_spot # Unused. - for r in cls.regions(): - yield r.zones + # Always yield None for zones, since Kubernetes does not have zones, and + # we should allow any region get to this point. + yield None @classmethod def get_zone_shell_cmd(cls) -> Optional[str]: @@ -225,7 +300,10 @@ def make_deploy_resources_variables( dryrun: bool = False) -> Dict[str, Optional[str]]: del cluster_name, zones, dryrun # Unused. 
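        # When provided, `region` is a clouds.Region whose name is the
        # Kubernetes context chosen by the optimizer; otherwise fall back to
        # the currently active kubeconfig context.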
if region is None: - region = self._regions[0] + context = kubernetes_utils.get_current_kube_config_context_name() + else: + context = region.name + assert context is not None, 'No context found in kubeconfig' r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) @@ -244,9 +322,14 @@ def make_deploy_resources_variables( acc_count = k.accelerator_count if k.accelerator_count else 0 acc_type = k.accelerator_type if k.accelerator_type else None - if resources.image_id is not None: + image_id_dict = resources.image_id + if image_id_dict is not None: # Use custom image specified in resources - image_id = resources.image_id['kubernetes'] + if None in image_id_dict: + image_id = image_id_dict[None] + else: + assert resources.region in image_id_dict, image_id_dict + image_id = image_id_dict[resources.region] if image_id.startswith('docker:'): image_id = image_id[len('docker:'):] else: @@ -265,7 +348,7 @@ def make_deploy_resources_variables( # If GPUs are requested, set node label to match the GPU type. if acc_count > 0 and acc_type is not None: k8s_acc_label_key, k8s_acc_label_value = \ - kubernetes_utils.get_gpu_label_key_value(acc_type) + kubernetes_utils.get_gpu_label_key_value(context, acc_type) port_mode = network_utils.get_port_mode(None) @@ -309,13 +392,10 @@ def make_deploy_resources_variables( deploy_vars = { 'instance_type': resources.instance_type, 'custom_resources': custom_resources, - 'region': region.name, 'cpus': str(cpus), 'memory': str(mem), 'accelerator_count': str(acc_count), 'timeout': str(timeout), - 'k8s_namespace': - kubernetes_utils.get_current_kube_config_context_namespace(), 'k8s_port_mode': port_mode.value, 'k8s_networking_mode': network_utils.get_networking_mode().value, 'k8s_ssh_key_secret_name': self.SKY_SSH_KEY_SECRET_NAME, @@ -335,18 +415,30 @@ def make_deploy_resources_variables( # Add kubecontext if it is set. It may be None if SkyPilot is running # inside a pod with in-cluster auth. - curr_context = kubernetes_utils.get_current_kube_config_context_name() - if curr_context is not None: - deploy_vars['k8s_context'] = curr_context + if context is not None: + deploy_vars['k8s_context'] = context + + namespace = kubernetes_utils.get_kube_config_context_namespace(context) + deploy_vars['k8s_namespace'] = namespace return deploy_vars def _get_feasible_launchable_resources( self, resources: 'resources_lib.Resources' ) -> 'resources_utils.FeasibleResources': + # TODO(zhwu): This needs to be updated to return the correct region + # (context) that has enough resources. fuzzy_candidate_list: List[str] = [] if resources.instance_type is not None: assert resources.is_launchable(), resources + regions = self.regions_with_offering( + resources.instance_type, + accelerators=resources.accelerators, + use_spot=resources.use_spot, + region=resources.region, + zone=resources.zone) + if not regions: + return resources_utils.FeasibleResources([], [], None) resources = resources.copy(accelerators=None) return resources_utils.FeasibleResources([resources], fuzzy_candidate_list, None) @@ -391,34 +483,48 @@ def _make(instance_list): kubernetes_utils.KubernetesInstanceType.from_resources( gpu_task_cpus, gpu_task_memory, acc_count, acc_type).name) - # Check if requested instance type will fit in the cluster. - autoscaler_type = kubernetes_utils.get_autoscaler_type() - if autoscaler_type is None: - # If autoscaler is not set, check if the instance type fits in the - # cluster. 
Else, rely on the autoscaler to provision the right
-            # instance type without running checks. Worst case, if autoscaling
-            # fails, the pod will be stuck in pending state until
-            # provision_timeout, after which failover will be triggered.
-            fits, reason = kubernetes_utils.check_instance_fits(
-                chosen_instance_type)
-            if not fits:
-                logger.debug(f'Instance type {chosen_instance_type} does '
-                             'not fit in the Kubernetes cluster. '
-                             f'Reason: {reason}')
-                return resources_utils.FeasibleResources([], [], reason)
-
+        # Check the availability of the specified instance type in all contexts.
+        available_regions = self.regions_with_offering(
+            chosen_instance_type,
+            accelerators=None,
+            use_spot=resources.use_spot,
+            region=resources.region,
+            zone=resources.zone)
+        if not available_regions:
+            return resources_utils.FeasibleResources([], [], None)
         # No fuzzy lists for Kubernetes
+        # We don't set the resources returned with regions, because the
+        # optimizer will further find the valid region (context) for the
+        # resources.
         return resources_utils.FeasibleResources(_make([chosen_instance_type]),
                                                  [], None)
 
     @classmethod
     def check_credentials(cls) -> Tuple[bool, Optional[str]]:
         # Test using python API
-        try:
-            return kubernetes_utils.check_credentials()
-        except Exception as e:  # pylint: disable=broad-except
-            return (False, 'Credential check failed: '
-                    f'{common_utils.format_exception(e)}')
+        existing_allowed_contexts = cls._existing_allowed_contexts()
+        if not existing_allowed_contexts:
+            if skypilot_config.loaded_config_path() is None:
+                check_skypilot_config_msg = ''
+            else:
+                check_skypilot_config_msg = (
+                    ' and check "allowed_contexts" in your '
+                    f'{skypilot_config.loaded_config_path()} file.')
+            return (False, 'No available context found in kubeconfig. '
+                    'Check if you have a valid kubeconfig file' +
+                    check_skypilot_config_msg)
+        reasons = []
+        for context in existing_allowed_contexts:
+            try:
+                check_result = kubernetes_utils.check_credentials(context)
+                if check_result[0]:
+                    return check_result
+                reasons.append(f'{context}: {check_result[1]}')
+            except Exception as e:  # pylint: disable=broad-except
+                return (False, f'Credential check failed for {context}: '
+                        f'{common_utils.format_exception(e)}')
+        return (False, 'Failed to find available context with working '
+                'credentials. Details:\n' + '\n'.join(reasons))
 
     def get_credential_file_mounts(self) -> Dict[str, str]:
         if os.path.exists(os.path.expanduser(CREDENTIAL_PATH)):
@@ -433,10 +539,20 @@ def instance_type_exists(self, instance_type: str) -> bool:
             instance_type)
 
     def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
-        if region != self._SINGLETON_REGION:
+        if region == self._LEGACY_SINGLETON_REGION:
+            # For backward compatibility, we allow the region to be set to the
+            # legacy singleton region.
+            # TODO: Remove this after 0.9.0.
+            return region, zone
+
+        all_contexts = kubernetes_utils.get_all_kube_config_context_names()
+        if all_contexts is None:
+            all_contexts = []
+        if region not in all_contexts:
             raise ValueError(
-                'Kubernetes support does not support setting region.'
-                ' Cluster used is determined by the kubeconfig.')
+                f'Context {region} not found in kubeconfig. Kubernetes only '
+                'supports context names as regions. Available '
+                f'contexts: {all_contexts}')
         if zone is not None:
             raise ValueError('Kubernetes support does not support setting zone.'
                              ' Cluster used is determined by the kubeconfig.')
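The reworked `check_credentials` above scans every allowed context and succeeds on the first working one, aggregating failure reasons otherwise. A minimal standalone sketch of that pattern (`check_one` is a hypothetical stand-in for the real per-context check, not SkyPilot's API):

```python
from typing import Callable, List, Optional, Tuple

def scan_contexts(
    contexts: List[str],
    check_one: Callable[[str], Tuple[bool, Optional[str]]],
) -> Tuple[bool, Optional[str]]:
    """Return success if any context has working credentials."""
    reasons = []
    for context in contexts:
        ok, reason = check_one(context)
        if ok:
            # One working context is enough to enable the cloud.
            return True, None
        reasons.append(f'{context}: {reason}')
    return False, 'No working context. Details:\n' + '\n'.join(reasons)

# Example: the first context fails, the second succeeds.
print(scan_contexts(
    ['kind-a', 'kind-b'],
    lambda ctx: (ctx == 'kind-b', None if ctx == 'kind-b' else 'timeout')))
```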
diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py
index be75b002044..56dd60f8044 100644
--- a/sky/clouds/oci.py
+++ b/sky/clouds/oci.py
@@ -431,14 +431,17 @@ def check_disk_tier(
 
     def get_credential_file_mounts(self) -> Dict[str, str]:
         """Returns a dict of credential file paths to mount paths."""
-        oci_cfg_file = oci_adaptor.get_config_file()
-        # Pass-in a profile parameter so that multiple profile in oci
-        # config file is supported (2023/06/09).
-        oci_cfg = oci_adaptor.get_oci_config(
-            profile=oci_utils.oci_config.get_profile())
-        api_key_file = oci_cfg[
-            'key_file'] if 'key_file' in oci_cfg else 'BadConf'
-        sky_cfg_file = oci_utils.oci_config.get_sky_user_config_file()
+        try:
+            oci_cfg_file = oci_adaptor.get_config_file()
+            # Pass in a profile parameter so that multiple profiles in the
+            # OCI config file are supported (2023/06/09).
+            oci_cfg = oci_adaptor.get_oci_config(
+                profile=oci_utils.oci_config.get_profile())
+            api_key_file = oci_cfg[
+                'key_file'] if 'key_file' in oci_cfg else 'BadConf'
+            sky_cfg_file = oci_utils.oci_config.get_sky_user_config_file()
+        except ImportError:
+            return {}
 
         # OCI config and API key file are mandatory
         credential_files = [oci_cfg_file, api_key_file]
diff --git a/sky/clouds/service_catalog/kubernetes_catalog.py b/sky/clouds/service_catalog/kubernetes_catalog.py
index 9365d693cbd..24daeabf9d4 100644
--- a/sky/clouds/service_catalog/kubernetes_catalog.py
+++ b/sky/clouds/service_catalog/kubernetes_catalog.py
@@ -68,26 +68,35 @@ def list_accelerators_realtime(
     # TODO(romilb): This should be refactored to use get_kubernetes_node_info()
     # function from kubernetes_utils.
     del all_regions, require_price  # Unused.
+    # TODO(zhwu): this should return all accelerators in multiple kubernetes
+    # clusters defined by allowed_contexts.
+ if region_filter is None: + context = kubernetes_utils.get_current_kube_config_context_name() + else: + context = region_filter + if context is None: + return {}, {}, {} + k8s_cloud = Kubernetes() if not any( map(k8s_cloud.is_same_cloud, sky_check.get_cached_enabled_clouds_or_refresh()) - ) or not kubernetes_utils.check_credentials()[0]: + ) or not kubernetes_utils.check_credentials(context)[0]: return {}, {}, {} - has_gpu = kubernetes_utils.detect_gpu_resource() + has_gpu = kubernetes_utils.detect_gpu_resource(context) if not has_gpu: return {}, {}, {} - label_formatter, _ = kubernetes_utils.detect_gpu_label_formatter() + label_formatter, _ = kubernetes_utils.detect_gpu_label_formatter(context) if not label_formatter: return {}, {}, {} accelerators_qtys: Set[Tuple[str, int]] = set() key = label_formatter.get_label_key() - nodes = kubernetes_utils.get_kubernetes_nodes() + nodes = kubernetes_utils.get_kubernetes_nodes(context) # Get the pods to get the real-time GPU usage - pods = kubernetes_utils.get_kubernetes_pods() + pods = kubernetes_utils.get_all_pods_in_kubernetes_cluster(context) # Total number of GPUs in the cluster total_accelerators_capacity: Dict[str, int] = {} # Total number of GPUs currently available in the cluster @@ -160,7 +169,7 @@ def list_accelerators_realtime( memory=None, price=0.0, spot_price=0.0, - region='kubernetes')) + region=context)) df = pd.DataFrame(result, columns=[ @@ -175,7 +184,6 @@ def list_accelerators_realtime( qtys_map = common.list_accelerators_impl('Kubernetes', df, gpus_only, name_filter, region_filter, quantity_filter, case_sensitive) - return qtys_map, total_accelerators_capacity, total_accelerators_available diff --git a/sky/provision/kubernetes/network.py b/sky/provision/kubernetes/network.py index 7b086473d64..a0c51624216 100644 --- a/sky/provision/kubernetes/network.py +++ b/sky/provision/kubernetes/network.py @@ -79,13 +79,14 @@ def _open_ports_using_ingress( ) # Prepare service names, ports, for template rendering - service_details = [(f'{cluster_name_on_cloud}--skypilot-svc--{port}', port, - _PATH_PREFIX.format( - cluster_name_on_cloud=cluster_name_on_cloud, - port=port, - namespace=kubernetes_utils. 
- get_current_kube_config_context_namespace()).rstrip( - '/').lstrip('/')) for port in ports] + service_details = [ + (f'{cluster_name_on_cloud}--skypilot-svc--{port}', port, + _PATH_PREFIX.format( + cluster_name_on_cloud=cluster_name_on_cloud, + port=port, + namespace=kubernetes_utils.get_kube_config_context_namespace( + context)).rstrip('/').lstrip('/')) for port in ports + ] # Generate ingress and services specs # We batch ingress rule creation because each rule triggers a hot reload of @@ -171,7 +172,8 @@ def _cleanup_ports_for_ingress( for port in ports: service_name = f'{cluster_name_on_cloud}--skypilot-svc--{port}' network_utils.delete_namespaced_service( - namespace=provider_config.get('namespace', 'default'), + namespace=provider_config.get('namespace', + kubernetes_utils.DEFAULT_NAMESPACE), service_name=service_name, ) @@ -208,11 +210,13 @@ def query_ports( return _query_ports_for_ingress( cluster_name_on_cloud=cluster_name_on_cloud, ports=ports, + provider_config=provider_config, ) elif port_mode == kubernetes_enums.KubernetesPortMode.PODIP: return _query_ports_for_podip( cluster_name_on_cloud=cluster_name_on_cloud, ports=ports, + provider_config=provider_config, ) else: return {} @@ -231,8 +235,14 @@ def _query_ports_for_loadbalancer( result: Dict[int, List[common.Endpoint]] = {} service_name = _LOADBALANCER_SERVICE_NAME.format( cluster_name_on_cloud=cluster_name_on_cloud) + context = provider_config.get( + 'context', kubernetes_utils.get_current_kube_config_context_name()) + namespace = provider_config.get( + 'namespace', + kubernetes_utils.get_kube_config_context_namespace(context)) external_ip = network_utils.get_loadbalancer_ip( - namespace=provider_config.get('namespace', 'default'), + context=context, + namespace=namespace, service_name=service_name, # Timeout is set so that we can retry the query when the # cluster is firstly created and the load balancer is not ready yet. @@ -251,19 +261,24 @@ def _query_ports_for_loadbalancer( def _query_ports_for_ingress( cluster_name_on_cloud: str, ports: List[int], + provider_config: Dict[str, Any], ) -> Dict[int, List[common.Endpoint]]: - ingress_details = network_utils.get_ingress_external_ip_and_ports() + context = provider_config.get( + 'context', kubernetes_utils.get_current_kube_config_context_name()) + ingress_details = network_utils.get_ingress_external_ip_and_ports(context) external_ip, external_ports = ingress_details if external_ip is None: return {} + namespace = provider_config.get( + 'namespace', + kubernetes_utils.get_kube_config_context_namespace(context)) result: Dict[int, List[common.Endpoint]] = {} for port in ports: path_prefix = _PATH_PREFIX.format( cluster_name_on_cloud=cluster_name_on_cloud, port=port, - namespace=kubernetes_utils. 
- get_current_kube_config_context_namespace()) + namespace=namespace) http_port, https_port = external_ports \ if external_ports is not None else (None, None) @@ -282,10 +297,15 @@ def _query_ports_for_ingress( def _query_ports_for_podip( cluster_name_on_cloud: str, ports: List[int], + provider_config: Dict[str, Any], ) -> Dict[int, List[common.Endpoint]]: - namespace = kubernetes_utils.get_current_kube_config_context_namespace() + context = provider_config.get( + 'context', kubernetes_utils.get_current_kube_config_context_name()) + namespace = provider_config.get( + 'namespace', + kubernetes_utils.get_kube_config_context_namespace(context)) pod_name = kubernetes_utils.get_head_pod_name(cluster_name_on_cloud) - pod_ip = network_utils.get_pod_ip(namespace, pod_name) + pod_ip = network_utils.get_pod_ip(context, namespace, pod_name) result: Dict[int, List[common.Endpoint]] = {} if pod_ip is None: diff --git a/sky/provision/kubernetes/network_utils.py b/sky/provision/kubernetes/network_utils.py index ba126197446..a1d919a6766 100644 --- a/sky/provision/kubernetes/network_utils.py +++ b/sky/provision/kubernetes/network_utils.py @@ -220,10 +220,11 @@ def ingress_controller_exists(context: str, def get_ingress_external_ip_and_ports( + context: str, namespace: str = 'ingress-nginx' ) -> Tuple[Optional[str], Optional[Tuple[int, int]]]: """Returns external ip and ports for the ingress controller.""" - core_api = kubernetes.core_api() + core_api = kubernetes.core_api(context) ingress_services = [ item for item in core_api.list_namespaced_service( namespace, _request_timeout=kubernetes.API_TIMEOUT).items @@ -257,11 +258,12 @@ def get_ingress_external_ip_and_ports( return external_ip, None -def get_loadbalancer_ip(namespace: str, +def get_loadbalancer_ip(context: str, + namespace: str, service_name: str, timeout: int = 0) -> Optional[str]: """Returns the IP address of the load balancer.""" - core_api = kubernetes.core_api() + core_api = kubernetes.core_api(context) ip = None @@ -282,9 +284,9 @@ def get_loadbalancer_ip(namespace: str, return ip -def get_pod_ip(namespace: str, pod_name: str) -> Optional[str]: +def get_pod_ip(context: str, namespace: str, pod_name: str) -> Optional[str]: """Returns the IP address of the pod.""" - core_api = kubernetes.core_api() + core_api = kubernetes.core_api(context) pod = core_api.read_namespaced_pod(pod_name, namespace, _request_timeout=kubernetes.API_TIMEOUT) diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index a8abb24b917..f31652030a5 100644 --- a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -1,5 +1,6 @@ """Kubernetes utilities for SkyPilot.""" import dataclasses +import functools import json import math import os @@ -307,7 +308,9 @@ class KarpenterLabelFormatter(SkyPilotLabelFormatter): } +@functools.lru_cache() def detect_gpu_label_formatter( + context: str ) -> Tuple[Optional[GPULabelFormatter], Dict[str, List[Tuple[str, str]]]]: """Detects the GPU label formatter for the Kubernetes cluster @@ -318,7 +321,7 @@ def detect_gpu_label_formatter( """ # Get all labels across all nodes node_labels: Dict[str, List[Tuple[str, str]]] = {} - nodes = get_kubernetes_nodes() + nodes = get_kubernetes_nodes(context) for node in nodes: node_labels[node.metadata.name] = [] for label, value in node.metadata.labels.items(): @@ -338,7 +341,8 @@ def detect_gpu_label_formatter( return label_formatter, node_labels -def detect_gpu_resource() -> Tuple[bool, Set[str]]: +@functools.lru_cache(maxsize=10) +def 
detect_gpu_resource(context: str) -> Tuple[bool, Set[str]]: """Checks if the Kubernetes cluster has nvidia.com/gpu resource. If nvidia.com/gpu resource is missing, that typically means that the @@ -350,7 +354,7 @@ def detect_gpu_resource() -> Tuple[bool, Set[str]]: """ # Get the set of resources across all nodes cluster_resources: Set[str] = set() - nodes = get_kubernetes_nodes() + nodes = get_kubernetes_nodes(context) for node in nodes: cluster_resources.update(node.status.allocatable.keys()) has_gpu = 'nvidia.com/gpu' in cluster_resources @@ -358,12 +362,17 @@ def detect_gpu_resource() -> Tuple[bool, Set[str]]: return has_gpu, cluster_resources -def get_kubernetes_nodes() -> List[Any]: - # TODO(romilb): Calling kube API can take between 10-100ms depending on - # the control plane. Consider caching calls to this function (using - # kubecontext hash as key). +@functools.lru_cache(maxsize=10) +def get_kubernetes_nodes(context: Optional[str] = None) -> List[Any]: + """Gets the kubernetes nodes in the context. + + If context is None, gets the nodes in the current context. + """ + if context is None: + context = get_current_kube_config_context_name() + try: - nodes = kubernetes.core_api().list_node( + nodes = kubernetes.core_api(context).list_node( _request_timeout=kubernetes.API_TIMEOUT).items except kubernetes.max_retry_error(): raise exceptions.ResourcesUnavailableError( @@ -373,15 +382,18 @@ def get_kubernetes_nodes() -> List[Any]: return nodes -def get_kubernetes_pods() -> List[Any]: - """Gets the kubernetes pods in the current namespace and current context. +def get_all_pods_in_kubernetes_cluster( + context: Optional[str] = None) -> List[Any]: + """Gets pods in all namespaces in kubernetes cluster indicated by context. Used for computing cluster resource usage. """ + if context is None: + context = get_current_kube_config_context_name() + try: - ns = get_current_kube_config_context_namespace() - pods = kubernetes.core_api().list_namespaced_pod( - ns, _request_timeout=kubernetes.API_TIMEOUT).items + pods = kubernetes.core_api(context).list_pod_for_all_namespaces( + _request_timeout=kubernetes.API_TIMEOUT).items except kubernetes.max_retry_error(): raise exceptions.ResourcesUnavailableError( 'Timed out when trying to get pod info from Kubernetes cluster. ' @@ -390,7 +402,8 @@ def get_kubernetes_pods() -> List[Any]: return pods -def check_instance_fits(instance: str) -> Tuple[bool, Optional[str]]: +def check_instance_fits(context: str, + instance: str) -> Tuple[bool, Optional[str]]: """Checks if the instance fits on the Kubernetes cluster. If the instance has GPU requirements, checks if the GPU type is @@ -405,6 +418,9 @@ def check_instance_fits(instance: str) -> Tuple[bool, Optional[str]]: Optional[str]: Error message if the instance does not fit. """ + # TODO(zhwu): this should check the node for specific context, instead + # of the default context to make failover fully functional. + def check_cpu_mem_fits(candidate_instance_type: 'KubernetesInstanceType', node_list: List[Any]) -> Tuple[bool, Optional[str]]: """Checks if the instance fits on the cluster based on CPU and memory. 
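The `@functools.lru_cache` decorators added above are why `context` becomes an explicit argument on these helpers: `lru_cache` keys on the arguments, so results are memoized per kubeconfig context. A tiny sketch of that behavior (names are illustrative, not SkyPilot code):

```python
import functools

@functools.lru_cache(maxsize=10)
def scan_cluster(context: str) -> str:
    # Body runs once per distinct context; later calls hit the cache.
    print(f'scanning {context}...')
    return f'nodes-of-{context}'

scan_cluster('kind-a')  # prints 'scanning kind-a...'
scan_cluster('kind-a')  # served from the cache; no print
scan_cluster('kind-b')  # new cache key; runs again
```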
@@ -431,7 +447,7 @@ def check_cpu_mem_fits(candidate_instance_type: 'KubernetesInstanceType', 'Maximum resources found on a single node: ' f'{max_cpu} CPUs, {common_utils.format_float(max_mem)}G Memory') - nodes = get_kubernetes_nodes() + nodes = get_kubernetes_nodes(context) k8s_instance_type = KubernetesInstanceType.\ from_instance_type(instance) acc_type = k8s_instance_type.accelerator_type @@ -439,7 +455,8 @@ def check_cpu_mem_fits(candidate_instance_type: 'KubernetesInstanceType', # If GPUs are requested, check if GPU type is available, and if so, # check if CPU and memory requirements on the specific node are met. try: - gpu_label_key, gpu_label_val = get_gpu_label_key_value(acc_type) + gpu_label_key, gpu_label_val = get_gpu_label_key_value( + context, acc_type) except exceptions.ResourcesUnavailableError as e: # If GPU not found, return empty list and error message. return False, str(e) @@ -471,7 +488,9 @@ def check_cpu_mem_fits(candidate_instance_type: 'KubernetesInstanceType', return fits, reason -def get_gpu_label_key_value(acc_type: str, check_mode=False) -> Tuple[str, str]: +def get_gpu_label_key_value(context: str, + acc_type: str, + check_mode=False) -> Tuple[str, str]: """Returns the label key and value for the given GPU type. Args: @@ -512,11 +531,11 @@ def get_gpu_label_key_value(acc_type: str, check_mode=False) -> Tuple[str, str]: f' {autoscaler_type}') return formatter.get_label_key(), formatter.get_label_value(acc_type) - has_gpus, cluster_resources = detect_gpu_resource() + has_gpus, cluster_resources = detect_gpu_resource(context) if has_gpus: # Check if the cluster has GPU labels setup correctly label_formatter, node_labels = \ - detect_gpu_label_formatter() + detect_gpu_label_formatter(context) if label_formatter is None: # If none of the GPU labels from LABEL_FORMATTER_REGISTRY are # detected, raise error @@ -632,7 +651,7 @@ def get_external_ip(network_mode: Optional[ return parsed_url.hostname -def check_credentials(timeout: int = kubernetes.API_TIMEOUT) -> \ +def check_credentials(context: str, timeout: int = kubernetes.API_TIMEOUT) -> \ Tuple[bool, Optional[str]]: """Check if the credentials in kubeconfig file are valid @@ -644,10 +663,9 @@ def check_credentials(timeout: int = kubernetes.API_TIMEOUT) -> \ str: Error message if credentials are invalid, None otherwise """ try: - ns = get_current_kube_config_context_namespace() - context = get_current_kube_config_context_name() + namespace = get_kube_config_context_namespace(context) kubernetes.core_api(context).list_namespaced_pod( - ns, _request_timeout=timeout) + namespace, _request_timeout=timeout) except ImportError: # TODO(romilb): Update these error strs to also include link to docs # when docs are ready. @@ -676,7 +694,7 @@ def check_credentials(timeout: int = kubernetes.API_TIMEOUT) -> \ # We now do softer checks to check if exec based auth is used and to # see if the cluster is GPU-enabled. - _, exec_msg = is_kubeconfig_exec_auth() + _, exec_msg = is_kubeconfig_exec_auth(context) # We now check if GPUs are available and labels are set correctly on the # cluster, and if not we return hints that may help debug any issues. @@ -685,7 +703,7 @@ def check_credentials(timeout: int = kubernetes.API_TIMEOUT) -> \ # provider if their cluster GPUs are not setup correctly. 
gpu_msg = '' try: - _, _ = get_gpu_label_key_value(acc_type='', check_mode=True) + _, _ = get_gpu_label_key_value(context, acc_type='', check_mode=True) except exceptions.ResourcesUnavailableError as e: # If GPUs are not available, we return cluster as enabled (since it can # be a CPU-only cluster) but we also return the exception message which @@ -701,7 +719,8 @@ def check_credentials(timeout: int = kubernetes.API_TIMEOUT) -> \ return True, None -def is_kubeconfig_exec_auth() -> Tuple[bool, Optional[str]]: +def is_kubeconfig_exec_auth( + context: Optional[str] = None) -> Tuple[bool, Optional[str]]: """Checks if the kubeconfig file uses exec-based authentication Exec-based auth is commonly used for authenticating with cloud hosted @@ -735,8 +754,16 @@ def is_kubeconfig_exec_auth() -> Tuple[bool, Optional[str]]: return False, None # Get active context and user from kubeconfig using k8s api - _, current_context = k8s.config.list_kube_config_contexts() - target_username = current_context['context']['user'] + all_contexts, current_context = k8s.config.list_kube_config_contexts() + context_obj = current_context + if context is not None: + for c in all_contexts: + if c['name'] == context: + context_obj = c + break + else: + raise ValueError(f'Kubernetes context {context!r} not found.') + target_username = context_obj['context']['user'] # K8s api does not provide a mechanism to get the user details from the # context. We need to load the kubeconfig file and parse it to get the @@ -759,7 +786,7 @@ def is_kubeconfig_exec_auth() -> Tuple[bool, Optional[str]]: schemas.get_default_remote_identity('kubernetes')) if ('exec' in user_details.get('user', {}) and remote_identity == schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value): - ctx_name = current_context['name'] + ctx_name = context_obj['name'] exec_msg = ('exec-based authentication is used for ' f'Kubernetes context {ctx_name!r}.' ' This may cause issues with autodown or when running ' @@ -775,6 +802,7 @@ def is_kubeconfig_exec_auth() -> Tuple[bool, Optional[str]]: return False, None +@functools.lru_cache() def get_current_kube_config_context_name() -> Optional[str]: """Get the current kubernetes context from the kubeconfig file @@ -789,7 +817,27 @@ def get_current_kube_config_context_name() -> Optional[str]: return None -def get_current_kube_config_context_namespace() -> str: +def get_all_kube_config_context_names() -> Optional[List[str]]: + """Get all kubernetes context names from the kubeconfig file. + + We should not cache the result of this function as the admin policy may + update the contexts. 
+ + Returns: + List[str] | None: The list of kubernetes context names if it exists, + None otherwise + """ + k8s = kubernetes.kubernetes + try: + all_contexts, _ = k8s.config.list_kube_config_contexts() + return [context['name'] for context in all_contexts] + except k8s.config.config_exception.ConfigException: + return None + + +@functools.lru_cache() +def get_kube_config_context_namespace( + context_name: Optional[str] = None) -> str: """Get the current kubernetes context namespace from the kubeconfig file Returns: @@ -804,9 +852,17 @@ def get_current_kube_config_context_namespace() -> str: return f.read().strip() # If not in-cluster, get the namespace from kubeconfig try: - _, current_context = k8s.config.list_kube_config_contexts() - if 'namespace' in current_context['context']: - return current_context['context']['namespace'] + contexts, current_context = k8s.config.list_kube_config_contexts() + if context_name is None: + context = current_context + else: + context = next((c for c in contexts if c['name'] == context_name), + None) + if context is None: + return DEFAULT_NAMESPACE + + if 'namespace' in context['context']: + return context['context']['namespace'] else: return DEFAULT_NAMESPACE except k8s.config.config_exception.ConfigException: @@ -987,11 +1043,12 @@ def construct_ssh_jump_command( def get_ssh_proxy_command( - k8s_ssh_target: str, - network_mode: kubernetes_enums.KubernetesNetworkingMode, - private_key_path: Optional[str] = None, - namespace: Optional[str] = None, - context: Optional[str] = None) -> str: + k8s_ssh_target: str, + network_mode: kubernetes_enums.KubernetesNetworkingMode, + private_key_path: str, + context: str, + namespace: str, +) -> str: """Generates the SSH proxy command to connect to the pod. Uses a jump pod if the network mode is NODEPORT, and direct port-forwarding @@ -1048,8 +1105,6 @@ def get_ssh_proxy_command( private_key_path, ssh_jump_ip, ssh_jump_port=ssh_jump_port) else: ssh_jump_proxy_command_path = create_proxy_command_script() - current_context = get_current_kube_config_context_name() - current_namespace = get_current_kube_config_context_namespace() ssh_jump_proxy_command = construct_ssh_jump_command( private_key_path, ssh_jump_ip, @@ -1059,8 +1114,8 @@ def get_ssh_proxy_command( # We embed both the current context and namespace to the SSH proxy # command to make sure SSH still works when the current # context/namespace is changed by the user. - current_kube_context=current_context, - current_kube_namespace=current_namespace) + current_kube_context=context, + current_kube_namespace=namespace) return ssh_jump_proxy_command @@ -1647,7 +1702,8 @@ def get_autoscaler_type( } -def get_spot_label() -> Tuple[Optional[str], Optional[str]]: +def get_spot_label( + context: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]: """Get the spot label key and value for using spot instances, if supported. 
Checks if the underlying cluster supports spot instances by checking nodes @@ -1661,7 +1717,7 @@ def get_spot_label() -> Tuple[Optional[str], Optional[str]]: """ # Check if the cluster supports spot instances by checking nodes for known # spot label keys and values - for node in get_kubernetes_nodes(): + for node in get_kubernetes_nodes(context): for _, (key, value) in SPOT_LABEL_MAP.items(): if key in node.metadata.labels and node.metadata.labels[ key] == value: @@ -1706,7 +1762,8 @@ class KubernetesNodeInfo: free: Dict[str, int] -def get_kubernetes_node_info() -> Dict[str, KubernetesNodeInfo]: +def get_kubernetes_node_info( + context: Optional[str] = None) -> Dict[str, KubernetesNodeInfo]: """Gets the resource information for all the nodes in the cluster. Currently only GPU resources are supported. The function returns the total @@ -1717,11 +1774,11 @@ def get_kubernetes_node_info() -> Dict[str, KubernetesNodeInfo]: Dict[str, KubernetesNodeInfo]: Dictionary containing the node name as key and the KubernetesNodeInfo object as value """ - nodes = get_kubernetes_nodes() + nodes = get_kubernetes_nodes(context) # Get the pods to get the real-time resource usage - pods = get_kubernetes_pods() + pods = get_all_pods_in_kubernetes_cluster(context) - label_formatter, _ = detect_gpu_label_formatter() + label_formatter, _ = detect_gpu_label_formatter(context) if not label_formatter: label_key = None else: @@ -1773,8 +1830,9 @@ def to_label_selector(tags): def get_namespace_from_config(provider_config: Dict[str, Any]) -> str: + context = get_context_from_config(provider_config) return provider_config.get('namespace', - get_current_kube_config_context_namespace()) + get_kube_config_context_namespace(context)) def filter_pods(namespace: str, @@ -1802,8 +1860,10 @@ def filter_pods(namespace: str, return {pod.metadata.name: pod for pod in pods} -def _remove_pod_annotation(pod: Any, annotation_key: str, - namespace: str) -> None: +def _remove_pod_annotation(pod: Any, + annotation_key: str, + namespace: str, + context: Optional[str] = None) -> None: """Removes specified Annotations from a Kubernetes pod.""" try: # Remove the specified annotation @@ -1811,7 +1871,7 @@ def _remove_pod_annotation(pod: Any, annotation_key: str, if annotation_key in pod.metadata.annotations: # Patch the pod with the updated metadata. 
body = {'metadata': {'annotations': {annotation_key: None}}} - kubernetes.core_api().patch_namespaced_pod( + kubernetes.core_api(context).patch_namespaced_pod( name=pod.metadata.name, namespace=namespace, body=body, @@ -1830,13 +1890,15 @@ def _remove_pod_annotation(pod: Any, annotation_key: str, raise -def _add_pod_annotation(pod: Any, annotation: Dict[str, str], - namespace: str) -> None: +def _add_pod_annotation(pod: Any, + annotation: Dict[str, str], + namespace: str, + context: Optional[str] = None) -> None: """Adds specified Annotations on a Kubernetes pod.""" try: # Patch the pod with the updated metadata body = {'metadata': {'annotations': annotation}} - kubernetes.core_api().patch_namespaced_pod( + kubernetes.core_api(context).patch_namespaced_pod( name=pod.metadata.name, namespace=namespace, body=body, @@ -1877,10 +1939,12 @@ def set_autodown_annotations(handle: 'backends.CloudVmRayResourceHandle', autodown_annotation = {AUTODOWN_ANNOTATION_KEY: 'true'} _add_pod_annotation(pod=pod, annotation=idle_minutes_to_autostop_annotation, - namespace=namespace) + namespace=namespace, + context=context) _add_pod_annotation(pod=pod, annotation=autodown_annotation, - namespace=namespace) + namespace=namespace, + context=context) # If idle_minutes_to_autostop is negative, it indicates a request to # cancel autostop using the --cancel flag with the `sky autostop` @@ -1890,10 +1954,12 @@ def set_autodown_annotations(handle: 'backends.CloudVmRayResourceHandle', _remove_pod_annotation( pod=pod, annotation_key=IDLE_MINUTES_TO_AUTOSTOP_ANNOTATION_KEY, - namespace=namespace) + namespace=namespace, + context=context) _remove_pod_annotation(pod=pod, annotation_key=AUTODOWN_ANNOTATION_KEY, - namespace=namespace) + namespace=namespace, + context=context) def get_context_from_config(provider_config: Dict[str, Any]) -> str: diff --git a/sky/templates/kubernetes-ray.yml.j2 b/sky/templates/kubernetes-ray.yml.j2 index 1b09409ad0e..b807fd2135b 100644 --- a/sky/templates/kubernetes-ray.yml.j2 +++ b/sky/templates/kubernetes-ray.yml.j2 @@ -18,7 +18,7 @@ provider: region: kubernetes - # The namespace to create the Ray cluster in. + namespace: {{k8s_namespace}} # The kubecontext used to connect to the Kubernetes cluster. diff --git a/sky/utils/schemas.py b/sky/utils/schemas.py index a50c400b805..6e752f73ebc 100644 --- a/sky/utils/schemas.py +++ b/sky/utils/schemas.py @@ -775,6 +775,12 @@ def get_config_schema(): 'required': [], 'additionalProperties': False, 'properties': { + 'allowed_contexts': { + 'type': 'array', + 'items': { + 'type': 'string', + }, + }, 'networking': { 'type': 'string', 'case_insensitive_enum': [ diff --git a/tests/common.py b/tests/common.py index b6cefda22b8..c6f08588d99 100644 --- a/tests/common.py +++ b/tests/common.py @@ -70,3 +70,16 @@ def _get_az_mappings(_): lambda *_args, **_kwargs: [True, '']) monkeypatch.setattr('sky.provision.kubernetes.utils.get_spot_label', lambda *_args, **_kwargs: [None, None]) + + # monkeypatch class Kubernetes. 
+ monkeypatch.setattr( + 'sky.clouds.kubernetes.Kubernetes.regions_with_offering', + lambda *_args, **_kwargs: [clouds.Region('my-k8s-cluster-context')]) + + def kubernetes_validate_region_zone(self, region, zone): + if region == 'my-k8s-cluster-context': + return region, zone + raise ValueError(f'Invalid region: {region}, zone: {zone}') + + monkeypatch.setattr('sky.clouds.kubernetes.Kubernetes.validate_region_zone', + kubernetes_validate_region_zone) diff --git a/tests/test_smoke.py b/tests/test_smoke.py index c616d9a8b30..c85a68b9862 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -5582,3 +5582,89 @@ def test_sky_bench(generic_cloud: str): f'sky bench down {name} -y; sky bench delete {name} -y', ) run_one_test(test) + + +@pytest.mark.kubernetes +def test_kubernetes_context_failover(): + """Test if the kubernetes context failover works. + + This test requires two kubernetes clusters: + - kind-skypilot: the local cluster with mock labels for 8 H100 GPUs. + - another accessible cluster: with enough CPUs + To start the first cluster, run: + sky local up + # Add mock label for accelerator + kubectl label node --overwrite skypilot-control-plane skypilot.co/accelerator=h100 --context kind-skypilot + # Get the token for the cluster in context kind-skypilot + TOKEN=$(kubectl config view --minify --context kind-skypilot -o jsonpath=\'{.users[0].user.token}\') + # Get the API URL for the cluster in context kind-skypilot + API_URL=$(kubectl config view --minify --context kind-skypilot -o jsonpath=\'{.clusters[0].cluster.server}\') + # Add mock capacity for GPU + curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $TOKEN" --request PATCH --data \'[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]\' "$API_URL/api/v1/nodes/skypilot-control-plane/status" + # Add a new namespace to test the handling of namespaces + kubectl create namespace test-namespace --context kind-skypilot + # Set the namespace to test-namespace + kubectl config set-context kind-skypilot --namespace=test-namespace --context kind-skypilot + """ + # Get context that is not kind-skypilot + contexts = subprocess.check_output('kubectl config get-contexts -o name', + shell=True).decode('utf-8').split('\n') + context = [context for context in contexts if context != 'kind-skypilot'][0] + config = textwrap.dedent(f"""\ + kubernetes: + allowed_contexts: + - kind-skypilot + - {context} + """) + with tempfile.NamedTemporaryFile(delete=True) as f: + f.write(config.encode('utf-8')) + f.flush() + name = _get_cluster_name() + test = Test( + 'kubernetes-context-failover', + [ + # Check if kind-skypilot is provisioned with H100 annotations already + 'NODE_INFO=$(kubectl get nodes -o yaml --context kind-skypilot) && ' + 'echo "$NODE_INFO" | grep nvidia.com/gpu | grep 8 && ' + 'echo "$NODE_INFO" | grep skypilot.co/accelerator | grep h100 || ' + '{ echo "kind-skypilot does not exist ' + 'or does not have mock labels for GPUs. Check the instructions in ' + 'tests/test_smoke.py::test_kubernetes_context_failover." && exit 1; }', + # Check namespace for kind-skypilot is test-namespace + 'kubectl get namespaces --context kind-skypilot | grep test-namespace || ' + '{ echo "Should set the namespace to test-namespace for kind-skypilot. Check the instructions in ' + 'tests/test_smoke.py::test_kubernetes_context_failover." 
&& exit 1; }',
+            'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 3, 4, 5, 6, 7, 8"',
+            # Get contexts and set current context to the other cluster that is not kind-skypilot
+            f'kubectl config use-context {context}',
+            # H100 should not be in the current context
+            '! sky show-gpus --cloud kubernetes | grep H100',
+            f'sky launch -y -c {name}-1 --cpus 1 echo hi',
+            f'sky logs {name}-1 --status',
+            # It should not be launched on kind-skypilot
+            f'sky status -a {name}-1 | grep "{context}"',
+            # Test failure for launching H100 on other cluster
+            f'sky launch -y -c {name}-2 --gpus H100 --cpus 1 --cloud kubernetes --region {context} echo hi && exit 1 || true',
+            # Test failover
+            f'sky launch -y -c {name}-3 --gpus H100 --cpus 1 --cloud kubernetes echo hi',
+            f'sky logs {name}-3 --status',
+            # Test pods
+            f'kubectl get pods --context kind-skypilot | grep "{name}-3"',
+            # It should be launched on kind-skypilot
+            f'sky status -a {name}-3 | grep "kind-skypilot"',
+            # Should be 7 free GPUs
+            f'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep " 7"',
+            # Remove the line with "kind-skypilot"
+            f'sed -i "/kind-skypilot/d" {f.name}',
+            # Should still be able to exec and launch on existing cluster
+            f'sky exec {name}-3 "echo hi"',
+            f'sky logs {name}-3 --status',
+            f'sky status -r {name}-3 | grep UP',
+            f'sky launch -c {name}-3 --gpus h100 echo hi',
+            f'sky logs {name}-3 --status',
+            f'sky status -r {name}-3 | grep UP',
+        ],
+        f'sky down -y {name}-1 {name}-3',
+        env={'SKYPILOT_CONFIG': f.name},
+    )
+    run_one_test(test)
diff --git a/tests/unit_tests/test_admin_policy.py b/tests/unit_tests/test_admin_policy.py
index 96b666493d3..48e47a6007c 100644
--- a/tests/unit_tests/test_admin_policy.py
+++ b/tests/unit_tests/test_admin_policy.py
@@ -170,3 +170,21 @@ def _gen_cluster_record(status: sky.ClusterStatus, autostop: int) -> dict:
         os.path.join(POLICY_PATH, 'enforce_autostop.yaml'),
         idle_minutes_to_autostop=None)
+
+
+@mock.patch('sky.provision.kubernetes.utils.get_all_kube_config_context_names',
+            return_value=['kind-skypilot', 'kind-skypilot2', 'kind-skypilot3'])
+def test_dynamic_kubernetes_contexts_policy(add_example_policy_paths, task):
+    _, config = _load_task_and_apply_policy(
+        task,
+        os.path.join(POLICY_PATH, 'dynamic_kubernetes_contexts_update.yaml'))
+
+    assert config.get_nested(
+        ('kubernetes', 'allowed_contexts'),
+        None) == ['kind-skypilot', 'kind-skypilot2'
+                 ], 'Kubernetes allowed contexts should be updated'
+
+    assert skypilot_config.get_nested(
+        ('kubernetes', 'allowed_contexts'),
+        None) == ['kind-skypilot',
+                  'kind-skypilot2'], 'Global skypilot config should be updated'
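The unit test above stubs out context discovery rather than reading a real kubeconfig. A minimal sketch of that mocking pattern (module and helper names here are illustrative, not SkyPilot's):

```python
from unittest import mock

def get_context_names():
    # Stand-in for a kubeconfig-reading helper; would fail without one.
    raise RuntimeError('no kubeconfig available')

def allowed(all_contexts, allow_list):
    # Keep only allow-listed contexts that actually exist.
    return [c for c in allow_list if c in all_contexts]

with mock.patch(f'{__name__}.get_context_names',
                return_value=['kind-a', 'kind-b']):
    assert allowed(get_context_names(), ['kind-b', 'kind-c']) == ['kind-b']
print('ok')
```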
From 5cef688680ad0baff798c355ae771dc59105b25b Mon Sep 17 00:00:00 2001
From: Tian Xia
Date: Thu, 26 Sep 2024 17:00:49 -0700
Subject: [PATCH 22/93] [Tests] Fix smoke test for GCP disk tier `best` (#3992)

* [Tests] Fix smoke test for GCP disk tier `best`

* Update tests/test_smoke.py

Co-authored-by: Zhanghao Wu

* test multiple possibilities for gcp disk tier

* Update tests/test_smoke.py

Co-authored-by: Zhanghao Wu

---------

Co-authored-by: Zhanghao Wu
---
 tests/test_smoke.py | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/tests/test_smoke.py b/tests/test_smoke.py
index c85a68b9862..4d81015f9cd 100644
--- a/tests/test_smoke.py
+++ b/tests/test_smoke.py
@@ -3427,26 +3427,37 @@ def _get_aws_query_command(region, instance_id, field, expected):
 
 @pytest.mark.gcp
 def test_gcp_disk_tier():
     for disk_tier in list(resources_utils.DiskTier):
-        type = GCP._get_disk_type(disk_tier)
+        disk_types = [GCP._get_disk_type(disk_tier)]
         name = _get_cluster_name() + '-' + disk_tier.value
         name_on_cloud = common_utils.make_cluster_name_on_cloud(
             name, sky.GCP.max_cluster_name_length())
         region = 'us-west2'
-        test = Test(
-            'gcp-disk-tier-' + disk_tier.value,
-            [
-                f'sky launch -y -c {name} --cloud gcp --region {region} '
-                f'--disk-tier {disk_tier.value} echo "hello sky"',
-                f'name=`gcloud compute instances list --filter='
-                f'"labels.ray-cluster-name:{name_on_cloud}" '
-                '--format="value(name)"`; '
-                f'gcloud compute disks list --filter="name=$name" '
-                f'--format="value(type)" | grep {type} '
-            ],
-            f'sky down -y {name}',
-            timeout=6 * 60,  # 6 mins  (it takes around ~3 mins)
-        )
-        run_one_test(test)
+        instance_type_options = ['']
+        if disk_tier == resources_utils.DiskTier.BEST:
+            # Ultra disk tier requires n2 instance types with at least 64 CPUs.
+            # If using the default instance type, only the high disk tier will
+            # be enabled.
+            disk_types = [
+                GCP._get_disk_type(resources_utils.DiskTier.HIGH),
+                GCP._get_disk_type(resources_utils.DiskTier.ULTRA),
+            ]
+            instance_type_options = ['', '--instance-type n2-standard-64']
+        for disk_type, instance_type_option in zip(disk_types,
+                                                   instance_type_options):
+            test = Test(
+                'gcp-disk-tier-' + disk_tier.value,
+                [
+                    f'sky launch -y -c {name} --cloud gcp --region {region} '
+                    f'--disk-tier {disk_tier.value} {instance_type_option} ',
+                    f'name=`gcloud compute instances list --filter='
+                    f'"labels.ray-cluster-name:{name_on_cloud}" '
+                    '--format="value(name)"`; '
+                    f'gcloud compute disks list --filter="name=$name" '
+                    f'--format="value(type)" | grep {disk_type} '
+                ],
+                f'sky down -y {name}',
+                timeout=6 * 60,  # 6 mins  (it takes around ~3 mins)
+            )
+            run_one_test(test)
 
 
 @pytest.mark.azure
From e6b8d2c086544ab5cfdb877ad414eafddaa49cb4 Mon Sep 17 00:00:00 2001
From: zpoint
Date: Fri, 27 Sep 2024 12:32:14 +0800
Subject: [PATCH 23/93] add AddKeysToAgent for ssh config file and ssh cmd
 (#3985)

* add AddKeysToAgent for ssh config file and ssh cmd

* fix pylint line too long

* rename to mpirun.yaml

* update comment

* reformat

* hint if ssh-agent not running

* reformat

* pylint

* pytest

* revert ssh agent checking

---------

Co-authored-by: root
---
 examples/mpirun.yaml          | 24 ++++++++++++++++++++++++
 sky/backends/backend_utils.py |  1 +
 sky/provision/provisioner.py  |  2 ++
 sky/utils/command_runner.py   |  4 ++++
 4 files changed, 31 insertions(+)
 create mode 100644 examples/mpirun.yaml

diff --git a/examples/mpirun.yaml b/examples/mpirun.yaml
new file mode 100644
index 00000000000..4ec7ce0107c
--- /dev/null
+++ b/examples/mpirun.yaml
@@ -0,0 +1,24 @@
+workdir: .
+
+resources:
+  cloud: aws
+
+num_nodes: 2 # Total number of nodes (1 head + 1 worker)
+
+setup: |
+  echo "Running setup on node ${SKYPILOT_NODE_RANK}."
+  # Install MPI if not already present. This will vary based on your OS/distro.
+  sudo apt update
+  sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev
+
+run: |
+  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
+    echo "head node"
+    num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
+    mpi_nodes=$(echo "$SKYPILOT_NODE_IPS" | tr '\n' ',')
+    mpi_nodes=${mpi_nodes::-1}
+    echo "$mpi_nodes"
+    mpirun -np $num_nodes -H $mpi_nodes bash -c 'echo "mpirun hello from IP $(hostname -I)"'
+  else
+    echo "worker nodes"
+  fi
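The run section above assembles the mpirun host list from `$SKYPILOT_NODE_IPS` in bash. The same transformation as a quick Python sketch, with an example value standing in for the real environment variable:

```python
node_ips = '10.0.0.1\n10.0.0.2'          # example $SKYPILOT_NODE_IPS value
hosts = ','.join(node_ips.splitlines())   # mirrors tr '\n' ',' plus trim
num_nodes = len(node_ips.splitlines())    # mirrors wc -l
print(f'mpirun -np {num_nodes} -H {hosts} hostname')
# -> mpirun -np 2 -H 10.0.0.1,10.0.0.2 hostname
```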
diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py
index d7211d18a65..b83817b9b42 100644
--- a/sky/backends/backend_utils.py
+++ b/sky/backends/backend_utils.py
@@ -428,6 +428,7 @@ def _get_generated_config(cls, autogen_comment: str, host_name: str,
           HostName {ip}
           User {username}
           IdentityFile {ssh_key_path}
+          AddKeysToAgent yes
           IdentitiesOnly yes
           ForwardAgent yes
           StrictHostKeyChecking no
diff --git a/sky/provision/provisioner.py b/sky/provision/provisioner.py
index 37b912db979..0c188599ae6 100644
--- a/sky/provision/provisioner.py
+++ b/sky/provision/provisioner.py
@@ -259,6 +259,8 @@ def _ssh_probe_command(ip: str,
         '-o',
         'IdentitiesOnly=yes',
         '-o',
+        'AddKeysToAgent=yes',
+        '-o',
         'ExitOnForwardFailure=yes',
         '-o',
         'ServerAliveInterval=5',
diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py
index 4d57854bf90..1cb1dfc88e6 100644
--- a/sky/utils/command_runner.py
+++ b/sky/utils/command_runner.py
@@ -85,6 +85,10 @@ def ssh_options_list(
         'LogLevel': 'ERROR',
         # Try fewer extraneous key pairs.
         'IdentitiesOnly': 'yes',
+        # Add the private key used for this SSH connection to the SSH
+        # agent, so that the ForwardAgent option can then make the SSH
+        # agent forward it.
+        'AddKeysToAgent': 'yes',
         # Abort if port forwarding fails (instead of just printing to
         # stderr).
         'ExitOnForwardFailure': 'yes',
From 836c5cde26da252862aa031f3ee7a7341bbb0048 Mon Sep 17 00:00:00 2001
From: Romil Bhardwaj
Date: Fri, 27 Sep 2024 09:42:55 -0700
Subject: [PATCH 24/93] [Examples] Add env vars to deepspeed example (#3981)

* Add env vars to deepspeed example

* Add env vars to deepspeed example

* Add env vars to deepspeed example
---
 examples/deepspeed-multinode/sky.yaml | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/examples/deepspeed-multinode/sky.yaml b/examples/deepspeed-multinode/sky.yaml
index 378992d66a4..37d7445a2a1 100644
--- a/examples/deepspeed-multinode/sky.yaml
+++ b/examples/deepspeed-multinode/sky.yaml
@@ -18,8 +18,15 @@ resources:
   # accelerators: A100-80GB:1 # Azure, GCP, SCP
   # accelerators: A10G:1 # AWS. Will OOM for (1) single_node/run_1.3b_lora.sh (2) multi_node/run_66b.sh.
   # accelerators: T4:1 # AWS, Azure, GCP. Will OOM for (1) single_node/run_1.3b_lora.sh (2) multi_node/run_66b.sh.
 
 num_nodes: 2
 
+envs:
+  MY_VAR_1: "hello"
+  MY_VAR_2: "world"
+  # List of env vars to propagate to all nodes in deepspeed. If you add an env above, add it to this list.
+  DEEPSPEED_ENVS: "MY_VAR_1,MY_VAR_2,SKYPILOT_NODE_RANK"
+
 setup: |
   git clone https://github.com/microsoft/DeepSpeedExamples.git || true
   cd DeepSpeedExamples
@@ -60,6 +67,10 @@ run: |
   HOSTFILE_PATH=/tmp/hostfile.${SKYPILOT_TASK_ID}
   python -c "import os;n_gpus=os.environ['SKYPILOT_NUM_GPUS_PER_NODE'];print('\n'.join([f'{ip} slots={n_gpus}' for ip in os.environ['SKYPILOT_NODE_IPS'].splitlines()]))" > ${HOSTFILE_PATH}
 
+  # Generate .deepspeed_env to propagate env vars to all workers spawned by DeepSpeed.
+ echo "Generating .deepspeed_env" + python3 -c 'import os; f = open(".deepspeed_env", "w"); f.write("\n".join(["{}=\"{}\"".format(var, os.getenv(var, "")) for var in os.getenv("DEEPSPEED_ENVS").split(",")])); f.write("\n"); f.close()' + echo "*******************************************" echo "Hostfile: ${HOSTFILE_PATH}" cat ${HOSTFILE_PATH} From dacf27348ae1446c3c93d0ee2fc57702c5366eac Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Fri, 27 Sep 2024 13:42:22 -0700 Subject: [PATCH 25/93] [Docs] Deployment on existing infra (#3926) * add onprem guide and script * add onprem guide and script * wip * updates * Updates and add figure * update show-gpus * Add deploy_remote_cluster.sh * Add deploy_remote_cluster.sh * Add deploy_remote_cluster.sh * Add deploy_remote_cluster.sh * Add light/dark images * updates * fix light dark images * updates * comments * comments * lint * existing clusters -> existing machines * Update images * Update images * comments and type hints * newline * comments --- docs/source/_static/custom.js | 1 + docs/source/docs/index.rst | 1 + .../sky-existing-infra-workflow-dark.png | Bin 0 -> 46000 bytes .../sky-existing-infra-workflow-light.png | Bin 0 -> 43221 bytes .../kubernetes/kubernetes-deployment.rst | 5 +- .../source/reservations/existing-machines.rst | 153 +++++++++++ docs/source/reservations/reservations.rst | 2 +- sky/cli.py | 128 ++++++++- sky/utils/kubernetes/deploy_remote_cluster.sh | 243 ++++++++++++++++++ sky/utils/log_utils.py | 98 ++++++- 10 files changed, 610 insertions(+), 21 deletions(-) create mode 100644 docs/source/images/sky-existing-infra-workflow-dark.png create mode 100644 docs/source/images/sky-existing-infra-workflow-light.png create mode 100644 docs/source/reservations/existing-machines.rst create mode 100755 sky/utils/kubernetes/deploy_remote_cluster.sh diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js index 3e5653295e0..1fa28105186 100644 --- a/docs/source/_static/custom.js +++ b/docs/source/_static/custom.js @@ -32,6 +32,7 @@ document.addEventListener('DOMContentLoaded', () => { { selector: '.toctree-l1 > a', text: 'Reserved, Capacity Blocks, DWS' }, { selector: '.toctree-l1 > a', text: 'Llama 3.2 (Meta)' }, { selector: '.toctree-l1 > a', text: 'Admin Policy Enforcement' }, + { selector: '.toctree-l1 > a', text: 'Using Existing Machines' }, ]; newItems.forEach(({ selector, text }) => { document.querySelectorAll(selector).forEach((el) => { diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index 6bf2d889582..d83bf7821c3 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -149,6 +149,7 @@ Read the research: :caption: Reserved & Existing Clusters ../reservations/reservations + Using Existing Machines <../reservations/existing-machines> ../reference/kubernetes/index .. 
toctree::
diff --git a/docs/source/images/sky-existing-infra-workflow-dark.png b/docs/source/images/sky-existing-infra-workflow-dark.png
new file mode 100644
index 0000000000000000000000000000000000000000..aaf4245bcf38a7f13f07a53c4eec68556b76e613
GIT binary patch
literal 46000
[binary image data omitted]

literal 0
HcmV?d00001

diff --git a/docs/source/images/sky-existing-infra-workflow-light.png b/docs/source/images/sky-existing-infra-workflow-light.png
new file mode 100644
index 0000000000000000000000000000000000000000..b2cf42b48787d47b738ea5aea41ad8f94b721afa
GIT binary patch
literal 43221
[binary image data omitted]

 ` guide to set up SkyPilot on your on-prem cluster.
+
+Alternatively, you can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools, such as `kubeadm `_, `k3s `_ or `Rancher `_.
diff --git a/docs/source/reservations/existing-machines.rst b/docs/source/reservations/existing-machines.rst
new file mode 100644
index 00000000000..2f9ac2a2441
--- /dev/null
+++ b/docs/source/reservations/existing-machines.rst
@@ -0,0 +1,153 @@
+.. _existing-machines:
+
+Deploy SkyPilot on existing machines
+====================================
+
+This guide will help you deploy SkyPilot on your existing machines, whether they are on-premises or reserved instances on a cloud provider.
+
+**Given a list of IP addresses and SSH credentials,**
+SkyPilot will install necessary dependencies on the remote machines and configure itself to run jobs and services on the cluster.
+
+..
+   Figure v1 (for deploy.sh): https://docs.google.com/drawings/d/1Jp1tTu1kxF-bIrS6LRMqoJ1dnxlFvn-iobVsXElXfAg/edit?usp=sharing
+   Figure v2: https://docs.google.com/drawings/d/1hMvOe1HX0ESoUbCvUowla2zO5YBacsdruo0dFqML9vo/edit?usp=sharing
+   Figure v2 Dark: https://docs.google.com/drawings/d/1AEdf9i3SO6MVnD7d-hwRumIfVndzNDqQmrFvRwwVEiU/edit
+
+.. figure:: ../images/sky-existing-infra-workflow-light.png
+   :width: 85%
+   :align: center
+   :alt: Deploying SkyPilot on existing machines
+   :class: no-scaled-link, only-light
+
+   Given a list of IP addresses and SSH keys, ``sky local up`` will install necessary dependencies on the remote machines and configure SkyPilot to run jobs and services on the cluster.
+
+.. figure:: ../images/sky-existing-infra-workflow-dark.png
+   :width: 85%
+   :align: center
+   :alt: Deploying SkyPilot on existing machines
+   :class: no-scaled-link, only-dark
+
+   Given a list of IP addresses and SSH keys, ``sky local up`` will install necessary dependencies on the remote machines and configure SkyPilot to run jobs and services on the cluster.
+
+
+.. note::
+
+   Behind the scenes, SkyPilot deploys a lightweight Kubernetes cluster on the remote machines using `k3s `_.
+
+   **Note that no Kubernetes knowledge is required to follow this guide.** SkyPilot abstracts away the complexity of Kubernetes and provides a simple interface to run your jobs and services.
+
+Prerequisites
+-------------
+
+**Local machine (typically your laptop):**
+
+* `kubectl `_
+* `SkyPilot `_
+
+**Remote machines (your cluster, optionally with GPUs):**
+
+* Debian-based OS (tested on Debian 11)
+* SSH access from the local machine to all remote machines with key-based authentication and passwordless sudo (see the quick check below)
+* All machines must use the same SSH key and username
+* All machines must have network access to each other
+* Port 6443 must be accessible on at least one node from your local machine
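+
+As a quick check of the SSH requirements, you can loop over your machines before deploying. This is a minimal sketch, not part of SkyPilot itself; ``username`` and ``~/.ssh/id_rsa`` are placeholders for your actual SSH user and key, and ``ips.txt`` is the IP list created in the next section:
+
+.. code-block:: bash
+
+   # Each machine should print its hostname without prompting for a password,
+   # confirming both key-based SSH login and passwordless sudo.
+   # ssh -n keeps ssh from consuming the rest of the IP list on stdin.
+   while read -r ip; do
+     ssh -n -i ~/.ssh/id_rsa -o BatchMode=yes "username@$ip" 'sudo -n hostname'
+   done < ips.txt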
+   .. code-block:: bash
+
+      IP_FILE=ips.txt
+      SSH_USER=username
+      SSH_KEY=path/to/ssh/key
+      sky local up --ips $IP_FILE --ssh-user $SSH_USER --ssh-key-path $SSH_KEY
+
+   SkyPilot will deploy a Kubernetes cluster on the remote machines, set up GPU support, configure Kubernetes credentials on your local machine, and set up SkyPilot to operate with the new cluster.
+
+   Example output of ``sky local up``:
+
+   .. code-block:: console
+
+      $ sky local up --ips ips.txt --ssh-user gcpuser --ssh-key-path ~/.ssh/id_rsa
+      Found existing kube config. It will be backed up to ~/.kube/config.bak.
+      To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-09-23-18-53-14-165534/local_up.log
+      ✔ K3s successfully deployed on head node.
+      ✔ K3s successfully deployed on worker node.
+      ✔ kubectl configured for the remote cluster.
+      ✔ Remote k3s is running.
+      ✔ Nvidia GPU Operator installed successfully.
+      Cluster deployment done. You can now run tasks on this cluster.
+      E.g., run a task with: sky launch --cloud kubernetes -- echo hello world.
+      🎉 Remote cluster deployed successfully.
+
+
+3. To verify that the cluster is running, run:
+
+   .. code-block:: bash
+
+      sky check kubernetes
+
+   You can now use SkyPilot to launch your :ref:`development clusters ` and :ref:`training jobs ` on your own infrastructure.
+
+   .. code-block:: console
+
+      $ sky show-gpus --cloud kubernetes
+      Kubernetes GPUs
+      GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+      L4    1, 2, 4       12          12
+      H100  1, 2, 4, 8    16          16
+
+      Kubernetes per node GPU availability
+      NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
+      my-cluster-0  L4        4           4
+      my-cluster-1  L4        4           4
+      my-cluster-2  L4        2           2
+      my-cluster-3  L4        2           2
+      my-cluster-4  H100      8           8
+      my-cluster-5  H100      8           8
+
+      $ sky launch --cloud kubernetes --gpus H100:1 -- nvidia-smi
+
+   .. tip::
+
+      You can also use ``kubectl`` to interact with and perform administrative operations on the cluster.
+
+What happens behind the scenes?
+-------------------------------
+
+When you run ``sky local up``, SkyPilot runs the following operations:
+
+1. Install and run the `k3s `_ Kubernetes distribution as a systemd service on the remote machines.
+2. [If GPUs are present] Install the `Nvidia GPU Operator `_ on the newly provisioned k3s cluster. Note that this step does not modify your local nvidia driver/cuda installation, and only runs inside the cluster.
+3. Expose the Kubernetes API server on the head node over port 6443. API calls on this port are secured with a key pair generated by the cluster.
+4. Configure ``kubectl`` on your local machine to connect to the remote cluster.
+
+
+Cleanup
+-------
+
+To clean up all state created by SkyPilot on your machines, use the ``--cleanup`` flag:
+
+.. code-block:: bash
+
+   IP_FILE=ips.txt
+   SSH_USER=username
+   SSH_KEY=path/to/ssh/key
+   sky local up --ips $IP_FILE --ssh-user $SSH_USER --ssh-key-path $SSH_KEY --cleanup
+
+This will stop all Kubernetes services on the remote machines.
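A quick way to sanity-check the port 6443 prerequisite before running ``sky local up`` is to probe the head node from your local machine. A minimal sketch, assuming the first IP in ``ips.txt`` is the head node; this helper is illustrative and not part of SkyPilot:

```python
# Illustrative pre-flight check (not part of SkyPilot): verify that the head
# node's Kubernetes API port (6443) is reachable from the local machine.
import socket
import sys


def check_head_node_port(ips_file: str = 'ips.txt', port: int = 6443) -> None:
    with open(ips_file, encoding='utf-8') as f:
        head_node = f.readline().strip()  # The first line is the head node.
    try:
        with socket.create_connection((head_node, port), timeout=5):
            print(f'Port {port} on head node {head_node} is reachable.')
    except OSError as e:
        sys.exit(f'Cannot reach {head_node}:{port}: {e}')


if __name__ == '__main__':
    check_head_node_port()
```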
diff --git a/docs/source/reservations/reservations.rst b/docs/source/reservations/reservations.rst
index 8d0625846f7..800a34c802c 100644
--- a/docs/source/reservations/reservations.rst
+++ b/docs/source/reservations/reservations.rst
@@ -204,5 +204,5 @@ Unlike short-term reservations above, long-term reservations are typically more

 SkyPilot supports long-term reservations and on-premise clusters through Kubernetes, i.e., you can set up a Kubernetes cluster on top of your reserved resources and interact with them through SkyPilot.

-See the simple steps to set up a Kubernetes cluster on existing machines in :ref:`kubernetes-overview`.
+See the simple steps to set up a Kubernetes cluster on existing machines in :ref:`Using Existing Machines ` or :ref:`bring your existing Kubernetes cluster `.
diff --git a/sky/cli.py b/sky/cli.py
index f334a4181b8..c538c99aeb3 100644
--- a/sky/cli.py
+++ b/sky/cli.py
@@ -5072,15 +5072,7 @@ def local():
     pass


-@click.option('--gpus/--no-gpus',
-              default=True,
-              is_flag=True,
-              help='Launch cluster without GPU support even '
-              'if GPUs are detected on the host.')
-@local.command('up', cls=_DocumentedCodeCommand)
-@usage_lib.entrypoint
-def local_up(gpus: bool):
-    """Creates a local cluster."""
+def _deploy_local_cluster(gpus: bool):
     cluster_created = False

     # Check if GPUs are available on the host
@@ -5206,6 +5198,124 @@ def local_up(gpus: bool):
                f'{gpu_hint}')


+def _deploy_remote_cluster(ip_file: str, ssh_user: str, ssh_key_path: str,
+                           cleanup: bool):
+    success = False
+    path_to_package = os.path.dirname(os.path.dirname(__file__))
+    up_script_path = os.path.join(path_to_package, 'sky/utils/kubernetes',
+                                  'deploy_remote_cluster.sh')
+    # Get directory of script and run it from there
+    cwd = os.path.dirname(os.path.abspath(up_script_path))
+
+    deploy_command = f'{up_script_path} {ip_file} {ssh_user} {ssh_key_path}'
+    if cleanup:
+        deploy_command += ' --cleanup'
+
+    # Convert the command to a format suitable for subprocess
+    deploy_command = shlex.split(deploy_command)
+
+    # Setup logging paths
+    run_timestamp = backend_utils.get_run_timestamp()
+    log_path = os.path.join(constants.SKY_LOGS_DIRECTORY, run_timestamp,
+                            'local_up.log')
+    tail_cmd = 'tail -n100 -f ' + log_path
+
+    # Check if ~/.kube/config exists:
+    if os.path.exists(os.path.expanduser('~/.kube/config')):
+        click.echo('Found existing kube config. '
+                   'It will be backed up to ~/.kube/config.bak.')
+    style = colorama.Style
+    click.echo('To view detailed progress: '
+               f'{style.BRIGHT}{tail_cmd}{style.RESET_ALL}')
+    if cleanup:
+        msg_str = 'Cleaning up remote cluster...'
+    else:
+        msg_str = 'Deploying remote cluster...'
+    with rich_utils.safe_status(f'[bold cyan]{msg_str}'):
+        returncode, _, stderr = log_lib.run_with_log(
+            cmd=deploy_command,
+            log_path=log_path,
+            require_outputs=True,
+            stream_logs=False,
+            line_processor=log_utils.SkyRemoteUpLineProcessor(),
+            cwd=cwd)
+        if returncode == 0:
+            success = True
+        else:
+            with ux_utils.print_exception_no_traceback():
+                raise RuntimeError(
+                    'Failed to deploy remote cluster. '
+                    f'Full log: {log_path}'
+                    f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}')
+
+    if success:
+        if cleanup:
+            click.echo(f'{colorama.Fore.GREEN}'
+                       '🎉 Remote cluster cleaned up successfully.'
+                       f'{style.RESET_ALL}')
+        else:
+            click.echo('Cluster deployment done. You can now run tasks on '
+                       'this cluster.\nE.g., run a task with: '
+                       'sky launch --cloud kubernetes -- echo hello world.'
+                       f'\n{colorama.Fore.GREEN}🎉 Remote cluster deployed '
+                       f'successfully.
{style.RESET_ALL}') + + +@click.option('--gpus/--no-gpus', + default=True, + is_flag=True, + help='Launch cluster without GPU support even ' + 'if GPUs are detected on the host.') +@click.option( + '--ips', + type=str, + required=False, + help='Path to the file containing IP addresses of remote machines.') +@click.option('--ssh-user', + type=str, + required=False, + help='SSH username for accessing remote machines.') +@click.option('--ssh-key-path', + type=str, + required=False, + help='Path to the SSH private key.') +@click.option('--cleanup', + is_flag=True, + help='Clean up the remote cluster instead of deploying it.') +@local.command('up', cls=_DocumentedCodeCommand) +@usage_lib.entrypoint +def local_up(gpus: bool, ips: str, ssh_user: str, ssh_key_path: str, + cleanup: bool): + """Creates a local or remote cluster.""" + + def _validate_args(ips, ssh_user, ssh_key_path, cleanup): + # If any of --ips, --ssh-user, or --ssh-key-path is specified, + # all must be specified + if bool(ips) or bool(ssh_user) or bool(ssh_key_path): + if not (ips and ssh_user and ssh_key_path): + raise click.BadParameter( + 'All --ips, --ssh-user, and --ssh-key-path ' + 'must be specified together.') + + # --cleanup can only be used if --ips, --ssh-user and --ssh-key-path + # are all provided + if cleanup and not (ips and ssh_user and ssh_key_path): + raise click.BadParameter('--cleanup can only be used with ' + '--ips, --ssh-user and --ssh-key-path.') + + _validate_args(ips, ssh_user, ssh_key_path, cleanup) + + # If remote deployment arguments are specified, run remote up script + if ips and ssh_user and ssh_key_path: + # Convert ips and ssh_key_path to absolute paths + ips = os.path.abspath(ips) + ssh_key_path = os.path.abspath(ssh_key_path) + _deploy_remote_cluster(ips, ssh_user, ssh_key_path, cleanup) + else: + # Run local deployment (kind) if no remote args are specified + _deploy_local_cluster(gpus) + + @local.command('down', cls=_DocumentedCodeCommand) @usage_lib.entrypoint def local_down(): diff --git a/sky/utils/kubernetes/deploy_remote_cluster.sh b/sky/utils/kubernetes/deploy_remote_cluster.sh new file mode 100755 index 00000000000..94736474289 --- /dev/null +++ b/sky/utils/kubernetes/deploy_remote_cluster.sh @@ -0,0 +1,243 @@ +#!/bin/bash +# Refer to https://skypilot.readthedocs.io/en/latest/reservations/existing-machines.html for details on how to use this script. +set -e + +# Colors for nicer UX +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' # No color + +# Variables +IPS_FILE=$1 +USER=$2 +SSH_KEY=$3 +K3S_TOKEN=mytoken # Any string can be used as the token +CLEANUP=false +INSTALL_GPU=false + +if [[ "$4" == "--cleanup" ]]; then + CLEANUP=true +fi + +# Basic argument checks +if [ -z "$IPS_FILE" ] || [ -z "$USER" ] || [ -z "$SSH_KEY" ]; then + >&2 echo -e "${RED}Error: Missing required arguments.${NC}" + >&2 echo "Usage: ./deploy_remote_cluster.sh ips.txt username path/to/ssh/key [--cleanup]" + exit 1 +fi + +# Check if SSH key exists +if [ ! -f "$SSH_KEY" ]; then + >&2 echo -e "${RED}Error: SSH key not found: $SSH_KEY${NC}" + exit 1 +fi + +# Check if IPs file exists +if [ ! 
-f "$IPS_FILE" ]; then + >&2 echo -e "${RED}Error: IPs file not found: $IPS_FILE${NC}" + exit 1 +fi + +# Get head node and worker nodes from the IPs file +HEAD_NODE=$(head -n 1 "$IPS_FILE") +WORKER_NODES=$(tail -n +2 "$IPS_FILE") + +# Check if the IPs file is empty or not formatted correctly +if [ -z "$HEAD_NODE" ]; then + >&2 echo -e "${RED}Error: IPs file is empty or not formatted correctly.${NC}" + exit 1 +fi + +# Function to show a progress message +progress_message() { + echo -e "${YELLOW}➜ $1${NC}" +} + +# Step to display success +success_message() { + echo -e "${GREEN}βœ” $1${NC}" +} + +# Function to run a command on a remote machine via SSH +run_remote() { + local NODE_IP=$1 + local CMD=$2 + # echo -e "${YELLOW}Running command on $NODE_IP...${NC}" + ssh -o StrictHostKeyChecking=no -i "$SSH_KEY" "$USER@$NODE_IP" "$CMD" +} + +# Function to uninstall k3s and clean up the state on a remote machine +cleanup_server_node() { + local NODE_IP=$1 + echo -e "${YELLOW}Cleaning up head node $NODE_IP...${NC}" + run_remote "$NODE_IP" " + echo 'Uninstalling k3s...' && + /usr/local/bin/k3s-uninstall.sh || true && + sudo rm -rf /etc/rancher /var/lib/rancher /var/lib/kubelet /etc/kubernetes ~/.kube + " + echo -e "${GREEN}Node $NODE_IP cleaned up successfully.${NC}" +} + +# Function to uninstall k3s and clean up the state on a remote machine +cleanup_agent_node() { + local NODE_IP=$1 + echo -e "${YELLOW}Cleaning up node $NODE_IP...${NC}" + run_remote "$NODE_IP" " + echo 'Uninstalling k3s...' && + /usr/local/bin/k3s-agent-uninstall.sh || true && + sudo rm -rf /etc/rancher /var/lib/rancher /var/lib/kubelet /etc/kubernetes ~/.kube + " + echo -e "${GREEN}Node $NODE_IP cleaned up successfully.${NC}" +} + +check_gpu() { + local NODE_IP=$1 + run_remote "$NODE_IP" " + if command -v nvidia-smi &> /dev/null; then + nvidia-smi --list-gpus | grep 'GPU 0' + fi + " +} + +# Pre-flight checks +run_remote "$HEAD_NODE" "echo 'SSH connection successful'" +# TODO: Add more pre-flight checks here, including checking if port 6443 is accessible + +# If --cleanup flag is set, uninstall k3s and exit +if [ "$CLEANUP" == "true" ]; then + echo -e "${YELLOW}Starting cleanup...${NC}" + + # Clean up head node + cleanup_server_node "$HEAD_NODE" + + # Clean up worker nodes + for NODE in $WORKER_NODES; do + cleanup_agent_node "$NODE" + done + + echo -e "${GREEN}Cleanup completed successfully.${NC}" + exit 0 +fi + +# Step 1: Install k3s on the head node +progress_message "Deploying Kubernetes on head node ($HEAD_NODE)..." +run_remote "$HEAD_NODE" " + curl -sfL https://get.k3s.io | K3S_TOKEN=$K3S_TOKEN sh - && + mkdir -p ~/.kube && + sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && + sudo chown \$(id -u):\$(id -g) ~/.kube/config && + for i in {1..3}; do + if kubectl wait --for=condition=ready node --all --timeout=2m --kubeconfig ~/.kube/config; then + break + else + echo 'Waiting for nodes to be ready...' + sleep 5 + fi + done + if [ $i -eq 3 ]; then + echo 'Failed to wait for nodes to be ready after 3 attempts' + exit 1 + fi" +success_message "K3s deployed on head node." 

# Check if head node has a GPU
if check_gpu "$HEAD_NODE"; then
  echo -e "${YELLOW}GPU detected on head node ($HEAD_NODE).${NC}"
  INSTALL_GPU=true
fi

# Fetch the head node's internal IP (this will be passed to worker nodes)
MASTER_ADDR=$(run_remote "$HEAD_NODE" "hostname -I | awk '{print \$1}'")

echo -e "${GREEN}Master node internal IP: $MASTER_ADDR${NC}"

# Step 2: Install k3s on worker nodes and join them to the master node
for NODE in $WORKER_NODES; do
  progress_message "Deploying Kubernetes on worker node ($NODE)..."
  run_remote "$NODE" "
    curl -sfL https://get.k3s.io | K3S_URL=https://$MASTER_ADDR:6443 K3S_TOKEN=$K3S_TOKEN sh -"
  success_message "Kubernetes deployed on worker node ($NODE)."

  # Check if worker node has a GPU
  if check_gpu "$NODE"; then
    echo -e "${YELLOW}GPU detected on worker node ($NODE).${NC}"
    INSTALL_GPU=true
  fi
done

# Step 3: Configure local kubectl to connect to the cluster
progress_message "Configuring local kubectl to connect to the cluster..."

# Back up the original kubeconfig file, if it exists, before it is
# overwritten by the remote cluster's config
KUBECONFIG_FILE="$HOME/.kube/config"
if [[ -f "$KUBECONFIG_FILE" ]]; then
  echo "Backing up existing kubeconfig to $KUBECONFIG_FILE.bak"
  cp "$KUBECONFIG_FILE" "$KUBECONFIG_FILE.bak"
fi

scp -o StrictHostKeyChecking=no -i "$SSH_KEY" "$USER@$HEAD_NODE":~/.kube/config ~/.kube/config

# Update kubeconfig for the local machine to use the master node's IP
# Temporary file to hold the modified kubeconfig
TEMP_FILE=$(mktemp)

# Remove the certificate-authority-data, and replace the server with the master address
awk '
  BEGIN { in_cluster = 0 }
  /^clusters:/ { in_cluster = 1 }
  /^users:/ { in_cluster = 0 }
  in_cluster && /^ *certificate-authority-data:/ { next }
  in_cluster && /^ *server:/ {
    print "    server: https://'${HEAD_NODE}:6443'"
    print "    insecure-skip-tls-verify: true"
    next
  }
  { print }
' "$KUBECONFIG_FILE" > "$TEMP_FILE"

# Replace the original kubeconfig with the modified one
mv "$TEMP_FILE" "$KUBECONFIG_FILE"

success_message "kubectl configured to connect to the cluster."

echo "Cluster deployment completed. You can now run 'kubectl get nodes' to verify the setup."

# Install GPU operator if a GPU was detected on any node
if [ "$INSTALL_GPU" == "true" ]; then
  echo -e "${YELLOW}GPU detected in the cluster. Installing Nvidia GPU Operator...${NC}"
  run_remote "$HEAD_NODE" "
    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 &&
    chmod 700 get_helm.sh &&
    ./get_helm.sh &&
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update &&
    kubectl create namespace gpu-operator --kubeconfig ~/.kube/config || true &&
    sudo ln -s /sbin/ldconfig /sbin/ldconfig.real || true &&
    helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
      --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
      --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
      --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
      --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
      --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
      --set 'toolkit.env[2].value=nvidia' &&
    echo 'Waiting for GPU operator installation...' &&
    while ! kubectl describe nodes --kubeconfig ~/.kube/config | grep -q 'nvidia.com/gpu:'; do
      echo 'Waiting for GPU operator...'
      sleep 5
    done
    echo 'GPU operator installed successfully.'"
  success_message "GPU Operator installed."
else
  echo -e "${YELLOW}No GPUs detected. Skipping GPU Operator installation.${NC}"
fi

# Configure SkyPilot
progress_message "Configuring SkyPilot..."
sky check kubernetes
success_message "SkyPilot configured successfully."

# Display final success message
echo -e "${GREEN}==== 🎉 Kubernetes cluster deployment completed successfully 🎉 ====${NC}"
echo "You can now interact with your Kubernetes cluster through SkyPilot: "
echo "  • List available GPUs: sky show-gpus --cloud kubernetes"
echo "  • Launch a GPU development pod: sky launch -c devbox --cloud kubernetes --gpus A100:1"
echo "  • Connect to pod with SSH: ssh devbox"
echo "  • Connect to pod with VSCode: code --remote ssh-remote+devbox '/'"
diff --git a/sky/utils/log_utils.py b/sky/utils/log_utils.py
index 90928b8014d..8f7a152392e 100644
--- a/sky/utils/log_utils.py
+++ b/sky/utils/log_utils.py
@@ -1,6 +1,7 @@
 """Logging utils."""
 import enum
-from typing import List, Optional
+import types
+from typing import List, Optional, Type

 import colorama
 import pendulum
@@ -15,13 +16,15 @@ class LineProcessor(object):
     """A processor for log lines."""

-    def __enter__(self):
+    def __enter__(self) -> None:
         pass

-    def process_line(self, log_line):
+    def process_line(self, log_line: str) -> None:
         pass

-    def __exit__(self, except_type, except_value, traceback):
+    def __exit__(self, except_type: Optional[Type[BaseException]],
+                 except_value: Optional[BaseException],
+                 traceback: Optional[types.TracebackType]) -> None:
         del except_type, except_value, traceback  # unused
         pass

@@ -34,12 +37,12 @@ class ProvisionStatus(enum.Enum):
         RUNTIME_SETUP = 1
         PULLING_DOCKER_IMAGES = 2

-    def __enter__(self):
+    def __enter__(self) -> None:
         self.state = self.ProvisionStatus.LAUNCH
         self.status_display = rich_utils.safe_status('[bold cyan]Launching')
         self.status_display.start()

-    def process_line(self, log_line):
+    def process_line(self, log_line: str) -> None:
         if ('Success.' in log_line and
                 self.state == self.ProvisionStatus.LAUNCH):
             logger.info(f'{colorama.Fore.GREEN}Head node is up.'
@@ -60,7 +63,9 @@ def process_line(self, log_line):
                 '[bold cyan]Launching - Preparing SkyPilot runtime')
             self.state = self.ProvisionStatus.RUNTIME_SETUP

-    def __exit__(self, except_type, except_value, traceback):
+    def __exit__(self, except_type: Optional[Type[BaseException]],
+                 except_value: Optional[BaseException],
+                 traceback: Optional[types.TracebackType]) -> None:
         del except_type, except_value, traceback  # unused
         self.status_display.stop()

@@ -68,13 +73,13 @@ def __exit__(self, except_type, except_value, traceback):
 class SkyLocalUpLineProcessor(LineProcessor):
     """A processor for `sky local up` log lines."""

-    def __enter__(self):
+    def __enter__(self) -> None:
         status = rich_utils.safe_status('[bold cyan]Creating local cluster - '
                                         'initializing Kubernetes')
         self.status_display = status
         self.status_display.start()

-    def process_line(self, log_line):
+    def process_line(self, log_line: str) -> None:
         if 'Kind cluster created.' in log_line:
             logger.info(f'{colorama.Fore.GREEN}Kubernetes is running.'
                         f'{colorama.Style.RESET_ALL}')
@@ -124,7 +129,80 @@ def process_line(self, log_line):
                 f'{colorama.Fore.GREEN}Nginx Ingress Controller installed.'
                 f'{colorama.Style.RESET_ALL}')

-    def __exit__(self, except_type, except_value, traceback):
+    def __exit__(self, except_type: Optional[Type[BaseException]],
+                 except_value: Optional[BaseException],
+                 traceback: Optional[types.TracebackType]) -> None:
         del except_type, except_value, traceback  # unused
         self.status_display.stop()
+
+
+class SkyRemoteUpLineProcessor(LineProcessor):
+    """A processor for deploy_remote_cluster.sh log lines."""
+
+    def __enter__(self) -> None:
+        status = rich_utils.safe_status('[bold cyan]Creating remote cluster')
+        self.status_display = status
+        self.status_display.start()
+
+    def process_line(self, log_line: str) -> None:
+        # Pre-flight checks
+        if 'SSH connection successful' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}SSH connection established.'
+                        f'{colorama.Style.RESET_ALL}')
+
+        # Kubernetes installation steps
+        if 'Deploying Kubernetes on head node' in log_line:
+            self.status_display.update('[bold cyan]Creating remote cluster - '
+                                       'deploying Kubernetes on head node')
+        if 'K3s deployed on head node.' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}'
+                        '✔ K3s successfully deployed on head node.'
+                        f'{colorama.Style.RESET_ALL}')
+
+        # Worker nodes
+        if 'Deploying Kubernetes on worker node' in log_line:
+            self.status_display.update('[bold cyan]Creating remote cluster - '
+                                       'deploying Kubernetes on worker nodes')
+        if 'Kubernetes deployed on worker node' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}'
+                        '✔ K3s successfully deployed on worker node.'
+                        f'{colorama.Style.RESET_ALL}')
+
+        # Cluster configuration
+        if 'Configuring local kubectl to connect to the cluster...' in log_line:
+            self.status_display.update('[bold cyan]Creating remote cluster - '
+                                       'configuring local kubectl')
+        if 'kubectl configured to connect to the cluster.' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}'
+                        '✔ kubectl configured for the remote cluster.'
+                        f'{colorama.Style.RESET_ALL}')
+
+        # GPU operator installation
+        if 'Installing Nvidia GPU Operator...' in log_line:
+            self.status_display.update('[bold cyan]Creating remote cluster - '
+                                       'installing Nvidia GPU Operator')
+        if 'GPU Operator installed.' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}'
+                        '✔ Nvidia GPU Operator installed successfully.'
+                        f'{colorama.Style.RESET_ALL}')
+
+        # Cleanup steps
+        if 'Cleaning up head node' in log_line:
+            self.status_display.update('[bold cyan]Cleaning up head node')
+        if 'Cleaning up node' in log_line:
+            self.status_display.update('[bold cyan]Cleaning up worker node')
+        if 'cleaned up successfully' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}'
+                        f'{log_line.strip()}{colorama.Style.RESET_ALL}')
+
+        # Final status
+        if 'Cluster deployment completed.' in log_line:
+            logger.info(f'{colorama.Fore.GREEN}✔ Remote k3s is running.'
+                        f'{colorama.Style.RESET_ALL}')
+
+    def __exit__(self, except_type: Optional[Type[BaseException]],
+                 except_value: Optional[BaseException],
+                 traceback: Optional[types.TracebackType]) -> None:
+        del except_type, except_value, traceback  # unused
+        self.status_display.stop()
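For context on how these processors are consumed: ``_deploy_remote_cluster`` above passes ``SkyRemoteUpLineProcessor`` to ``log_lib.run_with_log``, which enters the processor's context and feeds it one output line at a time. A minimal sketch of that contract, with a hand-written list standing in for the script's streamed output (the real driver is the subprocess loop, not this code):

```python
# Illustrative sketch of the LineProcessor contract shown above.
from sky.utils import log_utils

fake_script_output = [
    'SSH connection successful',
    'Deploying Kubernetes on head node (192.168.1.1)...',
    'K3s deployed on head node.',
    'Cluster deployment completed.',
]

processor = log_utils.SkyRemoteUpLineProcessor()
with processor:  # __enter__ starts the rich status spinner.
    for line in fake_script_output:
        processor.process_line(line)  # Updates the spinner / logs milestones.
# __exit__ stops the spinner, even if an exception was raised.
```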
From e6a3b830fb2a12871815773af6171d42e0416e89 Mon Sep 17 00:00:00 2001
From: Romil Bhardwaj 
Date: Sat, 28 Sep 2024 23:04:52 -0700
Subject: [PATCH 26/93] [k8s] Fix incluster auth after multi-context support
 (#4014)

* Make incluster auth work
* lint
* rename
* rename
* pop allowed_contexts from config
* lint
* comments
* comments
* lint
---
 sky/authentication.py                     |  5 ++
 sky/clouds/kubernetes.py                  | 42 +++++++++++---
 sky/provision/kubernetes/config.py        |  9 +--
 sky/provision/kubernetes/instance.py      | 13 +++--
 sky/provision/kubernetes/network_utils.py | 15 ++---
 sky/provision/kubernetes/utils.py         | 69 +++++++++++++++++------
 sky/utils/command_runner.py               |  2 +-
 sky/utils/command_runner.pyi              |  2 +-
 sky/utils/controller_utils.py             |  8 +++
 9 files changed, 122 insertions(+), 43 deletions(-)

diff --git a/sky/authentication.py b/sky/authentication.py
index 67b4bcd576f..eb51aad02ad 100644
--- a/sky/authentication.py
+++ b/sky/authentication.py
@@ -380,6 +380,11 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
     secret_field_name = clouds.Kubernetes().ssh_key_secret_field_name
     context = config['provider'].get(
         'context', kubernetes_utils.get_current_kube_config_context_name())
+    if context == kubernetes_utils.IN_CLUSTER_REGION:
+        # If the context is set to IN_CLUSTER_REGION, we are running in a pod
+        # with in-cluster configuration. We need to set the context to None
+        # to use the mounted service account.
+        context = None
     namespace = config['provider'].get(
         'namespace',
         kubernetes_utils.get_kube_config_context_namespace(context))
diff --git a/sky/clouds/kubernetes.py b/sky/clouds/kubernetes.py
index 2c1e753bccf..da85246e9ea 100644
--- a/sky/clouds/kubernetes.py
+++ b/sky/clouds/kubernetes.py
@@ -129,11 +129,24 @@ def _log_skipped_contexts_once(cls, skipped_contexts: Tuple[str,
                        'Ignoring these contexts.')

     @classmethod
-    def _existing_allowed_contexts(cls) -> List[str]:
-        """Get existing allowed contexts."""
+    def _existing_allowed_contexts(cls) -> List[Optional[str]]:
+        """Get existing allowed contexts.
+
+        If None is returned in the list, it means that we are running in a pod
+        with in-cluster auth. In this case, we specify None context, which will
+        use the service account mounted in the pod.
+        """
         all_contexts = kubernetes_utils.get_all_kube_config_context_names()
-        if all_contexts is None:
+        if len(all_contexts) == 0:
             return []
+        if all_contexts == [None]:
+            # If only one context is found and it is None, we are running in a
+            # pod with in-cluster auth. In this case, we allow it to be used
+            # without checking against allowed_contexts.
+            # TODO(romilb): We may want to check in-cluster auth against
+            # allowed_contexts in the future by adding a special context name
+            # for in-cluster auth.
+            return [None]

         all_contexts = set(all_contexts)

         allowed_contexts = skypilot_config.get_nested(
@@ -164,7 +177,15 @@ def regions_with_offering(cls, instance_type: Optional[str],
         del accelerators, zone, use_spot  # unused
         existing_contexts = cls._existing_allowed_contexts()

-        regions = [clouds.Region(context) for context in existing_contexts]
+        regions = []
+        for context in existing_contexts:
+            if context is None:
+                # If running in-cluster, we allow the region to be set to the
+                # singleton region since there is no context name available.
+                regions.append(clouds.Region(
+                    kubernetes_utils.IN_CLUSTER_REGION))
+            else:
+                regions.append(clouds.Region(context))

         if region is not None:
             regions = [r for r in regions if r.name == region]
@@ -541,13 +562,20 @@ def instance_type_exists(self, instance_type: str) -> bool:
     def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
         if region == self._LEGACY_SINGLETON_REGION:
             # For backward compatibility, we allow the region to be set to the
-            # legacy singletonton region.
+            # legacy singleton region.
             # TODO: Remove this after 0.9.0.
             return region, zone

+        if region == kubernetes_utils.IN_CLUSTER_REGION:
+            # If running in-cluster, we set the region to IN_CLUSTER_REGION
+            # since there is no context name available.
+            return region, zone
+
         all_contexts = kubernetes_utils.get_all_kube_config_context_names()
-        if all_contexts is None:
-            all_contexts = []
+        if all_contexts == [None]:
+            # If [None] context is returned, use the singleton region since we
+            # are running in a pod with in-cluster auth.
+            all_contexts = [kubernetes_utils.IN_CLUSTER_REGION]

         if region not in all_contexts:
             raise ValueError(
                 f'Context {region} not found in kubeconfig. Kubernetes only '
diff --git a/sky/provision/kubernetes/config.py b/sky/provision/kubernetes/config.py
index e377f3029b8..370430720f0 100644
--- a/sky/provision/kubernetes/config.py
+++ b/sky/provision/kubernetes/config.py
@@ -247,7 +247,8 @@ def _get_resource(container_resources: Dict[str, Any], resource_name: str,

 def _configure_autoscaler_service_account(
-        namespace: str, context: str, provider_config: Dict[str, Any]) -> None:
+        namespace: str, context: Optional[str],
+        provider_config: Dict[str, Any]) -> None:
     account_field = 'autoscaler_service_account'
     if account_field not in provider_config:
         logger.info('_configure_autoscaler_service_account: '
@@ -281,7 +282,7 @@ def _configure_autoscaler_service_account(
                 f'{created_msg(account_field, name)}')


-def _configure_autoscaler_role(namespace: str, context: str,
+def _configure_autoscaler_role(namespace: str, context: Optional[str],
                                provider_config: Dict[str, Any],
                                role_field: str) -> None:
     """
     Reads the role from the provider config, creates if it does not exist.
@@ -330,7 +331,7 @@ def _configure_autoscaler_role(namespace: str, context: str, def _configure_autoscaler_role_binding( namespace: str, - context: str, + context: Optional[str], provider_config: Dict[str, Any], binding_field: str, override_name: Optional[str] = None, @@ -620,7 +621,7 @@ def _configure_fuse_mounting(provider_config: Dict[str, Any]) -> None: f'in namespace {fuse_device_manager_namespace!r}') -def _configure_services(namespace: str, context: str, +def _configure_services(namespace: str, context: Optional[str], provider_config: Dict[str, Any]) -> None: service_field = 'services' if service_field not in provider_config: diff --git a/sky/provision/kubernetes/instance.py b/sky/provision/kubernetes/instance.py index f9ee75e466b..8da13d5ad0f 100644 --- a/sky/provision/kubernetes/instance.py +++ b/sky/provision/kubernetes/instance.py @@ -302,7 +302,8 @@ def _check_init_containers(pod): time.sleep(1) -def _set_env_vars_in_pods(namespace: str, context: str, new_pods: List): +def _set_env_vars_in_pods(namespace: str, context: Optional[str], + new_pods: List): """Setting environment variables in pods. Once all containers are ready, we can exec into them and set env vars. @@ -330,7 +331,7 @@ def _set_env_vars_in_pods(namespace: str, context: str, new_pods: List): new_pod.metadata.name, rc, stdout) -def _check_user_privilege(namespace: str, context: str, +def _check_user_privilege(namespace: str, context: Optional[str], new_nodes: List) -> None: # Checks if the default user has sufficient privilege to set up # the kubernetes instance pod. @@ -366,7 +367,8 @@ def _check_user_privilege(namespace: str, context: str, 'from the image.') -def _setup_ssh_in_pods(namespace: str, context: str, new_nodes: List) -> None: +def _setup_ssh_in_pods(namespace: str, context: Optional[str], + new_nodes: List) -> None: # Setting up ssh for the pod instance. This is already setup for # the jump pod so it does not need to be run for it. 
set_k8s_ssh_cmd = ( @@ -410,7 +412,7 @@ def _setup_ssh_in_pods(namespace: str, context: str, new_nodes: List) -> None: logger.info(f'{"-"*20}End: Set up SSH in pod {pod_name!r} {"-"*20}') -def _label_pod(namespace: str, context: str, pod_name: str, +def _label_pod(namespace: str, context: Optional[str], pod_name: str, label: Dict[str, str]) -> None: """Label a pod.""" kubernetes.core_api(context).patch_namespaced_pod( @@ -647,7 +649,8 @@ def stop_instances( raise NotImplementedError() -def _terminate_node(namespace: str, context: str, pod_name: str) -> None: +def _terminate_node(namespace: str, context: Optional[str], + pod_name: str) -> None: """Terminate a pod.""" logger.debug('terminate_instances: calling delete_namespaced_pod') try: diff --git a/sky/provision/kubernetes/network_utils.py b/sky/provision/kubernetes/network_utils.py index a1d919a6766..b16482e5072 100644 --- a/sky/provision/kubernetes/network_utils.py +++ b/sky/provision/kubernetes/network_utils.py @@ -132,7 +132,7 @@ def fill_ingress_template(namespace: str, service_details: List[Tuple[str, int, def create_or_replace_namespaced_ingress( - namespace: str, context: str, ingress_name: str, + namespace: str, context: Optional[str], ingress_name: str, ingress_spec: Dict[str, Union[str, int]]) -> None: """Creates an ingress resource for the specified service.""" networking_api = kubernetes.networking_api(context) @@ -156,7 +156,7 @@ def create_or_replace_namespaced_ingress( _request_timeout=kubernetes.API_TIMEOUT) -def delete_namespaced_ingress(namespace: str, context: str, +def delete_namespaced_ingress(namespace: str, context: Optional[str], ingress_name: str) -> None: """Deletes an ingress resource.""" networking_api = kubernetes.networking_api(context) @@ -171,7 +171,7 @@ def delete_namespaced_ingress(namespace: str, context: str, def create_or_replace_namespaced_service( - namespace: str, context: str, service_name: str, + namespace: str, context: Optional[str], service_name: str, service_spec: Dict[str, Union[str, int]]) -> None: """Creates a service resource for the specified service.""" core_api = kubernetes.core_api(context) @@ -208,7 +208,7 @@ def delete_namespaced_service(namespace: str, service_name: str) -> None: raise e -def ingress_controller_exists(context: str, +def ingress_controller_exists(context: Optional[str], ingress_class_name: str = 'nginx') -> bool: """Checks if an ingress controller exists in the cluster.""" networking_api = kubernetes.networking_api(context) @@ -220,7 +220,7 @@ def ingress_controller_exists(context: str, def get_ingress_external_ip_and_ports( - context: str, + context: Optional[str], namespace: str = 'ingress-nginx' ) -> Tuple[Optional[str], Optional[Tuple[int, int]]]: """Returns external ip and ports for the ingress controller.""" @@ -258,7 +258,7 @@ def get_ingress_external_ip_and_ports( return external_ip, None -def get_loadbalancer_ip(context: str, +def get_loadbalancer_ip(context: Optional[str], namespace: str, service_name: str, timeout: int = 0) -> Optional[str]: @@ -284,7 +284,8 @@ def get_loadbalancer_ip(context: str, return ip -def get_pod_ip(context: str, namespace: str, pod_name: str) -> Optional[str]: +def get_pod_ip(context: Optional[str], namespace: str, + pod_name: str) -> Optional[str]: """Returns the IP address of the pod.""" core_api = kubernetes.core_api(context) pod = core_api.read_namespaced_pod(pod_name, diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index f31652030a5..0498cc7f59f 100644 --- 
a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -33,6 +33,7 @@ # TODO(romilb): Move constants to constants.py DEFAULT_NAMESPACE = 'default' +IN_CLUSTER_REGION = 'in-cluster' DEFAULT_SERVICE_ACCOUNT_NAME = 'skypilot-service-account' @@ -310,7 +311,7 @@ class KarpenterLabelFormatter(SkyPilotLabelFormatter): @functools.lru_cache() def detect_gpu_label_formatter( - context: str + context: Optional[str] ) -> Tuple[Optional[GPULabelFormatter], Dict[str, List[Tuple[str, str]]]]: """Detects the GPU label formatter for the Kubernetes cluster @@ -342,7 +343,7 @@ def detect_gpu_label_formatter( @functools.lru_cache(maxsize=10) -def detect_gpu_resource(context: str) -> Tuple[bool, Set[str]]: +def detect_gpu_resource(context: Optional[str]) -> Tuple[bool, Set[str]]: """Checks if the Kubernetes cluster has nvidia.com/gpu resource. If nvidia.com/gpu resource is missing, that typically means that the @@ -402,7 +403,7 @@ def get_all_pods_in_kubernetes_cluster( return pods -def check_instance_fits(context: str, +def check_instance_fits(context: Optional[str], instance: str) -> Tuple[bool, Optional[str]]: """Checks if the instance fits on the Kubernetes cluster. @@ -488,7 +489,7 @@ def check_cpu_mem_fits(candidate_instance_type: 'KubernetesInstanceType', return fits, reason -def get_gpu_label_key_value(context: str, +def get_gpu_label_key_value(context: Optional[str], acc_type: str, check_mode=False) -> Tuple[str, str]: """Returns the label key and value for the given GPU type. @@ -651,11 +652,14 @@ def get_external_ip(network_mode: Optional[ return parsed_url.hostname -def check_credentials(context: str, timeout: int = kubernetes.API_TIMEOUT) -> \ +def check_credentials(context: Optional[str], + timeout: int = kubernetes.API_TIMEOUT) -> \ Tuple[bool, Optional[str]]: """Check if the credentials in kubeconfig file are valid Args: + context (Optional[str]): The Kubernetes context to use. If none, uses + in-cluster auth to check credentials, if available. timeout (int): Timeout in seconds for the test API call Returns: @@ -817,22 +821,42 @@ def get_current_kube_config_context_name() -> Optional[str]: return None -def get_all_kube_config_context_names() -> Optional[List[str]]: +def is_incluster_config_available() -> bool: + """Check if in-cluster auth is available. + + Note: We cannot use load_incluster_config() to check if in-cluster config + is available because it will load the in-cluster config (if available) + and modify the current global kubernetes config. We simply check if the + service account token file exists to determine if in-cluster config may + be available. + """ + return os.path.exists('/var/run/secrets/kubernetes.io/serviceaccount/token') + + +def get_all_kube_config_context_names() -> List[Optional[str]]: """Get all kubernetes context names from the kubeconfig file. + If running in-cluster, returns [None] to indicate in-cluster config. + We should not cache the result of this function as the admin policy may update the contexts. Returns: - List[str] | None: The list of kubernetes context names if it exists, - None otherwise + List[Optional[str]]: The list of kubernetes context names if + available, an empty list otherwise. If running in-cluster, + returns [None] to indicate in-cluster config. """ k8s = kubernetes.kubernetes try: all_contexts, _ = k8s.config.list_kube_config_contexts() + # all_contexts will always have at least one context. If kubeconfig + # does not have any contexts defined, it will raise ConfigException. 
return [context['name'] for context in all_contexts] except k8s.config.config_exception.ConfigException: - return None + # If running in cluster, return [None] to indicate in-cluster config + if is_incluster_config_available(): + return [None] + return [] @functools.lru_cache() @@ -1046,7 +1070,7 @@ def get_ssh_proxy_command( k8s_ssh_target: str, network_mode: kubernetes_enums.KubernetesNetworkingMode, private_key_path: str, - context: str, + context: Optional[str], namespace: str, ) -> str: """Generates the SSH proxy command to connect to the pod. @@ -1144,7 +1168,8 @@ def create_proxy_command_script() -> str: return port_fwd_proxy_cmd_path -def setup_ssh_jump_svc(ssh_jump_name: str, namespace: str, context: str, +def setup_ssh_jump_svc(ssh_jump_name: str, namespace: str, + context: Optional[str], service_type: kubernetes_enums.KubernetesServiceType): """Sets up Kubernetes service resource to access for SSH jump pod. @@ -1216,7 +1241,8 @@ def setup_ssh_jump_svc(ssh_jump_name: str, namespace: str, context: str, def setup_ssh_jump_pod(ssh_jump_name: str, ssh_jump_image: str, - ssh_key_secret: str, namespace: str, context: str): + ssh_key_secret: str, namespace: str, + context: Optional[str]): """Sets up Kubernetes RBAC and pod for SSH jump host. Our Kubernetes implementation uses a SSH jump pod to reach SkyPilot clusters @@ -1296,7 +1322,8 @@ def setup_ssh_jump_pod(ssh_jump_name: str, ssh_jump_image: str, logger.info(f'Created SSH Jump Host {ssh_jump_name}.') -def clean_zombie_ssh_jump_pod(namespace: str, context: str, node_id: str): +def clean_zombie_ssh_jump_pod(namespace: str, context: Optional[str], + node_id: str): """Analyzes SSH jump pod and removes if it is in a bad state Prevents the existence of a dangling SSH jump pod. This could happen @@ -1618,7 +1645,8 @@ def check_nvidia_runtime_class(context: Optional[str] = None) -> bool: return nvidia_exists -def check_secret_exists(secret_name: str, namespace: str, context: str) -> bool: +def check_secret_exists(secret_name: str, namespace: str, + context: Optional[str]) -> bool: """Checks if a secret exists in a namespace Args: @@ -1836,7 +1864,7 @@ def get_namespace_from_config(provider_config: Dict[str, Any]) -> str: def filter_pods(namespace: str, - context: str, + context: Optional[str], tag_filters: Dict[str, str], status_filters: Optional[List[str]] = None) -> Dict[str, Any]: """Filters pods by tags and status.""" @@ -1962,6 +1990,11 @@ def set_autodown_annotations(handle: 'backends.CloudVmRayResourceHandle', context=context) -def get_context_from_config(provider_config: Dict[str, Any]) -> str: - return provider_config.get('context', - get_current_kube_config_context_name()) +def get_context_from_config(provider_config: Dict[str, Any]) -> Optional[str]: + context = provider_config.get('context', + get_current_kube_config_context_name()) + if context == IN_CLUSTER_REGION: + # If the context (also used as the region) is set to IN_CLUSTER_REGION + # we need to use in-cluster auth. + context = None + return context diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index 1cb1dfc88e6..3d4bcb0af9a 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -653,7 +653,7 @@ class KubernetesCommandRunner(CommandRunner): def __init__( self, - node: Tuple[Tuple[str, str], str], + node: Tuple[Tuple[str, Optional[str]], str], **kwargs, ): """Initialize KubernetesCommandRunner. 
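The convention threaded through this commit is that a ``None`` context means "use in-cluster auth". A condensed sketch of that convention, simplified from the helpers in ``sky/provision/kubernetes/utils.py`` above (the standalone function below is an illustration, not the actual implementation):

```python
# Condensed illustration of the None-context convention: a [None] result means
# "no kubeconfig contexts are defined, but in-cluster auth is available".
import os
from typing import List, Optional

_SA_TOKEN = '/var/run/secrets/kubernetes.io/serviceaccount/token'


def is_incluster_config_available() -> bool:
    # Only check for the mounted service account token; actually loading the
    # in-cluster config would mutate the global kubernetes client config.
    return os.path.exists(_SA_TOKEN)


def context_names(kubeconfig_contexts: List[str]) -> List[Optional[str]]:
    if kubeconfig_contexts:
        # Normal case: use the contexts defined in the kubeconfig file.
        return list(kubeconfig_contexts)
    if is_incluster_config_available():
        # Running in a pod: signal in-cluster auth with a None context.
        return [None]
    return []
```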
diff --git a/sky/utils/command_runner.pyi b/sky/utils/command_runner.pyi index e2bf2e5031c..51b22a259ea 100644 --- a/sky/utils/command_runner.pyi +++ b/sky/utils/command_runner.pyi @@ -204,7 +204,7 @@ class KubernetesCommandRunner(CommandRunner): def __init__( self, - node: Tuple[Tuple[str, str], str], + node: Tuple[Tuple[str, Optional[str]], str], ) -> None: ... diff --git a/sky/utils/controller_utils.py b/sky/utils/controller_utils.py index 118f9a2b718..39045962a78 100644 --- a/sky/utils/controller_utils.py +++ b/sky/utils/controller_utils.py @@ -363,6 +363,14 @@ def shared_controller_vars_to_fill( # again on the controller. This is required since admin_policy is not # installed on the controller. local_user_config.pop('admin_policy', None) + # Remove allowed_contexts from local_user_config since the controller + # may be running in a Kubernetes cluster with in-cluster auth and may + # not have kubeconfig available to it. This is the typical case since + # remote_identity default for Kubernetes is SERVICE_ACCOUNT. + # TODO(romilb): We should check the cloud the controller is running on + # before popping allowed_contexts. If it is not on Kubernetes, + # we may be able to use allowed_contexts. + local_user_config.pop('allowed_contexts', None) with tempfile.NamedTemporaryFile( delete=False, suffix=_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX) as temp_file: From 8dd003176336dd00f90f0f599eb6622edd5ac1f6 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sun, 29 Sep 2024 16:09:28 -0700 Subject: [PATCH 27/93] [Docs] Fix inconsistent example in spot instance documentation (#3996) * [Docs] Fix inconsistent example in spot instance documentation Fixes #3995 * Update docs/source/serving/spot-policy.rst Co-authored-by: Zhanghao Wu --------- Co-authored-by: Zhanghao Wu --- docs/source/serving/spot-policy.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/serving/spot-policy.rst b/docs/source/serving/spot-policy.rst index ff23b328705..f9785d0eeb0 100644 --- a/docs/source/serving/spot-policy.rst +++ b/docs/source/serving/spot-policy.rst @@ -96,7 +96,7 @@ When the service is up, we can check the status of the service and the replicas http-server 3 1 - 1 mins ago 1x GCP(vCPU=2) PROVISIONING us-east1 http-server 4 1 - 1 min ago 1x GCP(vCPU=2) PROVISIONING us-central1 -When the required number of spot replicas are not available, SkyServe will provision the number of on-demand replicas needed to meet the target number of replicas. For example, when the target number is 2 and only 1 spot replica is ready, SkyServe will provision 1 on-demand replica to meet the target number of replicas. +When the required number of spot replicas are not available, SkyServe will provision on-demand replicas to meet the target number of replicas. For example, when the target number is 2 and no spot replicas are ready, SkyServe will provision 2 on-demand replicas to meet the target number of replicas. .. 
code-block:: console @@ -157,4 +157,4 @@ Eventually, when the spot availability is back, SkyServe will automatically scal Service Replicas SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION http-server 2 1 http://34.68.226.193:8081 10 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 - http-server 5 1 http://34.121.49.94:8081 1 min ago 1x GCP([Spot]vCPU=2) READY us-central1 \ No newline at end of file + http-server 5 1 http://34.121.49.94:8081 1 min ago 1x GCP([Spot]vCPU=2) READY us-central1 From e437e96bca4f6f500f228840eac430b0f223393b Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Mon, 30 Sep 2024 14:12:05 -0700 Subject: [PATCH 28/93] [Examples] AWS Neuron Accelerator Example. (#4020) * [Examples] AWS Neuron Accelerator Example. * add example * auto calculate tp size & use ubuntu 2204 * add mix acc example * fix * rename --- examples/aws-neuron/inferentia.yaml | 62 ++++++++++++++++ examples/aws-neuron/mix-accelerator.yaml | 74 +++++++++++++++++++ sky/clouds/aws.py | 3 + .../data_fetchers/fetch_aws.py | 41 +++++----- 4 files changed, 163 insertions(+), 17 deletions(-) create mode 100644 examples/aws-neuron/inferentia.yaml create mode 100644 examples/aws-neuron/mix-accelerator.yaml diff --git a/examples/aws-neuron/inferentia.yaml b/examples/aws-neuron/inferentia.yaml new file mode 100644 index 00000000000..0d0773b3d09 --- /dev/null +++ b/examples/aws-neuron/inferentia.yaml @@ -0,0 +1,62 @@ +resources: + accelerators: Inferentia:6 + disk_size: 512 + ports: 9000 + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # fill + +setup: | + # Install transformers-neuronx and its dependencies + sudo apt-get install -y python3.10-venv g++ + python3.10 -m venv aws_neuron_venv_pytorch + source aws_neuron_venv_pytorch/bin/activate + pip install ipykernel + python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)" + pip install jupyter notebook + pip install environment_kernels + python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com + python -m pip install wget + python -m pip install awscli + python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx + + # Install latest version of triton. + # Reference: https://github.com/vllm-project/vllm/issues/6987 + pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple triton-nightly + + # Install vLLM from source. Avoid using dir name 'vllm' due to import conflict. + # Reference: https://github.com/vllm-project/vllm/issues/1814#issuecomment-1837122930 + git clone https://github.com/vllm-project/vllm.git vllm_repo + cd vllm_repo + pip install -U -r requirements-neuron.txt + VLLM_TARGET_DEVICE="neuron" pip install -e . + + python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')" + + sudo apt update + sudo apt install -y numactl + +run: | + source aws_neuron_venv_pytorch/bin/activate + # Calculate the tensor parallel size. vLLM requires the tensor parallel size + # to be a factor of the number of attention heads, which is 32 for the model. + # Here we calculate the largest power of 2 that is less than or equal to the + # number of GPUs per node. 
+ TENSOR_PARALLEL_SIZE=1 + while [ $(($TENSOR_PARALLEL_SIZE * 2)) -le $SKYPILOT_NUM_GPUS_PER_NODE ]; do + TENSOR_PARALLEL_SIZE=$(($TENSOR_PARALLEL_SIZE * 2)) + done + NEURON_RT_VISIBLE_CORES="0-$(($TENSOR_PARALLEL_SIZE - 1))" + OMP_NUM_THREADS=$SKYPILOT_NUM_GPUS_PER_NODE + MASTER_PORT=12355 + LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/ubuntu/miniconda3/lib" + numactl --cpunodebind=0 --membind=0 \ + python3 -m vllm.entrypoints.openai.api_server \ + --device neuron \ + --model $MODEL_NAME \ + --tensor-parallel-size $TENSOR_PARALLEL_SIZE \ + --max-num-seqs 16 \ + --max-model-len 32 \ + --block-size 32 \ + --port 9000 diff --git a/examples/aws-neuron/mix-accelerator.yaml b/examples/aws-neuron/mix-accelerator.yaml new file mode 100644 index 00000000000..fc452a06804 --- /dev/null +++ b/examples/aws-neuron/mix-accelerator.yaml @@ -0,0 +1,74 @@ +resources: + accelerators: {A100:1, Inferentia:6} + disk_size: 512 + ports: 9000 + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # fill + +setup: | + if command -v nvidia-smi; then + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 + else + # Install transformers-neuronx and its dependencies + sudo apt-get install -y python3.10-venv g++ + python3.10 -m venv aws_neuron_venv_pytorch + source aws_neuron_venv_pytorch/bin/activate + pip install ipykernel + python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)" + pip install jupyter notebook + pip install environment_kernels + python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com + python -m pip install wget + python -m pip install awscli + python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx + + # Install latest version of triton. + # Reference: https://github.com/vllm-project/vllm/issues/6987 + pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple triton-nightly + + # Install vLLM from source. Avoid using dir name 'vllm' due to import conflict. + # Reference: https://github.com/vllm-project/vllm/issues/1814#issuecomment-1837122930 + git clone https://github.com/vllm-project/vllm.git vllm_repo + cd vllm_repo + pip install -U -r requirements-neuron.txt + VLLM_TARGET_DEVICE="neuron" pip install -e . + + python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')" + + sudo apt update + sudo apt install -y numactl + fi + +run: | + if command -v nvidia-smi; then + TENSOR_PARALLEL_SIZE=$SKYPILOT_NUM_GPUS_PER_NODE + PREFIX="" + DEVICE="cuda" + else + source aws_neuron_venv_pytorch/bin/activate + # Calculate the tensor parallel size. vLLM requires the tensor parallel size + # to be a factor of the number of attention heads, which is 32 for the model. + # Here we calculate the largest power of 2 that is less than or equal to the + # number of GPUs per node. 
+ TENSOR_PARALLEL_SIZE=1 + while [ $(($TENSOR_PARALLEL_SIZE * 2)) -le $SKYPILOT_NUM_GPUS_PER_NODE ]; do + TENSOR_PARALLEL_SIZE=$(($TENSOR_PARALLEL_SIZE * 2)) + done + NEURON_RT_VISIBLE_CORES="0-$(($TENSOR_PARALLEL_SIZE - 1))" + OMP_NUM_THREADS=$SKYPILOT_NUM_GPUS_PER_NODE + MASTER_PORT=12355 + LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/ubuntu/miniconda3/lib" + PREFIX="numactl --cpunodebind=0 --membind=0" + DEVICE="neuron" + fi + $PREFIX python3 -m vllm.entrypoints.openai.api_server \ + --device $DEVICE \ + --model $MODEL_NAME \ + --tensor-parallel-size $TENSOR_PARALLEL_SIZE \ + --max-num-seqs 16 \ + --max-model-len 32 \ + --block-size 32 \ + --port 9000 diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py index 4ca57d75420..be1ecce0350 100644 --- a/sky/clouds/aws.py +++ b/sky/clouds/aws.py @@ -225,6 +225,9 @@ def _get_default_ami(cls, region_name: str, instance_type: str) -> str: if acc_name == 'K80': image_id = service_catalog.get_image_id_from_tag( 'skypilot:k80-ubuntu-2004', region_name, clouds='aws') + if acc_name in ['Trainium', 'Inferentia']: + image_id = service_catalog.get_image_id_from_tag( + 'skypilot:neuron-ubuntu-2204', region_name, clouds='aws') if image_id is not None: return image_id # Raise ResourcesUnavailableError to make sure the failover in diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_aws.py b/sky/clouds/service_catalog/data_fetchers/fetch_aws.py index 1e1d6e98c03..e0e5ffa21a1 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_aws.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_aws.py @@ -379,26 +379,33 @@ def get_all_regions_instance_types_df(regions: Set[str]) -> 'pd.DataFrame': # # Deep Learning AMI GPU PyTorch 1.10.0 (Ubuntu 18.04) 20211208 # Nvidia driver: 470.57.02, CUDA Version: 11.4 -_GPU_UBUNTU_DATE_PYTORCH = [ - ('gpu', '20.04', '20231103', '2.1.0'), - ('gpu', '18.04', '20221114', '1.10.0'), - ('k80', '20.04', '20211208', '1.10.0'), - ('k80', '18.04', '20211208', '1.10.0'), +# +# Neuron (Inferentia / Trainium): +# https://aws.amazon.com/releasenotes/aws-deep-learning-ami-base-neuron-ubuntu-20-04/ # pylint: disable=line-too-long +# Deep Learning Base Neuron AMI (Ubuntu 20.04) 20240923 +# TODO(tian): find out the driver version. +# Neuron driver: +_GPU_DESC_UBUNTU_DATE = [ + ('gpu', 'AMI GPU PyTorch 2.1.0', '20.04', '20231103'), + ('gpu', 'AMI GPU PyTorch 1.10.0', '18.04', '20221114'), + ('k80', 'AMI GPU PyTorch 1.10.0', '20.04', '20211208'), + ('k80', 'AMI GPU PyTorch 1.10.0', '18.04', '20211208'), + ('neuron', 'Base Neuron AMI', '22.04', '20240923'), ] -def _fetch_image_id(region: str, ubuntu_version: str, creation_date: str, - pytorch_version: str) -> Optional[str]: +def _fetch_image_id(region: str, description: str, ubuntu_version: str, + creation_date: str) -> Optional[str]: try: image = subprocess.check_output(f"""\ aws ec2 describe-images --region {region} --owners amazon \\ - --filters 'Name=name,Values="Deep Learning AMI GPU PyTorch {pytorch_version} (Ubuntu {ubuntu_version}) {creation_date}"' \\ + --filters 'Name=name,Values="Deep Learning {description} (Ubuntu {ubuntu_version}) {creation_date}"' \\ 'Name=state,Values=available' --query 'Images[:1].ImageId' --output text """, shell=True) except subprocess.CalledProcessError as e: - print(f'Failed {region}, {ubuntu_version}, {creation_date}. ' - 'Trying next date.') + print(f'Failed {region}, {description}, {ubuntu_version}, ' + f'{creation_date}. 
Trying next date.') print(f'{type(e)}: {e}') image_id = None else: @@ -407,21 +414,21 @@ def _fetch_image_id(region: str, ubuntu_version: str, creation_date: str, return image_id -def _get_image_row( - region: str, gpu: str, ubuntu_version: str, date: str, - pytorch_version) -> Tuple[str, str, str, str, Optional[str], str]: - print(f'Getting image for {region}, {ubuntu_version}, {gpu}') - image_id = _fetch_image_id(region, ubuntu_version, date, pytorch_version) +def _get_image_row(region: str, gpu: str, description: str, ubuntu_version: str, + date: str) -> Tuple[str, str, str, str, Optional[str], str]: + print(f'Getting image for {region}, {description}, {ubuntu_version}, {gpu}') + image_id = _fetch_image_id(region, description, ubuntu_version, date) if image_id is None: # not found - print(f'Failed to find image for {region}, {ubuntu_version}, {gpu}') + print(f'Failed to find image for {region}, {description}, ' + f'{ubuntu_version}, {gpu}') tag = f'skypilot:{gpu}-ubuntu-{ubuntu_version.replace(".", "")}' return tag, region, 'ubuntu', ubuntu_version, image_id, date def get_all_regions_images_df(regions: Set[str]) -> 'pd.DataFrame': image_metas = [ - (r, *i) for r, i in itertools.product(regions, _GPU_UBUNTU_DATE_PYTORCH) + (r, *i) for r, i in itertools.product(regions, _GPU_DESC_UBUNTU_DATE) ] with mp_pool.Pool() as pool: results = pool.starmap(_get_image_row, image_metas) From 62222ee53cacb6a8965626c89e90f9fb2b6a3940 Mon Sep 17 00:00:00 2001 From: yika-luo Date: Mon, 30 Sep 2024 17:11:48 -0700 Subject: [PATCH 29/93] [UX] Remove requirement to specify cloud in Resources to use labels (#4022) Co-authored-by: Yika Luo --- sky/resources.py | 26 ++++++++++++++------------ tests/unit_tests/test_resources.py | 30 +++++++++++++++++++++++++++++- 2 files changed, 43 insertions(+), 13 deletions(-) diff --git a/sky/resources.py b/sky/resources.py index 2f19cd1aa01..e9a522cef48 100644 --- a/sky/resources.py +++ b/sky/resources.py @@ -966,20 +966,22 @@ def _try_validate_labels(self) -> None: """ if not self._labels: return - - if self.cloud is None: - # Because each cloud has its own label format, we cannot validate - # the labels without knowing the cloud. - with ux_utils.print_exception_no_traceback(): - raise ValueError( - 'Cloud must be specified when labels are provided.') - - # Check if the label key value pairs are valid. + if self.cloud is not None: + validated_clouds = [self.cloud] + else: + # If no specific cloud is set, validate label against ALL clouds. 
+            # The label is rejected if it is invalid for any one of the clouds.
+            validated_clouds = sky_check.get_cached_enabled_clouds_or_refresh()
         invalid_table = log_utils.create_table(['Label', 'Reason'])
         for key, value in self._labels.items():
-            valid, err_msg = self.cloud.is_label_valid(key, value)
-            if not valid:
-                invalid_table.add_row([f'{key}: {value}', err_msg])
+            for cloud in validated_clouds:
+                valid, err_msg = cloud.is_label_valid(key, value)
+                if not valid:
+                    invalid_table.add_row([
+                        f'{key}: {value}',
+                        f'Label rejected due to {cloud}: {err_msg}'
+                    ])
+                    break
         if len(invalid_table.rows) > 0:
             with ux_utils.print_exception_no_traceback():
                 raise ValueError(
diff --git a/tests/unit_tests/test_resources.py b/tests/unit_tests/test_resources.py
index 01b83132a1b..5006fc454aa 100644
--- a/tests/unit_tests/test_resources.py
+++ b/tests/unit_tests/test_resources.py
@@ -6,6 +6,7 @@ import pytest

 from sky import clouds
+from sky import global_user_state
 from sky import skypilot_config
 from sky.resources import Resources
 from sky.utils import resources_utils
@@ -34,7 +35,8 @@ def test_get_reservations_available_resources():

 def _run_label_test(allowed_labels: Dict[str, str],
-                    invalid_labels: Dict[str, str], cloud: clouds.Cloud):
+                    invalid_labels: Dict[str, str],
+                    cloud: clouds.Cloud = None):
     """Run a test for labels with the given allowed and invalid labels."""
     r_allowed = Resources(cloud=cloud, labels=allowed_labels)  # Should pass
     assert r_allowed.labels == allowed_labels, ('Allowed labels '
@@ -92,6 +94,32 @@ def test_kubernetes_labels_resources():
     _run_label_test(allowed_labels, invalid_labels, cloud)


+def test_no_cloud_labels_resources():
+    global_user_state.set_enabled_clouds(['aws', 'gcp'])
+    allowed_labels = {
+        **GLOBAL_VALID_LABELS,
+    }
+    invalid_labels = {
+        **GLOBAL_INVALID_LABELS,
+        'aws:cannotstartwithaws': 'value',
+        'domain/key': 'value',  # Invalid for GCP
+    }
+    _run_label_test(allowed_labels, invalid_labels)
+
+
+def test_no_cloud_labels_resources_single_enabled_cloud():
+    global_user_state.set_enabled_clouds(['aws'])
+    allowed_labels = {
+        **GLOBAL_VALID_LABELS,
+        'domain/key': 'value',  # Valid for AWS
+    }
+    invalid_labels = {
+        **GLOBAL_INVALID_LABELS,
+        'aws:cannotstartwithaws': 'value',
+    }
+    _run_label_test(allowed_labels, invalid_labels)
+
+
 @mock.patch('sky.clouds.service_catalog.instance_type_exists',
             return_value=True)
 @mock.patch('sky.clouds.service_catalog.get_accelerators_from_instance_type',
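In practice this means labels can now be attached to a ``Resources`` object without pinning a cloud; they are validated against every enabled cloud instead. A small usage sketch (the label keys here are illustrative):

```python
# Labels no longer require `cloud=`; they are validated against all enabled
# clouds and rejected if any enabled cloud disallows them.
from sky.resources import Resources

# Passes as long as every enabled cloud accepts these keys.
r = Resources(labels={'team': 'ml', 'env': 'dev'})

# Raises ValueError when, e.g., GCP is enabled: 'domain/key' is valid on AWS
# but rejected by GCP (see the unit tests above).
r_bad = Resources(labels={'domain/key': 'value'})
```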
The Neuron SDK is a runtime and compiler for running deep learning models on AWS Inferentia chips. Here is an example of using the Neuron SDK to launch a Llama 3 8B model on an Inferentia chip:
+
+```bash
+$ sky launch -c aws-inf inferentia.yaml --env HF_TOKEN=hf_xxx
+```
+
+To send an example request to the model, you can use the following command:
+
+```bash
+$ ENDPOINT=$(sky status aws-inf --endpoint 9000)
+$ curl http://$ENDPOINT/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+      "messages": [
+        {
+          "role": "system",
+          "content": "You are a helpful assistant."
+        },
+        {
+          "role": "user",
+          "content": "Who are you?"
+        }
+      ],
+      "stop_token_ids": [128009, 128001]
+    }'
+{"id":"chat-0631550312c143d88ca6d477d0df6c2c","object":"chat.completion","created":1727751137,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'm a helpful assistant! I","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":32,"completion_tokens":7},"prompt_logprobs":null}
+```
+
+## Using multiple accelerator choices
+
+You can also specify multiple candidate resources in a task YAML, letting SkyPilot find the cheapest available one for you. Specifically, you can list both Neuron accelerators and NVIDIA GPUs in the same YAML file. Here is an example (see [multi-accelerator.yaml](./multi-accelerator.yaml)):
+
+ +Example YAML for multiple accelerators. + +```yaml +resources: + accelerators: {A100:1, Inferentia:6} + disk_size: 512 + ports: 9000 + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # fill + +setup: | + if command -v nvidia-smi; then + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 + else + # Install transformers-neuronx and its dependencies + sudo apt-get install -y python3.10-venv g++ + python3.10 -m venv aws_neuron_venv_pytorch + source aws_neuron_venv_pytorch/bin/activate + pip install ipykernel + python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)" + pip install jupyter notebook + pip install environment_kernels + python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com + python -m pip install wget + python -m pip install awscli + python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx + + # Install latest version of triton. + # Reference: https://github.com/vllm-project/vllm/issues/6987 + pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple triton-nightly + + # Install vLLM from source. Avoid using dir name 'vllm' due to import conflict. + # Reference: https://github.com/vllm-project/vllm/issues/1814#issuecomment-1837122930 + git clone https://github.com/vllm-project/vllm.git vllm_repo + cd vllm_repo + pip install -U -r requirements-neuron.txt + VLLM_TARGET_DEVICE="neuron" pip install -e . + + python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')" + + sudo apt update + sudo apt install -y numactl + fi + +run: | + if command -v nvidia-smi; then + TENSOR_PARALLEL_SIZE=$SKYPILOT_NUM_GPUS_PER_NODE + PREFIX="" + DEVICE="cuda" + else + source aws_neuron_venv_pytorch/bin/activate + # Calculate the tensor parallel size. vLLM requires the tensor parallel size + # to be a factor of the number of attention heads, which is 32 for the model. + # Here we calculate the largest power of 2 that is less than or equal to the + # number of GPUs per node. + TENSOR_PARALLEL_SIZE=1 + while [ $(($TENSOR_PARALLEL_SIZE * 2)) -le $SKYPILOT_NUM_GPUS_PER_NODE ]; do + TENSOR_PARALLEL_SIZE=$(($TENSOR_PARALLEL_SIZE * 2)) + done + NEURON_RT_VISIBLE_CORES="0-$(($TENSOR_PARALLEL_SIZE - 1))" + OMP_NUM_THREADS=$SKYPILOT_NUM_GPUS_PER_NODE + MASTER_PORT=12355 + LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/ubuntu/miniconda3/lib" + PREFIX="numactl --cpunodebind=0 --membind=0" + DEVICE="neuron" + fi + $PREFIX python3 -m vllm.entrypoints.openai.api_server \ + --device $DEVICE \ + --model $MODEL_NAME \ + --tensor-parallel-size $TENSOR_PARALLEL_SIZE \ + --max-num-seqs 16 \ + --max-model-len 32 \ + --block-size 32 \ + --port 9000 +``` + +
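As a side note on the recipe above: its run section picks the tensor parallel size as the largest power of two that is at most `$SKYPILOT_NUM_GPUS_PER_NODE`, since vLLM needs the size to be a factor of the model's 32 attention heads. A standalone Python sketch of that bash loop (illustrative only, not part of the example):

```python
def tensor_parallel_size(num_accelerators: int) -> int:
    """Largest power of 2 that is <= num_accelerators (mirrors the bash loop)."""
    size = 1
    while size * 2 <= num_accelerators:
        size *= 2
    return size


# For instance, 6 Inferentia cores give a tensor parallel size of 4,
# while 8 GPUs use all 8.
assert tensor_parallel_size(6) == 4
assert tensor_parallel_size(8) == 8
```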
diff --git a/examples/aws-neuron/mix-accelerator.yaml b/examples/aws-neuron/multi-accelerator.yaml similarity index 100% rename from examples/aws-neuron/mix-accelerator.yaml rename to examples/aws-neuron/multi-accelerator.yaml From 12706e94d70534b65259d8ebb4054c1207546890 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Fri, 4 Oct 2024 10:41:32 -0700 Subject: [PATCH 31/93] [Serve] Refactor to Fix Type Checking Errors (#3999) * refactor: make type cheker happy * Update sky/serve/service.py Co-authored-by: Tian Xia * Update sky/serve/replica_managers.py Co-authored-by: Tian Xia * Update sky/serve/replica_managers.py Co-authored-by: Tian Xia * fix: get_db_path comments * fix: filter with lambda * format * better way to filter out `None` * Update sky/serve/service.py Co-authored-by: Tian Xia * Revert "Update sky/serve/service.py" This reverts commit b7f3eb9e6b2a746872873ab6fded2683b92fe86a. --------- Co-authored-by: Tian Xia --- sky/jobs/state.py | 16 ++++++++++++---- sky/serve/autoscalers.py | 4 ++++ sky/serve/controller.py | 2 +- sky/serve/load_balancer.py | 2 +- sky/serve/replica_managers.py | 4 +++- sky/serve/serve_state.py | 16 ++++++++++++---- sky/serve/service.py | 22 ++++++++++++---------- 7 files changed, 45 insertions(+), 21 deletions(-) diff --git a/sky/jobs/state.py b/sky/jobs/state.py index 6ea68da59f8..2ef5b578b7a 100644 --- a/sky/jobs/state.py +++ b/sky/jobs/state.py @@ -20,10 +20,18 @@ logger = sky_logging.init_logger(__name__) -_DB_PATH = pathlib.Path('~/.sky/spot_jobs.db') -_DB_PATH = _DB_PATH.expanduser().absolute() -_DB_PATH.parents[0].mkdir(parents=True, exist_ok=True) -_DB_PATH = str(_DB_PATH) + +def _get_db_path() -> str: + """Workaround to collapse multi-step Path ops for type checker. + Ensures _DB_PATH is str, avoiding Union[Path, str] inference. + """ + path = pathlib.Path('~/.sky/spot_jobs.db') + path = path.expanduser().absolute() + path.parents[0].mkdir(parents=True, exist_ok=True) + return str(path) + + +_DB_PATH = _get_db_path() # Module-level connection/cursor; thread-safe as the module is only imported # once. 
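For context on the hunk above: reassigning `_DB_PATH` first as a `Path` and then as a `str` at module scope is what confuses the type checker. A minimal, standalone sketch of the before/after (the `~/example.db` path is a placeholder, not SkyPilot's actual database path):

```python
import pathlib

# Before: successive module-level reassignment; checkers see both Path and
# str for the same name, i.e., Union[Path, str].
_DB_PATH = pathlib.Path('~/example.db')      # inferred: Path
_DB_PATH = _DB_PATH.expanduser().absolute()  # still Path
_DB_PATH = str(_DB_PATH)                     # now str


# After: the steps are collapsed into a helper whose return annotation
# pins the name to str.
def _get_db_path() -> str:
    path = pathlib.Path('~/example.db').expanduser().absolute()
    path.parent.mkdir(parents=True, exist_ok=True)
    return str(path)


_DB_PATH = _get_db_path()
```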
diff --git a/sky/serve/autoscalers.py b/sky/serve/autoscalers.py index 0a6b84111c6..a4278f192fb 100644 --- a/sky/serve/autoscalers.py +++ b/sky/serve/autoscalers.py @@ -131,6 +131,10 @@ def _load_dynamic_states(self, dynamic_states: Dict[str, Any]) -> None: """Load dynamic states to autoscaler.""" raise NotImplementedError + def get_decision_interval(self) -> int: + """Get the decision interval for the autoscaler.""" + raise NotImplementedError + def load_dynamic_states(self, dynamic_states: Dict[str, Any]) -> None: """Load dynamic states to autoscaler.""" self.latest_version_ever_ready = dynamic_states.pop( diff --git a/sky/serve/controller.py b/sky/serve/controller.py index 8efc8789a8f..361a1293d21 100644 --- a/sky/serve/controller.py +++ b/sky/serve/controller.py @@ -153,7 +153,7 @@ def configure_logger(): logger.info('SkyServe Controller started on ' f'http://{self._host}:{self._port}') - uvicorn.run(self._app, host={self._host}, port=self._port) + uvicorn.run(self._app, host=self._host, port=self._port) # TODO(tian): Probably we should support service that will stop the VM in diff --git a/sky/serve/load_balancer.py b/sky/serve/load_balancer.py index 24d0958489d..c15f71e214a 100644 --- a/sky/serve/load_balancer.py +++ b/sky/serve/load_balancer.py @@ -79,7 +79,7 @@ async def _sync_with_controller(self): 'request_aggregator': self._request_aggregator.to_dict() }, - timeout=5, + timeout=aiohttp.ClientTimeout(5), ) as response: # Clean up after reporting request info to avoid OOM. self._request_aggregator.clear() diff --git a/sky/serve/replica_managers.py b/sky/serve/replica_managers.py index 81cc13c8abd..337b28ba61b 100644 --- a/sky/serve/replica_managers.py +++ b/sky/serve/replica_managers.py @@ -36,6 +36,7 @@ from sky.utils import ux_utils if typing.TYPE_CHECKING: + from sky import resources from sky.serve import service_spec logger = sky_logging.init_logger(__name__) @@ -172,9 +173,10 @@ def _get_resources_ports(task_yaml: str) -> str: task = sky.Task.from_yaml(task_yaml) # Already checked all ports are the same in sky.serve.core.up assert len(task.resources) >= 1, task - task_resources = list(task.resources)[0] + task_resources: 'resources.Resources' = list(task.resources)[0] # Already checked the resources have and only have one port # before upload the task yaml. + assert task_resources.ports is not None return task_resources.ports[0] diff --git a/sky/serve/serve_state.py b/sky/serve/serve_state.py index 7ddf22ccb81..cbc8ef3d8cc 100644 --- a/sky/serve/serve_state.py +++ b/sky/serve/serve_state.py @@ -17,10 +17,18 @@ from sky.serve import replica_managers from sky.serve import service_spec -_DB_PATH = pathlib.Path(constants.SKYSERVE_METADATA_DIR) / 'services.db' -_DB_PATH = _DB_PATH.expanduser().absolute() -_DB_PATH.parents[0].mkdir(parents=True, exist_ok=True) -_DB_PATH = str(_DB_PATH) + +def _get_db_path() -> str: + """Workaround to collapse multi-step Path ops for type checker. + Ensures _DB_PATH is str, avoiding Union[Path, str] inference. 
+ """ + path = pathlib.Path(constants.SKYSERVE_METADATA_DIR) / 'services.db' + path = path.expanduser().absolute() + path.parents[0].mkdir(parents=True, exist_ok=True) + return str(path) + + +_DB_PATH: str = _get_db_path() def create_table(cursor: 'sqlite3.Cursor', conn: 'sqlite3.Connection') -> None: diff --git a/sky/serve/service.py b/sky/serve/service.py index b1ef35cbc68..956a4839a87 100644 --- a/sky/serve/service.py +++ b/sky/serve/service.py @@ -9,7 +9,7 @@ import shutil import time import traceback -from typing import Dict, List +from typing import Dict import filelock @@ -116,15 +116,17 @@ def _cleanup(service_name: str) -> bool: logger.error(f'Replica {info.replica_id} failed to terminate.') versions = serve_state.get_service_versions(service_name) serve_state.remove_service_versions(service_name) - success = True - for version in versions: + + def cleanup_version_storage(version: int) -> bool: task_yaml: str = serve_utils.generate_task_yaml_file_name( service_name, version) logger.info(f'Cleaning up storage for version {version}, ' f'task_yaml: {task_yaml}') - success = success and cleanup_storage(task_yaml) - if not success: + return cleanup_storage(task_yaml) + + if not all(map(cleanup_version_storage, versions)): failed = True + return failed @@ -213,6 +215,7 @@ def _get_host(): # TODO(tian): Support HTTPS. controller_addr = f'http://{controller_host}:{controller_port}' + load_balancer_port = common_utils.find_free_port( constants.LOAD_BALANCER_PORT_START) @@ -236,13 +239,12 @@ def _get_host(): serve_state.set_service_status_and_active_versions( service_name, serve_state.ServiceStatus.SHUTTING_DOWN) finally: - process_to_kill: List[multiprocessing.Process] = [] - if load_balancer_process is not None: - process_to_kill.append(load_balancer_process) - if controller_process is not None: - process_to_kill.append(controller_process) # Kill load balancer process first since it will raise errors if failed # to connect to the controller. Then the controller process. + process_to_kill = [ + proc for proc in [load_balancer_process, controller_process] + if proc is not None + ] subprocess_utils.kill_children_processes( [process.pid for process in process_to_kill], force=True) for process in process_to_kill: From 81e19038efe6988ca4d3e78dfb24140c587fa32c Mon Sep 17 00:00:00 2001 From: Maksym Taran Date: Fri, 4 Oct 2024 17:35:30 -0700 Subject: [PATCH 32/93] Add a missing newline that was breaking formatting (#4037) What it says on the tin. --- docs/source/getting-started/installation.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/getting-started/installation.rst b/docs/source/getting-started/installation.rst index 9f251a5aafe..cf6115ee9e8 100644 --- a/docs/source/getting-started/installation.rst +++ b/docs/source/getting-started/installation.rst @@ -302,6 +302,7 @@ Fluidstack ~~~~~~~~~~~~~~~~~~ `Fluidstack `__ is a cloud provider offering low-cost GPUs. To configure Fluidstack access, go to the `Home `__ page on your Fluidstack console to generate an API key and then add the :code:`API key` to :code:`~/.fluidstack/api_key` : + .. code-block:: shell mkdir -p ~/.fluidstack From 1efd48a4df350b54f1e6d2b28afff19391aec0b4 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sat, 5 Oct 2024 14:18:19 -0700 Subject: [PATCH 33/93] Stop using deprecated `on_event()` decorator (#4033) * Stop using deprecated `on_event()` decorator Fixes #3997 Replace deprecated `@app.on_event('startup')` decorator with lifespan event handler in `sky/serve/controller.py`. 
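For readers unfamiliar with the replacement API, here is a minimal, standalone sketch of the FastAPI lifespan pattern (the app and the startup work shown are illustrative, not the controller's actual logic); the itemized changes follow below.

```python
import contextlib

import fastapi


@contextlib.asynccontextmanager
async def lifespan(app: fastapi.FastAPI):
    # Startup work (e.g., attaching log formatters) runs before the yield.
    print('configuring loggers')
    yield
    # Optional shutdown work runs after the yield.


app = fastapi.FastAPI(lifespan=lifespan)
```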
* Remove the `@app.on_event('startup')` decorator. * Add a lifespan event handler to configure the logger. * Update the `SkyServeController` class to use the lifespan event handler. * format and add decorator * format --- sky/serve/controller.py | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/sky/serve/controller.py b/sky/serve/controller.py index 361a1293d21..5d49c1aa307 100644 --- a/sky/serve/controller.py +++ b/sky/serve/controller.py @@ -2,6 +2,7 @@ Responsible for autoscaling and replica management. """ +import contextlib import logging import threading import time @@ -49,7 +50,14 @@ def __init__(self, service_name: str, service_spec: serve.SkyServiceSpec, autoscalers.Autoscaler.from_spec(service_name, service_spec)) self._host = host self._port = port - self._app = fastapi.FastAPI() + self._app = fastapi.FastAPI(lifespan=self.lifespan) + + @contextlib.asynccontextmanager + async def lifespan(self, _: fastapi.FastAPI): + uvicorn_access_logger = logging.getLogger('uvicorn.access') + for handler in uvicorn_access_logger.handlers: + handler.setFormatter(sky_logging.FORMATTER) + yield def _run_autoscaler(self): logger.info('Starting autoscaler.') @@ -142,12 +150,6 @@ async def update_service(request: fastapi.Request): f'{common_utils.format_exception(e)}') return {'message': 'Error'} - @self._app.on_event('startup') - def configure_logger(): - uvicorn_access_logger = logging.getLogger('uvicorn.access') - for handler in uvicorn_access_logger.handlers: - handler.setFormatter(sky_logging.FORMATTER) - threading.Thread(target=self._run_autoscaler).start() logger.info('SkyServe Controller started on ' From f4886bed755a3a6ba62554ef359fbe1dcd174d78 Mon Sep 17 00:00:00 2001 From: krishnived <81918756+krishnived@users.noreply.github.com> Date: Sat, 5 Oct 2024 17:31:34 -0500 Subject: [PATCH 34/93] Fixed Typo in Readme.md (#4039) Update README.md fix typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a5287dbb3cd..dc7de3ea574 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ ---- :fire: *News* :fire: -- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/) +- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/) - [Sep, 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI. - [Jul, 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra - [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) From b0a1ea2c54612a17569f80560445336e64c6821f Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Sun, 6 Oct 2024 10:44:33 -0700 Subject: [PATCH 35/93] [docs] Add docs for internal load balancers on k8s (#4028) * Add internal ports docs * Add internal ports docs * Add internal ports docs * Add internal ports docs --- .../reference/kubernetes/kubernetes-ports.rst | 48 +++++++++++++++++-- 1 file changed, 45 insertions(+), 3 deletions(-) diff --git a/docs/source/reference/kubernetes/kubernetes-ports.rst b/docs/source/reference/kubernetes/kubernetes-ports.rst index 0f538363131..3824b651717 100644 --- a/docs/source/reference/kubernetes/kubernetes-ports.rst +++ b/docs/source/reference/kubernetes/kubernetes-ports.rst @@ -1,7 +1,7 @@ .. 
_kubernetes-ports: Exposing Services on Kubernetes -------------------------------- +=============================== .. note:: This is a guide on how to configure an existing Kubernetes cluster (along with the caveats involved) to successfully expose ports and services externally through SkyPilot. @@ -23,7 +23,7 @@ If your cluster does not support LoadBalancer services, SkyPilot can also use `a .. _kubernetes-loadbalancer: LoadBalancer Service -^^^^^^^^^^^^^^^^^^^^ +-------------------- This mode exposes ports through a Kubernetes `LoadBalancer Service `__. This is the default mode used by SkyPilot. @@ -52,11 +52,53 @@ These load balancers will be automatically terminated when the cluster is delete To work around this issue, make sure all your ports have services running behind them. +Internal Load Balancers +^^^^^^^^^^^^^^^^^^^^^^^ + +To restrict your services to be accessible only within the cluster, you can set all SkyPilot services to use `internal load balancers `_. + +Depending on your cloud, set the appropriate annotation in the SkyPilot config file (``~/.sky/config.yaml``): + +.. tab-set:: + + .. tab-item:: GCP + :sync: internal-lb-gke + + .. code-block:: yaml + + # ~/.sky/config.yaml + kubernetes: + custom_metadata: + annotations: + networking.gke.io/load-balancer-type: "Internal" + + .. tab-item:: AWS + :sync: internal-lb-aws + + .. code-block:: yaml + + # ~/.sky/config.yaml + kubernetes: + custom_metadata: + annotations: + service.beta.kubernetes.io/aws-load-balancer-internal: "true" + + .. tab-item:: Azure + :sync: internal-lb-azure + + .. code-block:: yaml + + # ~/.sky/config.yaml + kubernetes: + custom_metadata: + annotations: + service.beta.kubernetes.io/azure-load-balancer-internal: "true" + .. _kubernetes-ingress: Nginx Ingress -^^^^^^^^^^^^^ +------------- This mode exposes ports by creating a Kubernetes `Ingress `_ backed by an existing `Nginx Ingress Controller `_. From d5b6d89c83ea1ee7258f68314da4c6f8add83e04 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sun, 6 Oct 2024 13:30:14 -0700 Subject: [PATCH 36/93] Fix error handling in service update process (#4034) * Fix error handling in service update process Fixes #4030 Address error handling inconsistency in service update process. * **sky/serve/controller.py** - Modify `/controller/update_service` endpoint to return appropriate HTTP status codes. - Return 400 for client errors and 500 for server errors. - Use `responses.JSONResponse` for returning responses. * **sky/serve/serve_utils.py** - Update `update_service_encoded` function to handle different status codes. - Raise exceptions based on the response body for 400 and 500 status codes. * **sky/utils/subprocess_utils.py** - Add `stream_logs` parameter in the comment to reflect the code. 
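Reduced to a standalone sketch, the convention these bullets describe looks like the following (the handler body is illustrative; the actual diff appears below):

```python
import fastapi
from fastapi import responses

app = fastapi.FastAPI()


@app.post('/controller/update_service')
async def update_service(request: fastapi.Request) -> fastapi.Response:
    request_data = await request.json()
    if request_data.get('version') is None:
        # Malformed request from the client -> 400.
        return responses.JSONResponse(
            content={'message': 'Error: version is not specified.'},
            status_code=400)
    try:
        # ... apply the update here ...
        return responses.JSONResponse(content={'message': 'Success'},
                                      status_code=200)
    except Exception:  # pylint: disable=broad-except
        # Unexpected server-side failure -> 500, so callers can tell the
        # two cases apart instead of parsing a 200 body.
        return responses.JSONResponse(content={'message': 'Error'},
                                      status_code=500)
```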
* format * apply to load_balancer_sync for consistency --- sky/serve/controller.py | 21 ++++++++++++++------- sky/serve/serve_utils.py | 4 ++++ sky/utils/subprocess_utils.py | 1 + 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/sky/serve/controller.py b/sky/serve/controller.py index 5d49c1aa307..580964273ef 100644 --- a/sky/serve/controller.py +++ b/sky/serve/controller.py @@ -10,6 +10,7 @@ from typing import Any, Dict, List import fastapi +from fastapi import responses import uvicorn from sky import serve @@ -96,7 +97,8 @@ def _run_autoscaler(self): def run(self) -> None: @self._app.post('/controller/load_balancer_sync') - async def load_balancer_sync(request: fastapi.Request): + async def load_balancer_sync( + request: fastapi.Request) -> fastapi.Response: request_data = await request.json() # TODO(MaoZiming): Check aggregator type. request_aggregator: Dict[str, Any] = request_data.get( @@ -104,18 +106,21 @@ async def load_balancer_sync(request: fastapi.Request): timestamps: List[int] = request_aggregator.get('timestamps', []) logger.info(f'Received {len(timestamps)} inflight requests.') self._autoscaler.collect_request_information(request_aggregator) - return { + return responses.JSONResponse(content={ 'ready_replica_urls': self._replica_manager.get_active_replica_urls() - } + }, + status_code=200) @self._app.post('/controller/update_service') - async def update_service(request: fastapi.Request): + async def update_service(request: fastapi.Request) -> fastapi.Response: request_data = await request.json() try: version = request_data.get('version', None) if version is None: - return {'message': 'Error: version is not specified.'} + return responses.JSONResponse( + content={'message': 'Error: version is not specified.'}, + status_code=400) update_mode_str = request_data.get( 'mode', serve_utils.DEFAULT_UPDATE_MODE.value) update_mode = serve_utils.UpdateMode(update_mode_str) @@ -144,11 +149,13 @@ async def update_service(request: fastapi.Request): self._autoscaler.update_version(version, service, update_mode=update_mode) - return {'message': 'Success'} + return responses.JSONResponse(content={'message': 'Success'}, + status_code=200) except Exception as e: # pylint: disable=broad-except logger.error(f'Error in update_service: ' f'{common_utils.format_exception(e)}') - return {'message': 'Error'} + return responses.JSONResponse(content={'message': 'Error'}, + status_code=500) threading.Thread(target=self._run_autoscaler).start() diff --git a/sky/serve/serve_utils.py b/sky/serve/serve_utils.py index 4a6467a6a32..0ecf34135a7 100644 --- a/sky/serve/serve_utils.py +++ b/sky/serve/serve_utils.py @@ -302,6 +302,10 @@ def update_service_encoded(service_name: str, version: int, mode: str) -> str: raise ValueError('The service is up-ed in an old version and does not ' 'support update. Please `sky serve down` ' 'it first and relaunch the service. ') + elif resp.status_code == 400: + raise ValueError(f'Client error during service update: {resp.text}') + elif resp.status_code == 500: + raise RuntimeError(f'Server error during service update: {resp.text}') elif resp.status_code != 200: raise ValueError(f'Failed to update service: {resp.text}') diff --git a/sky/utils/subprocess_utils.py b/sky/utils/subprocess_utils.py index d1779352a81..303e3ddad99 100644 --- a/sky/utils/subprocess_utils.py +++ b/sky/utils/subprocess_utils.py @@ -77,6 +77,7 @@ def handle_returncode(returncode: int, command: The command that was run. error_msg: The error message to print. stderr: The stderr of the command. 
+ stream_logs: Whether to stream logs. """ echo = logger.error if stream_logs else logger.debug if returncode != 0: From 3f898abe10c1aa2a05da0e00f0c6a9a947b53bbc Mon Sep 17 00:00:00 2001 From: yika-luo Date: Mon, 7 Oct 2024 18:45:58 -0700 Subject: [PATCH 37/93] [Storage] Add .skyignore support (#4038) * [Storage] Add .skyignore support * lint fix * fix lint * Make sky job launch consistent with sky launch * remove unused comments * Don't use .git/info/exclude when .skyignore is present * Don't use .git/info/exclude when .skyignore is present 2 * Update SkyPilot Reference Page * address comments * Handle all files under current dir * link * no absolute path * use / in front of individual files and dirs * correct **/ --------- Co-authored-by: Yika Luo --- .../examples/syncing-code-artifacts.rst | 31 ++++++++-- docs/source/reference/yaml-spec.rst | 4 +- sky/backends/backend_utils.py | 24 ++++---- sky/backends/cloud_vm_ray_backend.py | 4 +- sky/data/storage.py | 12 ++-- sky/data/storage_utils.py | 58 ++++++++++++++++++- sky/skylet/constants.py | 2 + sky/utils/command_runner.py | 39 +++++++------ sky/utils/command_runner.pyi | 3 +- tests/unit_tests/test_storage_utils.py | 55 ++++++++++++++++++ 10 files changed, 184 insertions(+), 48 deletions(-) create mode 100644 tests/unit_tests/test_storage_utils.py diff --git a/docs/source/examples/syncing-code-artifacts.rst b/docs/source/examples/syncing-code-artifacts.rst index 814bd00fb25..ded8d03f739 100644 --- a/docs/source/examples/syncing-code-artifacts.rst +++ b/docs/source/examples/syncing-code-artifacts.rst @@ -47,10 +47,30 @@ scripts, access checkpoints, etc.). .. note:: + **Exclude files from syncing** + For large, multi-gigabyte workdirs, uploading may be slow because they - are synced to the remote VM(s) with :code:`rsync`. To exclude large files in - your workdir from being uploaded, add them to the :code:`.gitignore` file - (or a ``.git/info/exclude`` file) under the workdir. + are synced to the remote VM(s). To exclude large files in + your workdir from being uploaded, add them to a :code:`.skyignore` file + under your workdir. :code:`.skyignore` follows RSYNC filter rules. + + Example :code:`.skyignore` file: + + .. code-block:: + + # Files that match pattern under ONLY CURRENT directory + /hello.py + /*.txt + /dir + + # Files that match pattern under ALL directories + *.txt + hello.py + + # Files that match pattern under a directory ./dir/ + /dir/*.txt + + Do NOT use ``.`` to indicate local directory (e.g. ``./hello.py``). .. note:: @@ -101,9 +121,8 @@ pass the ``--no-setup`` flag to ``sky launch``. For example, ``sky launch --no-s .. note:: - Items listed in a :code:`.gitignore` file (or a ``.git/info/exclude`` file) - under a local file_mount source are also ignored (the same behavior as - handling ``workdir``). + Items listed in a :code:`.skyignore` file under the local file_mount source + are also ignored (the same behavior as handling ``workdir``). .. note:: diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst index 228cbd7c88f..c5339bcc184 100644 --- a/docs/source/reference/yaml-spec.rst +++ b/docs/source/reference/yaml-spec.rst @@ -22,8 +22,8 @@ Available fields: # If a relative path is used, it's evaluated relative to the location from # which `sky` is called. # - # If a .gitignore file (or a .git/info/exclude file) exists in the working - # directory, files and directories listed in it will be excluded from syncing. 
+ # To exclude files from syncing, add them to a .skyignore file under your working directory. + # Details: https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#uploading-code-and-project-files workdir: ~/my-task-code # Number of nodes (optional; defaults to 1) to launch including the head node. diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py index b83817b9b42..24f638a12b9 100644 --- a/sky/backends/backend_utils.py +++ b/sky/backends/backend_utils.py @@ -280,18 +280,22 @@ def path_size_megabytes(path: str) -> int: If successful: the size of 'path' in megabytes, rounded down. Otherwise, -1. """ - resolved_path = pathlib.Path(path).expanduser().resolve() git_exclude_filter = '' - if (resolved_path / command_runner.GIT_EXCLUDE).exists(): - # Ensure file exists; otherwise, rsync will error out. - # - # We shlex.quote() because the path may contain spaces: - # 'my dir/.git/info/exclude' - # Without quoting rsync fails. - git_exclude_filter = command_runner.RSYNC_EXCLUDE_OPTION.format( - shlex.quote(str(resolved_path / command_runner.GIT_EXCLUDE))) + resolved_path = pathlib.Path(path).expanduser().resolve() + if (resolved_path / constants.SKY_IGNORE_FILE).exists(): + rsync_filter = command_runner.RSYNC_FILTER_SKYIGNORE + else: + rsync_filter = command_runner.RSYNC_FILTER_GITIGNORE + if (resolved_path / command_runner.GIT_EXCLUDE).exists(): + # Ensure file exists; otherwise, rsync will error out. + # + # We shlex.quote() because the path may contain spaces: + # 'my dir/.git/info/exclude' + # Without quoting rsync fails. + git_exclude_filter = command_runner.RSYNC_EXCLUDE_OPTION.format( + shlex.quote(str(resolved_path / command_runner.GIT_EXCLUDE))) rsync_command = (f'rsync {command_runner.RSYNC_DISPLAY_OPTION} ' - f'{command_runner.RSYNC_FILTER_OPTION} ' + f'{rsync_filter} ' f'{git_exclude_filter} --dry-run {path!r}') rsync_output = '' try: diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 4d6e0eb4fb7..714e4fc14eb 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -3056,7 +3056,7 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle, logger.warning( f'{fore.YELLOW}The size of workdir {workdir!r} ' f'is {dir_size} MB. Try to keep workdir small or use ' - '.gitignore to exclude large files, as large sizes will slow ' + '.skyignore to exclude large files, as large sizes will slow ' f'down rsync.{style.RESET_ALL}') log_path = os.path.join(self.log_dir, 'workdir_sync.log') @@ -4470,7 +4470,7 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle, logger.warning( f'{fore.YELLOW}The size of file mount src {src!r} ' f'is {src_size} MB. Try to keep src small or use ' - '.gitignore to exclude large files, as large sizes ' + '.skyignore to exclude large files, as large sizes ' f'will slow down rsync. 
{style.RESET_ALL}') if os.path.islink(full_src): logger.warning( diff --git a/sky/data/storage.py b/sky/data/storage.py index 5214799d2f3..78174ad1ed5 100644 --- a/sky/data/storage.py +++ b/sky/data/storage.py @@ -1298,8 +1298,7 @@ def get_file_sync_command(base_dir_path, file_names): def get_dir_sync_command(src_dir_path, dest_dir_name): # we exclude .git directory from the sync - excluded_list = storage_utils.get_excluded_files_from_gitignore( - src_dir_path) + excluded_list = storage_utils.get_excluded_files(src_dir_path) excluded_list.append('.git/*') excludes = ' '.join([ f'--exclude {shlex.quote(file_name)}' @@ -1764,8 +1763,7 @@ def get_file_sync_command(base_dir_path, file_names): return sync_command def get_dir_sync_command(src_dir_path, dest_dir_name): - excluded_list = storage_utils.get_excluded_files_from_gitignore( - src_dir_path) + excluded_list = storage_utils.get_excluded_files(src_dir_path) # we exclude .git directory from the sync excluded_list.append(r'^\.git/.*$') excludes = '|'.join(excluded_list) @@ -2490,8 +2488,7 @@ def get_file_sync_command(base_dir_path, file_names) -> str: def get_dir_sync_command(src_dir_path, dest_dir_name) -> str: # we exclude .git directory from the sync - excluded_list = storage_utils.get_excluded_files_from_gitignore( - src_dir_path) + excluded_list = storage_utils.get_excluded_files(src_dir_path) excluded_list.append('.git/') excludes_list = ';'.join( [file_name.rstrip('*') for file_name in excluded_list]) @@ -2895,8 +2892,7 @@ def get_file_sync_command(base_dir_path, file_names): def get_dir_sync_command(src_dir_path, dest_dir_name): # we exclude .git directory from the sync - excluded_list = storage_utils.get_excluded_files_from_gitignore( - src_dir_path) + excluded_list = storage_utils.get_excluded_files(src_dir_path) excluded_list.append('.git/*') excludes = ' '.join([ f'--exclude {shlex.quote(file_name)}' diff --git a/sky/data/storage_utils.py b/sky/data/storage_utils.py index 245325806a3..a1295d5e3ee 100644 --- a/sky/data/storage_utils.py +++ b/sky/data/storage_utils.py @@ -1,4 +1,5 @@ """Utility functions for the storage module.""" +import glob import os import shlex import subprocess @@ -8,6 +9,8 @@ from sky import exceptions from sky import sky_logging +from sky.skylet import constants +from sky.utils import common_utils from sky.utils import log_utils from sky.utils.cli_utils import status_utils @@ -63,6 +66,42 @@ def format_storage_table(storages: List[Dict[str, Any]], return 'No existing storage.' +def get_excluded_files_from_skyignore(src_dir_path: str) -> List[str]: + """List files and patterns ignored by the .skyignore file + in the given source directory. + """ + excluded_list: List[str] = [] + expand_src_dir_path = os.path.expanduser(src_dir_path) + skyignore_path = os.path.join(expand_src_dir_path, + constants.SKY_IGNORE_FILE) + + try: + with open(skyignore_path, 'r', encoding='utf-8') as f: + for line in f: + line = line.strip() + if line and not line.startswith('#'): + # Make parsing consistent with rsync. + # Rsync uses '/' as current directory. + if line.startswith('/'): + line = '.' + line + else: + line = '**/' + line + # Find all files matching the pattern. + matching_files = glob.glob(os.path.join( + expand_src_dir_path, line), + recursive=True) + # Process filenames to comply with cloud rsync format. 
+ for i in range(len(matching_files)): + matching_files[i] = os.path.relpath( + matching_files[i], expand_src_dir_path) + excluded_list.extend(matching_files) + except IOError as e: + logger.warning(f'Error reading {skyignore_path}: ' + f'{common_utils.format_exception(e, use_bracket=True)}') + + return excluded_list + + def get_excluded_files_from_gitignore(src_dir_path: str) -> List[str]: """ Lists files and patterns ignored by git in the source directory @@ -78,7 +117,8 @@ def get_excluded_files_from_gitignore(src_dir_path: str) -> List[str]: expand_src_dir_path = os.path.expanduser(src_dir_path) git_exclude_path = os.path.join(expand_src_dir_path, '.git/info/exclude') - gitignore_path = os.path.join(expand_src_dir_path, '.gitignore') + gitignore_path = os.path.join(expand_src_dir_path, + constants.GIT_IGNORE_FILE) git_exclude_exists = os.path.isfile(git_exclude_path) gitignore_exists = os.path.isfile(gitignore_path) @@ -162,3 +202,19 @@ def get_excluded_files_from_gitignore(src_dir_path: str) -> List[str]: to_be_excluded += '*' excluded_list.append(to_be_excluded) return excluded_list + + +def get_excluded_files(src_dir_path: str) -> List[str]: + # TODO: this could return a huge list of files, + # should think of ways to optimize. + """ List files and directories to be excluded.""" + expand_src_dir_path = os.path.expanduser(src_dir_path) + skyignore_path = os.path.join(expand_src_dir_path, + constants.SKY_IGNORE_FILE) + if os.path.exists(skyignore_path): + logger.info(f'Exclude files to sync to cluster based on ' + f'{constants.SKY_IGNORE_FILE}.') + return get_excluded_files_from_skyignore(src_dir_path) + logger.info(f'Exclude files to sync to cluster based on ' + f'{constants.GIT_IGNORE_FILE}.') + return get_excluded_files_from_gitignore(src_dir_path) diff --git a/sky/skylet/constants.py b/sky/skylet/constants.py index f23dc8100b5..5729d75c968 100644 --- a/sky/skylet/constants.py +++ b/sky/skylet/constants.py @@ -7,6 +7,8 @@ SKY_LOGS_DIRECTORY = '~/sky_logs' SKY_REMOTE_WORKDIR = '~/sky_workdir' +SKY_IGNORE_FILE = '.skyignore' +GIT_IGNORE_FILE = '.gitignore' # Default Ray port is 6379. Default Ray dashboard port is 8265. # Default Ray tempdir is /tmp/ray. diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index 3d4bcb0af9a..2936e7c5e62 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -16,8 +16,6 @@ logger = sky_logging.init_logger(__name__) -# The git exclude file to support. -GIT_EXCLUDE = '.git/info/exclude' # Rsync options # TODO(zhwu): This will print a per-file progress bar (with -P), # shooting a lot of messages to the output. --info=progress2 is used @@ -30,7 +28,10 @@ # Note that "-" is mandatory for rsync and means all patterns in the ignore # files are treated as *exclude* patterns. Non-exclude patterns, e.g., "! # do_not_exclude" doesn't work, even though git allows it. -RSYNC_FILTER_OPTION = '--filter=\'dir-merge,- .gitignore\'' +RSYNC_FILTER_SKYIGNORE = f'--filter=\'dir-merge,- {constants.SKY_IGNORE_FILE}\'' +RSYNC_FILTER_GITIGNORE = f'--filter=\'dir-merge,- {constants.GIT_IGNORE_FILE}\'' +# The git exclude file to support. +GIT_EXCLUDE = '.git/info/exclude' RSYNC_EXCLUDE_OPTION = '--exclude-from={}' _HASH_MAX_LENGTH = 10 @@ -237,21 +238,23 @@ def _rsync( rsync_command += ['rsync', RSYNC_DISPLAY_OPTION] # --filter - rsync_command.append(RSYNC_FILTER_OPTION) - - if up: - # Build --exclude-from argument. - # The source is a local path, so we need to resolve it. 
- resolved_source = pathlib.Path(source).expanduser().resolve() - if (resolved_source / GIT_EXCLUDE).exists(): - # Ensure file exists; otherwise, rsync will error out. - # - # We shlex.quote() because the path may contain spaces: - # 'my dir/.git/info/exclude' - # Without quoting rsync fails. - rsync_command.append( - RSYNC_EXCLUDE_OPTION.format( - shlex.quote(str(resolved_source / GIT_EXCLUDE)))) + # The source is a local path, so we need to resolve it. + resolved_source = pathlib.Path(source).expanduser().resolve() + if (resolved_source / constants.SKY_IGNORE_FILE).exists(): + rsync_command.append(RSYNC_FILTER_SKYIGNORE) + else: + rsync_command.append(RSYNC_FILTER_GITIGNORE) + if up: + # Build --exclude-from argument. + if (resolved_source / GIT_EXCLUDE).exists(): + # Ensure file exists; otherwise, rsync will error out. + # + # We shlex.quote() because the path may contain spaces: + # 'my dir/.git/info/exclude' + # Without quoting rsync fails. + rsync_command.append( + RSYNC_EXCLUDE_OPTION.format( + shlex.quote(str(resolved_source / GIT_EXCLUDE)))) rsync_command.append(f'-e {shlex.quote(rsh_option)}') diff --git a/sky/utils/command_runner.pyi b/sky/utils/command_runner.pyi index 51b22a259ea..a2c524e4e5d 100644 --- a/sky/utils/command_runner.pyi +++ b/sky/utils/command_runner.pyi @@ -16,7 +16,8 @@ from sky.utils import subprocess_utils as subprocess_utils GIT_EXCLUDE: str RSYNC_DISPLAY_OPTION: str -RSYNC_FILTER_OPTION: str +RSYNC_FILTER_GITIGNORE: str +RSYNC_FILTER_SKYIGNORE: str RSYNC_EXCLUDE_OPTION: str ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD: str diff --git a/tests/unit_tests/test_storage_utils.py b/tests/unit_tests/test_storage_utils.py new file mode 100644 index 00000000000..cd1e436390b --- /dev/null +++ b/tests/unit_tests/test_storage_utils.py @@ -0,0 +1,55 @@ +import os +import tempfile + +from sky.data import storage_utils +from sky.skylet import constants + + +def test_get_excluded_files_from_skyignore_no_file(): + excluded_files = storage_utils.get_excluded_files_from_skyignore('.') + assert len(excluded_files) == 0 + + +def test_get_excluded_files_from_skyignore(): + with tempfile.TemporaryDirectory() as temp_dir: + # Create workdir + dirs = ['remove_dir', 'dir', 'dir/subdir', 'dir/subdir/remove_dir'] + files = [ + 'remove.py', 'remove.sh', 'remove.a', 'keep.py', 'remove.a', + 'dir/keep.txt', 'dir/remove.sh', 'dir/keep.a', 'dir/remove.b', + 'dir/remove.a', 'dir/subdir/keep.b', 'dir/subdir/remove.py' + ] + for dir_name in dirs: + os.makedirs(os.path.join(temp_dir, dir_name), exist_ok=True) + for file_path in files: + full_path = os.path.join(temp_dir, file_path) + with open(full_path, 'w') as f: + f.write('test content') + + # Create skyignore file + skyignore_content = """ + # Current directory + /remove.py + /remove_dir + /*.a + /dir/*.b + # Pattern match for all subdirectories + *.sh + remove.a + """ + skyignore_path = os.path.join(temp_dir, constants.SKY_IGNORE_FILE) + with open(skyignore_path, 'w') as f: + f.write(skyignore_content) + + # Test function + excluded_files = storage_utils.get_excluded_files_from_skyignore( + temp_dir) + + # Validate results + expected_excluded_files = [ + 'remove.py', 'remove_dir', 'remove.sh', 'remove.a', 'dir/remove.sh', + 'dir/remove.b', 'remove.a', 'dir/remove.a' + ] + for file_path in expected_excluded_files: + assert file_path in excluded_files + assert len(excluded_files) == len(expected_excluded_files) From ea881b46b0a93fac8f1012bfc27af2e5234c605c Mon Sep 17 00:00:00 2001 From: Jay Thomason Date: Wed, 9 Oct 2024 18:13:40 -0700 Subject: [PATCH 
38/93] aws: use IMDSv2 in zone shell cmd (#4052)

* aws: use IMDSv2 in zone shell cmd

Usage of IMDSv2 is considered a best practice for security.
See: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html

This change was tested manually on an EC2 instance using a test script.

* fix formatting

* fix whitespace

* prefer IMDSv2 in aws template for node config
---
 sky/clouds/aws.py            | 5 ++++-
 sky/templates/aws-ray.yml.j2 | 3 +++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py
index be1ecce0350..2207a977f25 100644
--- a/sky/clouds/aws.py
+++ b/sky/clouds/aws.py
@@ -299,7 +299,10 @@ def get_zone_shell_cmd(cls) -> Optional[str]:
         # The command for getting the current zone is from:
         # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-identity-documents.html # pylint: disable=line-too-long
         command_str = (
-            'curl -s http://169.254.169.254/latest/dynamic/instance-identity/document'  # pylint: disable=line-too-long
+            'TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" '
+            '-H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` && '
+            'curl -H "X-aws-ec2-metadata-token: $TOKEN" -s '
+            'http://169.254.169.254/latest/dynamic/instance-identity/document'
             f' | {constants.SKY_PYTHON_CMD} -u -c "import sys, json; '
             'print(json.load(sys.stdin)[\'availabilityZone\'])"')
         return command_str
diff --git a/sky/templates/aws-ray.yml.j2 b/sky/templates/aws-ray.yml.j2
index 6afdf381cc0..11c3c3e1a3c 100644
--- a/sky/templates/aws-ray.yml.j2
+++ b/sky/templates/aws-ray.yml.j2
@@ -131,6 +131,9 @@ available_node_types:
       - Key: {{ label_key }}
         Value: {{ label_value|tojson }}
 {%- endfor %}
+      # Use IMDSv2
+      MetadataOptions:
+          HttpTokens: required
 
 head_node_type: ray.head.default
 

From 5491cf3e3e3945e5a9938df583e4155cff90d765 Mon Sep 17 00:00:00 2001
From: Tian Xia
Date: Wed, 9 Oct 2024 22:01:27 -0700
Subject: [PATCH 39/93] [K8s] Add user hash to the kind config for multi-user
 system permission issue (#4045)

* [K8s] Remove the kind config after `sky local down` for multi-user system
permission issue

* upd

* fix

* resolve comments
---
 sky/cli.py                             |  3 ++-
 sky/utils/kubernetes/create_cluster.sh | 13 ++++++++-----
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/sky/cli.py b/sky/cli.py
index c538c99aeb3..093db23adbf 100644
--- a/sky/cli.py
+++ b/sky/cli.py
@@ -5097,7 +5097,8 @@ def _deploy_local_cluster(gpus: bool):
 
     # Get directory of script and run it from there
     cwd = os.path.dirname(os.path.abspath(up_script_path))
-    run_command = up_script_path + ' --gpus' if gpus else up_script_path
+    run_command = up_script_path + f' {common_utils.get_user_hash()}'
+    run_command = run_command + ' --gpus' if gpus else run_command
     run_command = shlex.split(run_command)
 
     # Setup logging paths
diff --git a/sky/utils/kubernetes/create_cluster.sh b/sky/utils/kubernetes/create_cluster.sh
index 52bbd1804e8..7c5c4cea57f 100755
--- a/sky/utils/kubernetes/create_cluster.sh
+++ b/sky/utils/kubernetes/create_cluster.sh
@@ -12,9 +12,11 @@ IMAGE_GPU="us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu:l
 PORT_RANGE_START=30000
 PORT_RANGE_END=30100
 
+USER_HASH=$1
+
 # Check for GPU flag
 ENABLE_GPUS=false
-if [[ "$1" == "--gpus" ]]; then
+if [[ "$2" == "--gpus" ]]; then
   ENABLE_GPUS=true
 fi
 
@@ -88,16 +90,17 @@ if kind get clusters | grep -q skypilot; then
 fi
 
 # Generate cluster YAML
-echo "Generating /tmp/skypilot-kind.yaml"
+YAML_PATH="/tmp/skypilot-kind-$USER_HASH.yaml"
+echo "Generating $YAML_PATH"
 
 # Add GPUs flag to the
generate_kind_config.py command if GPUs are enabled if $ENABLE_GPUS; then - python -m sky.utils.kubernetes.generate_kind_config --path /tmp/skypilot-kind.yaml --port-start ${PORT_RANGE_START} --port-end ${PORT_RANGE_END} --gpus + python -m sky.utils.kubernetes.generate_kind_config --path $YAML_PATH --port-start ${PORT_RANGE_START} --port-end ${PORT_RANGE_END} --gpus else - python -m sky.utils.kubernetes.generate_kind_config --path /tmp/skypilot-kind.yaml --port-start ${PORT_RANGE_START} --port-end ${PORT_RANGE_END} + python -m sky.utils.kubernetes.generate_kind_config --path $YAML_PATH --port-start ${PORT_RANGE_START} --port-end ${PORT_RANGE_END} fi -kind create cluster --config /tmp/skypilot-kind.yaml --name skypilot +kind create cluster --config $YAML_PATH --name skypilot echo "Kind cluster created." From 292febc85cee658bd4370dd1894f2a6c97a5e264 Mon Sep 17 00:00:00 2001 From: Eric Meier Date: Thu, 10 Oct 2024 16:41:12 -0700 Subject: [PATCH 40/93] Up paperspace MAX_POLLS_FOR_UP_OR_STOP (#3952) --- sky/provision/paperspace/instance.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sky/provision/paperspace/instance.py b/sky/provision/paperspace/instance.py index ce1a4768c24..5804362d102 100644 --- a/sky/provision/paperspace/instance.py +++ b/sky/provision/paperspace/instance.py @@ -14,7 +14,7 @@ POLL_INTERVAL = 5 MAX_POLLS = 60 // POLL_INTERVAL # Stopping instances can take several minutes, so we increase the timeout -MAX_POLLS_FOR_UP_OR_STOP = MAX_POLLS * 8 +MAX_POLLS_FOR_UP_OR_STOP = MAX_POLLS * 16 logger = sky_logging.init_logger(__name__) From d0d221fae659ccce73df5684fca53e0719dab814 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 10 Oct 2024 22:38:25 -0700 Subject: [PATCH 41/93] [k8s] Fix rsync for context name with `:` and `/` (#4065) * [kubernetes] Fix context name with colon * comment * remove additional / * Move the encoding to kubernetes only --- sky/utils/command_runner.py | 9 ++++++++- sky/utils/kubernetes/rsync_helper.sh | 8 +++++++- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index 2936e7c5e62..c94970ce764 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -831,10 +831,17 @@ def get_remote_home_dir() -> str: # Build command. helper_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'kubernetes', 'rsync_helper.sh') + namespace_context = f'{self.namespace}+{self.context}' + # Avoid rsync interpreting :, /, and + in namespace_context as the + # default delimiter for options and arguments. + # rsync_helper.sh will parse the namespace_context by reverting the + # encoding and pass it to kubectl exec. + encoded_namespace_context = namespace_context.replace( + ':', '%3A').replace('/', '%2F').replace('+', '%2B') self._rsync( source, target, - node_destination=f'{self.pod_name}@{self.namespace}+{self.context}', + node_destination=f'{self.pod_name}@{encoded_namespace_context}', up=up, rsh_option=helper_path, log_path=log_path, diff --git a/sky/utils/kubernetes/rsync_helper.sh b/sky/utils/kubernetes/rsync_helper.sh index 30b63fe6a15..0ee93d8521a 100755 --- a/sky/utils/kubernetes/rsync_helper.sh +++ b/sky/utils/kubernetes/rsync_helper.sh @@ -4,9 +4,15 @@ shift pod=$1 shift -namespace_context=$1 +echo "pod: $pod" >&2 +encoded_namespace_context=$1 +# Revert the encoded namespace+context to the original string. 
+namespace_context=$(echo "$encoded_namespace_context" | sed 's|%3A|:|g' | sed 's|%2B|+|g' | sed 's|%2F|/|g') +echo "namespace_context: $namespace_context" >&2 namespace=$(echo $namespace_context | cut -d+ -f1) +echo "namespace: $namespace" >&2 context=$(echo $namespace_context | grep '+' >/dev/null && echo $namespace_context | cut -d+ -f2- || echo "") +echo "context: $context" >&2 context_lower=$(echo "$context" | tr '[:upper:]' '[:lower:]') shift if [ -z "$context" ] || [ "$context_lower" = "none" ]; then From f63850b5e9e45b954c1f7b46b36250b13a5a5b1b Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Fri, 11 Oct 2024 10:05:37 -0700 Subject: [PATCH 42/93] [k8s] Add `sky status` flag to query global Kubernetes status (#4040) * sky global status for kubernetes * add parsing for jobs controllers * better parsing for jobs controllers * sorting * linting * wip * lint * cleanup * comment cleanup * comment cleanup * comment cleanup * update docst * update docstr * update docstr * refactor to avoid cyclic import * lint * merge * Fix context name in sky show-gpus * lint * lint * fixes * fixes * comments --- sky/cli.py | 95 +++++++++++++- sky/data/storage_utils.py | 7 +- sky/jobs/__init__.py | 2 + sky/jobs/core.py | 78 ++++++++++++ sky/jobs/utils.py | 29 +++-- sky/provision/kubernetes/utils.py | 25 ++++ sky/utils/cli_utils/status_utils.py | 189 ++++++++++++++++++++++++---- sky/utils/common_utils.py | 20 +++ 8 files changed, 409 insertions(+), 36 deletions(-) diff --git a/sky/cli.py b/sky/cli.py index 093db23adbf..70c4a13704f 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -1458,6 +1458,79 @@ def _get_services(service_names: Optional[List[str]], return num_services, msg +def _status_kubernetes(show_all: bool): + """Show all SkyPilot resources in the current Kubernetes context. + + Args: + show_all (bool): Show all job information (e.g., start time, failures). + """ + context = kubernetes_utils.get_current_kube_config_context_name() + try: + pods = kubernetes_utils.get_skypilot_pods(context) + except exceptions.ResourcesUnavailableError as e: + with ux_utils.print_exception_no_traceback(): + raise ValueError('Failed to get SkyPilot pods from ' + f'Kubernetes: {str(e)}') from e + all_clusters, jobs_controllers, serve_controllers = ( + status_utils.process_skypilot_pods(pods, context)) + all_jobs = [] + with rich_utils.safe_status( + '[bold cyan]Checking in-progress managed jobs[/]') as spinner: + for i, (_, job_controller_info) in enumerate(jobs_controllers.items()): + user = job_controller_info['user'] + pod = job_controller_info['pods'][0] + status_message = ('[bold cyan]Checking managed jobs controller') + if len(jobs_controllers) > 1: + status_message += f's ({i+1}/{len(jobs_controllers)})' + spinner.update(f'{status_message}[/]') + try: + job_list = managed_jobs.queue_from_kubernetes_pod( + pod.metadata.name) + except RuntimeError as e: + logger.warning('Failed to get managed jobs from controller ' + f'{pod.metadata.name}: {str(e)}') + job_list = [] + # Add user field to jobs + for job in job_list: + job['user'] = user + all_jobs.extend(job_list) + # Reconcile cluster state between managed jobs and clusters: + # To maintain a clear separation between regular SkyPilot clusters + # and those from managed jobs, we need to exclude the latter from + # the main cluster list. + # We do this by reconstructing managed job cluster names from each + # job's name and ID. We then use this set to filter out managed + # clusters from the main cluster list. 
This is necessary because there + # are no identifiers distinguishing clusters from managed jobs from + # regular clusters. + managed_job_cluster_names = set() + for job in all_jobs: + # Managed job cluster name is - + managed_cluster_name = f'{job["job_name"]}-{job["job_id"]}' + managed_job_cluster_names.add(managed_cluster_name) + unmanaged_clusters = [ + c for c in all_clusters + if c['cluster_name'] not in managed_job_cluster_names + ] + click.echo(f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes cluster state (context: {context})' + f'{colorama.Style.RESET_ALL}') + status_utils.show_kubernetes_cluster_status_table(unmanaged_clusters, + show_all) + if all_jobs: + click.echo(f'\n{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Managed jobs' + f'{colorama.Style.RESET_ALL}') + msg = managed_jobs.format_job_table(all_jobs, show_all=show_all) + click.echo(msg) + if serve_controllers: + # TODO: Parse serve controllers and show services separately. + # Currently we show a hint that services are shown as clusters. + click.echo(f'\n{colorama.Style.DIM}Hint: SkyServe replica pods are ' + 'shown in the "SkyPilot clusters" section.' + f'{colorama.Style.RESET_ALL}') + + @cli.command() @click.option('--all', '-a', @@ -1503,6 +1576,14 @@ def _get_services(service_names: Optional[List[str]], is_flag=True, required=False, help='Also show sky serve services, if any.') +@click.option( + '--kubernetes', + '--k8s', + default=False, + is_flag=True, + required=False, + help='[Experimental] Show all SkyPilot resources (including from other ' + 'users) in the current Kubernetes context.') @click.argument('clusters', required=False, type=str, @@ -1512,7 +1593,7 @@ def _get_services(service_names: Optional[List[str]], # pylint: disable=redefined-builtin def status(all: bool, refresh: bool, ip: bool, endpoints: bool, endpoint: Optional[int], show_managed_jobs: bool, - show_services: bool, clusters: List[str]): + show_services: bool, kubernetes: bool, clusters: List[str]): # NOTE(dev): Keep the docstring consistent between the Python API and CLI. """Show clusters. @@ -1571,6 +1652,9 @@ def status(all: bool, refresh: bool, ip: bool, endpoints: bool, or for autostop-enabled clusters, use ``--refresh`` to query the latest cluster statuses from the cloud providers. """ + if kubernetes: + _status_kubernetes(all) + return # Using a pool with 2 worker to run the managed job query and sky serve # service query in parallel to speed up. The pool provides a AsyncResult # object that can be used as a future. 
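To make the reconciliation step in `_status_kubernetes` above concrete, here is a hedged, self-contained sketch of the name-based filtering (the record shapes mirror the dictionaries used above; the data is made up):

```python
from typing import Any, Dict, List


def filter_unmanaged_clusters(
        clusters: List[Dict[str, Any]],
        jobs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Managed job clusters are named '<job_name>-<job_id>'; drop any
    # cluster whose name matches, since no other field distinguishes them.
    managed_names = {f'{job["job_name"]}-{job["job_id"]}' for job in jobs}
    return [c for c in clusters if c['cluster_name'] not in managed_names]


# Example: the 'train-7' pods came from managed job 'train' with ID 7.
jobs = [{'job_name': 'train', 'job_id': 7}]
clusters = [{'cluster_name': 'train-7'}, {'cluster_name': 'dev'}]
assert filter_unmanaged_clusters(clusters, jobs) == [{'cluster_name': 'dev'}]
```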
@@ -3113,7 +3197,12 @@ def _output(): print_section_titles = False # If cloud is kubernetes, we want to show real-time capacity if kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes): - context = region + if region: + context = region + else: + # If region is not specified, we use the current context + context = ( + kubernetes_utils.get_current_kube_config_context_name()) try: # If --cloud kubernetes is not specified, we want to catch # the case where no GPUs are available on the cluster and @@ -3128,7 +3217,7 @@ def _output(): else: print_section_titles = True yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'Kubernetes GPUs (Context: {context})' + f'Kubernetes GPUs (context: {context})' f'{colorama.Style.RESET_ALL}\n') yield from k8s_realtime_table.get_string() k8s_node_table = _get_kubernetes_node_info_table(context) diff --git a/sky/data/storage_utils.py b/sky/data/storage_utils.py index a1295d5e3ee..7b5bf48d5db 100644 --- a/sky/data/storage_utils.py +++ b/sky/data/storage_utils.py @@ -12,7 +12,6 @@ from sky.skylet import constants from sky.utils import common_utils from sky.utils import log_utils -from sky.utils.cli_utils import status_utils logger = sky_logging.init_logger(__name__) @@ -22,6 +21,8 @@ 'to the cloud storage for {path!r}' 'due to the following error: {error_msg!r}') +_LAST_USE_TRUNC_LENGTH = 25 + def format_storage_table(storages: List[Dict[str, Any]], show_all: bool = False) -> str: @@ -46,8 +47,8 @@ def format_storage_table(storages: List[Dict[str, Any]], if show_all: command = row['last_use'] else: - command = status_utils.truncate_long_string( - row['last_use'], status_utils.COMMAND_TRUNC_LENGTH) + command = common_utils.truncate_long_string(row['last_use'], + _LAST_USE_TRUNC_LENGTH) storage_table.add_row([ # NAME row['name'], diff --git a/sky/jobs/__init__.py b/sky/jobs/__init__.py index 922bb613ff7..5688ca7c7a2 100644 --- a/sky/jobs/__init__.py +++ b/sky/jobs/__init__.py @@ -8,6 +8,7 @@ from sky.jobs.core import cancel from sky.jobs.core import launch from sky.jobs.core import queue +from sky.jobs.core import queue_from_kubernetes_pod from sky.jobs.core import tail_logs from sky.jobs.recovery_strategy import DEFAULT_RECOVERY_STRATEGY from sky.jobs.recovery_strategy import RECOVERY_STRATEGIES @@ -34,6 +35,7 @@ 'cancel', 'launch', 'queue', + 'queue_from_kubernetes_pod', 'tail_logs', # utils 'ManagedJobCodeGen', diff --git a/sky/jobs/core.py b/sky/jobs/core.py index c4f59f65eca..2cfc2783b4b 100644 --- a/sky/jobs/core.py +++ b/sky/jobs/core.py @@ -9,6 +9,7 @@ import sky from sky import backends from sky import exceptions +from sky import provision as provision_lib from sky import sky_logging from sky import status_lib from sky import task as task_lib @@ -16,6 +17,7 @@ from sky.clouds.service_catalog import common as service_catalog_common from sky.jobs import constants as managed_job_constants from sky.jobs import utils as managed_job_utils +from sky.provision import common from sky.skylet import constants as skylet_constants from sky.usage import usage_lib from sky.utils import admin_policy_utils @@ -138,6 +140,82 @@ def launch( _disable_controller_check=True) +def queue_from_kubernetes_pod( + pod_name: str, + context: Optional[str] = None, + skip_finished: bool = False) -> List[Dict[str, Any]]: + """Gets the jobs queue from a specific controller pod. + + Args: + pod_name (str): The name of the controller pod to query for jobs. + context (Optional[str]): The Kubernetes context to use. If None, the + current context is used. 
+ skip_finished (bool): If True, does not return finished jobs. + + Returns: + [ + { + 'job_id': int, + 'job_name': str, + 'resources': str, + 'submitted_at': (float) timestamp of submission, + 'end_at': (float) timestamp of end, + 'duration': (float) duration in seconds, + 'recovery_count': (int) Number of retries, + 'status': (sky.jobs.ManagedJobStatus) of the job, + 'cluster_resources': (str) resources of the cluster, + 'region': (str) region of the cluster, + } + ] + + Raises: + RuntimeError: If there's an error fetching the managed jobs. + """ + # Create dummy cluster info to get the command runner. + provider_config = {'context': context} + instances = { + pod_name: [ + common.InstanceInfo(instance_id=pod_name, + internal_ip='', + external_ip='', + tags={}) + ] + } # Internal IP is not required for Kubernetes + cluster_info = common.ClusterInfo(provider_name='kubernetes', + head_instance_id=pod_name, + provider_config=provider_config, + instances=instances) + managed_jobs_runner = provision_lib.get_command_runners( + 'kubernetes', cluster_info)[0] + + code = managed_job_utils.ManagedJobCodeGen.get_job_table() + returncode, job_table_payload, stderr = managed_jobs_runner.run( + code, + require_outputs=True, + separate_stderr=True, + stream_logs=False, + ) + try: + subprocess_utils.handle_returncode(returncode, + code, + 'Failed to fetch managed jobs', + job_table_payload + stderr, + stream_logs=False) + except exceptions.CommandError as e: + raise RuntimeError(str(e)) from e + + jobs = managed_job_utils.load_managed_job_queue(job_table_payload) + if skip_finished: + # Filter out the finished jobs. If a multi-task job is partially + # finished, we will include all its tasks. + non_finished_tasks = list( + filter(lambda job: not job['status'].is_terminal(), jobs)) + non_finished_job_ids = {job['job_id'] for job in non_finished_tasks} + jobs = list( + filter(lambda job: job['job_id'] in non_finished_job_ids, jobs)) + return jobs + + @usage_lib.entrypoint def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]: # NOTE(dev): Keep the docstring consistent between the Python API and CLI. diff --git a/sky/jobs/utils.py b/sky/jobs/utils.py index 524a0cb0478..d46404bd4fd 100644 --- a/sky/jobs/utils.py +++ b/sky/jobs/utils.py @@ -599,11 +599,20 @@ def format_job_table( a list of "rows" (each of which is a list of str). """ jobs = collections.defaultdict(list) + # Check if the tasks have user information. + tasks_have_user = any([task.get('user') for task in tasks]) + if max_jobs and tasks_have_user: + raise ValueError('max_jobs is not supported when tasks have user info.') + + def get_hash(task): + if tasks_have_user: + return (task['user'], task['job_id']) + return task['job_id'] + for task in tasks: # The tasks within the same job_id are already sorted # by the task_id. - jobs[task['job_id']].append(task) - jobs = dict(jobs) + jobs[get_hash(task)].append(task) status_counts: Dict[str, int] = collections.defaultdict(int) for job_tasks in jobs.values(): @@ -611,17 +620,14 @@ def format_job_table( if not managed_job_status.is_terminal(): status_counts[managed_job_status.value] += 1 - if max_jobs is not None: - job_ids = sorted(jobs.keys(), reverse=True) - job_ids = job_ids[:max_jobs] - jobs = {job_id: jobs[job_id] for job_id in job_ids} - columns = [ 'ID', 'TASK', 'NAME', 'RESOURCES', 'SUBMITTED', 'TOT. 
DURATION', 'JOB DURATION', '#RECOVERIES', 'STATUS' ] if show_all: columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE'] + if tasks_have_user: + columns.insert(0, 'USER') job_table = log_utils.create_table(columns) status_counts: Dict[str, int] = collections.defaultdict(int) @@ -636,9 +642,9 @@ def format_job_table( for task in all_tasks: # The tasks within the same job_id are already sorted # by the task_id. - jobs[task['job_id']].append(task) + jobs[get_hash(task)].append(task) - for job_id, job_tasks in jobs.items(): + for job_hash, job_tasks in jobs.items(): if len(job_tasks) > 1: # Aggregate the tasks into a new row in the table. job_name = job_tasks[0]['job_name'] @@ -674,6 +680,7 @@ def format_job_table( if not managed_job_status.is_terminal(): status_str += f' (task: {current_task_id})' + job_id = job_hash[1] if tasks_have_user else job_hash job_values = [ job_id, '', @@ -692,6 +699,8 @@ def format_job_table( '-', failure_reason if failure_reason is not None else '-', ]) + if tasks_have_user: + job_values.insert(0, job_tasks[0].get('user', '-')) job_table.add_row(job_values) for task in job_tasks: @@ -724,6 +733,8 @@ def format_job_table( task['failure_reason'] if task['failure_reason'] is not None else '-', ]) + if tasks_have_user: + values.insert(0, task.get('user', '-')) job_table.add_row(values) if len(job_tasks) > 1: diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index 0498cc7f59f..3924074838e 100644 --- a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -1998,3 +1998,28 @@ def get_context_from_config(provider_config: Dict[str, Any]) -> Optional[str]: # we need to use in-cluster auth. context = None return context + + +def get_skypilot_pods(context: Optional[str] = None) -> List[Any]: + """Gets all SkyPilot pods in the Kubernetes cluster. + + Args: + context: Kubernetes context to use. If None, uses the current context. + + Returns: + A list of Kubernetes pod objects. + """ + if context is None: + context = get_current_kube_config_context_name() + + try: + pods = kubernetes.core_api(context).list_pod_for_all_namespaces( + label_selector='skypilot-cluster', + _request_timeout=kubernetes.API_TIMEOUT).items + except kubernetes.max_retry_error(): + raise exceptions.ResourcesUnavailableError( + 'Timed out trying to get SkyPilot pods from Kubernetes cluster. ' + 'Please check if the cluster is healthy and retry. To debug, run: ' + 'kubectl get pods --selector=skypilot-cluster --all-namespaces' + ) from None + return pods diff --git a/sky/utils/cli_utils/status_utils.py b/sky/utils/cli_utils/status_utils.py index 3a783f03bb4..09172f24814 100644 --- a/sky/utils/cli_utils/status_utils.py +++ b/sky/utils/cli_utils/status_utils.py @@ -1,12 +1,16 @@ """Utilities for sky status.""" -from typing import Any, Callable, Dict, List, Optional +from typing import Any, Callable, Dict, List, Optional, Tuple import click import colorama from sky import backends +from sky import clouds as sky_clouds +from sky import resources as resources_lib from sky import status_lib +from sky.provision.kubernetes import utils as kubernetes_utils from sky.skylet import constants +from sky.utils import common_utils from sky.utils import log_utils from sky.utils import resources_utils @@ -19,25 +23,6 @@ _ClusterCostReportRecord = Dict[str, Any] -def truncate_long_string(s: str, max_length: int = 35) -> str: - if len(s) <= max_length: - return s - splits = s.split(' ') - if len(splits[0]) > max_length: - return splits[0][:max_length] + '...' # Use '…'? 
- # Truncate on word boundary. - i = 0 - total = 0 - for i, part in enumerate(splits): - total += len(part) - if total >= max_length: - break - prefix = ' '.join(splits[:i]) - if len(prefix) < max_length: - prefix += s[len(prefix):max_length] - return prefix + '...' - - class StatusColumn: """One column of the displayed cluster table""" @@ -54,7 +39,7 @@ def __init__(self, def calc(self, record): val = self.calc_func(record) if self.trunc_length != 0: - val = truncate_long_string(str(val), self.trunc_length) + val = common_utils.truncate_long_string(str(val), self.trunc_length) return val @@ -316,3 +301,165 @@ def _get_estimated_cost_for_cost_report( return '-' return f'$ {cost:.2f}' + + +def show_kubernetes_cluster_status_table(clusters: List[Any], + show_all: bool) -> None: + """Compute cluster table values and display for Kubernetes clusters.""" + status_columns = [ + StatusColumn('USER', lambda c: c['user']), + StatusColumn('NAME', lambda c: c['cluster_name']), + StatusColumn( + 'LAUNCHED', + lambda c: log_utils.readable_time_duration(c['launched_at'])), + StatusColumn('RESOURCES', + lambda c: c['resources_str'], + trunc_length=70 if not show_all else 0), + StatusColumn('STATUS', lambda c: c['status'].colored_str()), + # TODO(romilb): We should consider adding POD_NAME field here when --all + # is passed to help users fetch pod name programmatically. + ] + + columns = [ + col.name for col in status_columns if col.show_by_default or show_all + ] + cluster_table = log_utils.create_table(columns) + + # Sort table by user, then by cluster name + sorted_clusters = sorted(clusters, + key=lambda c: (c['user'], c['cluster_name'])) + + for cluster in sorted_clusters: + row = [] + for status_column in status_columns: + if status_column.show_by_default or show_all: + row.append(status_column.calc(cluster)) + cluster_table.add_row(row) + + if clusters: + click.echo(f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'SkyPilot clusters' + f'{colorama.Style.RESET_ALL}') + click.echo(cluster_table) + else: + click.echo('No SkyPilot resources found in the ' + 'active Kubernetes context.') + + +def process_skypilot_pods( + pods: List[Any], + context: Optional[str] = None +) -> Tuple[List[Dict[Any, Any]], Dict[str, Any], Dict[str, Any]]: + """Process SkyPilot pods on k8s to extract cluster and controller info. + + Args: + pods: List of Kubernetes pod objects. + context: Kubernetes context name, used to detect GPU label formatter. + + Returns: + A tuple containing: + - List of dictionaries with cluster information. + - Dictionary of job controller information. + - Dictionary of serve controller information. + + Each dictionary contains the following keys: + 'cluster_name_on_cloud': The cluster_name_on_cloud used by SkyPilot + 'cluster_name': The cluster name without the user hash + 'user': The user who created the cluster. 
Fetched from pod label + 'status': The cluster status (assumed UP if pod exists) + 'pods': List of pod objects in the cluster + 'launched_at': Timestamp of when the cluster was launched + 'resources': sky.Resources object for the cluster + """ + clusters: Dict[str, Dict] = {} + jobs_controllers: Dict[str, Dict] = {} + serve_controllers: Dict[str, Dict] = {} + + for pod in pods: + cluster_name_on_cloud = pod.metadata.labels.get('skypilot-cluster') + cluster_name = cluster_name_on_cloud.rsplit( + '-', 1 + )[0] # Remove the user hash to get cluster name (e.g., mycluster-2ea4) + + # Check if cluster name is name of a controller + # Can't use controller_utils.Controllers.from_name(cluster_name) + # because hash is different across users + if 'controller' in cluster_name_on_cloud: + start_time = pod.status.start_time.timestamp() + controller_info = { + 'cluster_name_on_cloud': cluster_name_on_cloud, + 'cluster_name': cluster_name, + 'user': pod.metadata.labels.get('skypilot-user'), + 'status': status_lib.ClusterStatus.UP, + # Assuming UP if pod exists + 'pods': [pod], + 'launched_at': start_time + } + if 'sky-jobs-controller' in cluster_name_on_cloud: + jobs_controllers[cluster_name_on_cloud] = controller_info + elif 'sky-serve-controller' in cluster_name_on_cloud: + serve_controllers[cluster_name_on_cloud] = controller_info + + if cluster_name_on_cloud not in clusters: + # Parse the start time for the cluster + start_time = pod.status.start_time + if start_time is not None: + start_time = pod.status.start_time.timestamp() + + # Parse resources + cpu_request = kubernetes_utils.parse_cpu_or_gpu_resource( + pod.spec.containers[0].resources.requests.get('cpu', '0')) + memory_request = kubernetes_utils.parse_memory_resource( + pod.spec.containers[0].resources.requests.get('memory', '0'), + unit='G') + gpu_count = kubernetes_utils.parse_cpu_or_gpu_resource( + pod.spec.containers[0].resources.requests.get( + 'nvidia.com/gpu', '0')) + if gpu_count > 0: + label_formatter, _ = ( + kubernetes_utils.detect_gpu_label_formatter(context)) + assert label_formatter is not None, ( + 'GPU label formatter cannot be None if there are pods ' + f'requesting GPUs: {pod.metadata.name}') + gpu_label = label_formatter.get_label_key() + # Get GPU name from pod node selector + if pod.spec.node_selector is not None: + gpu_name = label_formatter.get_accelerator_from_label_value( + pod.spec.node_selector.get(gpu_label)) + + resources = resources_lib.Resources( + cloud=sky_clouds.Kubernetes(), + cpus=int(cpu_request), + memory=int(memory_request), + accelerators=(f'{gpu_name}:{gpu_count}' + if gpu_count > 0 else None)) + if pod.status.phase == 'Pending': + # If pod is pending, do not show it in the status + continue + + clusters[cluster_name_on_cloud] = { + 'cluster_name_on_cloud': cluster_name_on_cloud, + 'cluster_name': cluster_name, + 'user': pod.metadata.labels.get('skypilot-user'), + 'status': status_lib.ClusterStatus.UP, + 'pods': [], + 'launched_at': start_time, + 'resources': resources, + } + else: + # Update start_time if this pod started earlier + pod_start_time = pod.status.start_time + if pod_start_time is not None: + pod_start_time = pod_start_time.timestamp() + if pod_start_time < clusters[cluster_name_on_cloud][ + 'launched_at']: + clusters[cluster_name_on_cloud][ + 'launched_at'] = pod_start_time + clusters[cluster_name_on_cloud]['pods'].append(pod) + # Update resources_str in clusters: + for cluster_name, cluster in clusters.items(): + resources = cluster['resources'] + num_pods = len(cluster['pods']) + 
resources_str = f'{num_pods}x {resources}' + cluster['resources_str'] = resources_str + return list(clusters.values()), jobs_controllers, serve_controllers diff --git a/sky/utils/common_utils.py b/sky/utils/common_utils.py index dffe784cc33..4a8e6aa37d6 100644 --- a/sky/utils/common_utils.py +++ b/sky/utils/common_utils.py @@ -679,3 +679,23 @@ def new_func(*args, **kwargs): return func(*args, **kwargs) return new_func + + +def truncate_long_string(s: str, max_length: int = 35) -> str: + """Truncate a string to a maximum length, preserving whole words.""" + if len(s) <= max_length: + return s + splits = s.split(' ') + if len(splits[0]) > max_length: + return splits[0][:max_length] + '...' # Use '…'? + # Truncate on word boundary. + i = 0 + total = 0 + for i, part in enumerate(splits): + total += len(part) + if total >= max_length: + break + prefix = ' '.join(splits[:i]) + if len(prefix) < max_length: + prefix += s[len(prefix):max_length] + return prefix + '...' From fdd68b209ee74f9282fac5c6834907d5fe72d255 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Fri, 11 Oct 2024 13:28:54 -0700 Subject: [PATCH 43/93] [Docs] Fix GA (#4071) * fix ga * ordering * quotes --- docs/requirements-docs.txt | 1 + docs/source/conf.py | 5 ++++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/requirements-docs.txt b/docs/requirements-docs.txt index 161f0eecd54..7627218e451 100644 --- a/docs/requirements-docs.txt +++ b/docs/requirements-docs.txt @@ -12,6 +12,7 @@ sphinx-book-theme==1.1.0 sphinx-togglebutton==0.3.2 sphinxcontrib-applehelp==1.0.7 sphinxcontrib-devhelp==1.0.5 +sphinxcontrib-googleanalytics==0.4 sphinxcontrib-htmlhelp==2.0.4 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.6 diff --git a/docs/source/conf.py b/docs/source/conf.py index 5e6396c932b..a8ce3270e88 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -27,7 +27,6 @@ # -- General configuration extensions = [ - 'sphinxemoji.sphinxemoji', 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.duration', @@ -38,6 +37,8 @@ 'sphinx_autodoc_typehints', 'sphinx_click', 'sphinx_copybutton', + 'sphinxcontrib.googleanalytics', + 'sphinxemoji.sphinxemoji', 'sphinx_design', 'myst_parser', ] @@ -162,6 +163,8 @@ def render_svg_logo(path): exclude_patterns = ['_gallery_original'] myst_heading_anchors = 3 +googleanalytics_id = 'G-92WF3MDCJV' + def setup(app): app.connect('builder-inited', From d63497c267b62ebc6cb952d25312f98852ca6c8d Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Sat, 12 Oct 2024 15:31:49 -0700 Subject: [PATCH 44/93] [UX] A new look of SkyPilot console outputs (#4023) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * [UX] default to minimal logging (no module/line number/timestamp). * Fix mypy. * Fix typing * Update sky/utils/env_options.py Co-authored-by: Tian Xia * Update sky/utils/env_options.py Co-authored-by: Tian Xia * Account for debug flag. * Remove prefixes from docs. 
* wip
* Optimize the output
* optimize logging
* format
* Update the ux
* fix options
* Fix logs ux for controller
* Add job starting title
* fixes
* keep align
* fix indent
* UX v3
* Format
* UX for launching
* Add UX for setup and mounts
* Fix setup and file mounts
* Fix output
* Refactor output
* Fix
* update
* Change to βš™οΈ
* New alternative
* cyan for spinner
* address comments
* format
* format
* refactor and fix
* format
* format
* controller logs
* fix serve ux
* Updated serve UX
* Fix serve ux
* format
* Fix backward compat job log
* Fix streaming for old clusters
* Fix nested status
* fix status
* Add looking for resources spinner
* Fix azure logging
* format
* format
* Fix old provisioner
* add a new internal IP for Lambda
* fix multi-worker for old provisioner
* Avoid error out for refresh in teardown
* format
* Fix k8s output
* Fixes
* fix
* Fix
* format
* Fix smoke minimal
* Fix validating minimal
* fix managed job tests
* address comments
* dim indent and green finish line
* Fix optimizer output
* Fix nested rich status
* format
* reducing refreshing frequency
* remove accidentally added file
* update docs
* update docs
* increase initial delay for smoke test
* A diff icon
* increase refresh frequency
* minor
* fix
* fix message
* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* fix
* fix the smoke test yaml
* fix
* Add docstr
* fix
* shorten style / fore
* rename class
* move constants
* Add indent symbol for instance up
* Update controller setup
* format
* rename env_key
* minor move
* format

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
---
 docs/source/examples/auto-failover.rst        | 107 ++---
 sky/adaptors/azure.py                         |   4 +-
 sky/adaptors/common.py                        |   8 +-
 sky/backends/backend.py                       |  13 +-
 sky/backends/backend_utils.py                 |  29 +-
 sky/backends/cloud_vm_ray_backend.py          | 368 ++++++++++--------
 sky/backends/local_docker_backend.py          |   4 +-
 sky/benchmark/benchmark_utils.py              |   9 +-
 sky/cli.py                                    |  64 +--
 sky/clouds/service_catalog/aws_catalog.py     |  13 +-
 sky/clouds/service_catalog/common.py          |   7 +-
 sky/clouds/service_catalog/cudo_catalog.py    |  12 +-
 sky/core.py                                   |   6 +-
 sky/data/storage.py                           |  76 ++--
 sky/data/storage_utils.py                     |  12 +-
 sky/exceptions.py                             |   5 +
 sky/execution.py                              |  34 +-
 sky/jobs/core.py                              |  16 +-
 sky/jobs/utils.py                             |  25 +-
 sky/optimizer.py                              |  87 +++--
 sky/provision/aws/config.py                   |  21 +-
 sky/provision/azure/config.py                 |  17 +-
 sky/provision/azure/instance.py               |  24 +-
 sky/provision/kubernetes/instance.py          |   4 +-
 sky/provision/provisioner.py                  | 137 +++----
 sky/serve/core.py                             |  82 ++--
 sky/sky_logging.py                            |  14 +-
 sky/skylet/log_lib.py                         |   9 +-
 .../providers/lambda_cloud/node_provider.py   |   2 +-
 sky/utils/command_runner.py                   |  22 +-
 sky/utils/common_utils.py                     |   7 +-
 sky/utils/controller_utils.py                 | 107 +++--
 sky/utils/env_options.py                      |  29 +-
 sky/utils/log_utils.py                        |  63 +--
 sky/utils/resources_utils.py                  |  23 ++
 sky/utils/rich_utils.py                       |  60 ++-
 sky/utils/ux_utils.py                         |  67 +++-
 tests/skyserve/http/aws.yaml                  |   4 +-
 tests/test_smoke.py                           |  57 ++-
 39 files changed, 1004 insertions(+), 644 deletions(-)

diff --git a/docs/source/examples/auto-failover.rst b/docs/source/examples/auto-failover.rst
index 99ee5703738..8ac9d5c71bf 100644
--- a/docs/source/examples/auto-failover.rst
+++ b/docs/source/examples/auto-failover.rst
@@ -53,26 +53,26 @@ Cross-region failover
 
 The provisioner first retries across all regions within a task's chosen cloud.
 
-A common high-end GPU to use in deep learning is a NVIDIA V100 GPU. These GPUs
+A common high-end GPU to use in AI is an NVIDIA A100 GPU. These GPUs
These GPUs are often in high demand and hard to get. Let's see how SkyPilot's auto-failover provisioner handles such a request: .. code-block:: console - $ sky launch -c gpu --gpus V100 - ... # optimizer output - I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})]. - I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters. - I 02-11 21:17:43 cloud_vm_ray_backend.py:614] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log - I 02-11 21:17:43 cloud_vm_ray_backend.py:624] - I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a) - W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + $ sky launch -c gpu --gpus A100 + + ... + Launching a new cluster 'gpu'. Proceed? [Y/n]: + βš™οΈ Launching on GCP us-central1 (us-central1-a). + W 10-11 18:25:57 instance_utils.py:112] Got return codes 'VM_MIN_COUNT_NOT_REACHED', 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS' in us-central1-a: 'Requested minimum count of 1 VMs could not be created'; "The zone 'projects/xxxxxx/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'" + ... + + βš™οΈ Launching on GCP us-central1 (us-central1-f) ... - I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f) - W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-11 21:18:38 cloud_vm_ray_backend.py:624] - I 02-11 21:18:38 cloud_vm_ray_backend.py:624] Launching on GCP us-west1 (us-west1-a) - Successfully connected to 35.230.120.87. + + βš™οΈ Launching on GCP us-west1 (us-west1-a) + ... + βœ“ Cluster launched: a100-8. View logs at: ~/sky_logs/sky-2024-10-11-18-32-48-894132/provision.log GCP was chosen as the best cloud to run the task. There was no capacity in any of the regions in US Central, so the auto-failover provisioner moved to US West instead, allowing for our instance to be successfully provisioned. @@ -81,28 +81,37 @@ Cross-cloud failover If all regions within the chosen cloud failed, the provisioner retries on the next cheapest cloud. -Here is an example of cross-cloud failover when requesting 8x V100 GPUs. All -regions in GCP failed to provide the resource, so the provisioner switched to -AWS, where it succeeded after two regions: +Here is an example of cross-cloud failover when requesting 8x A100 GPUs. All +regions in Azure failed to provide the resource, so the provisioner switched to +GCP, where it succeeded after one region: .. code-block:: console - $ sky launch -c v100-8 --gpus V100:8 - ... # optimizer output - I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})]. - I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. 
Run `sky status` to see existing clusters. - I 02-23 16:39:59 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log - I 02-23 16:39:59 cloud_vm_ray_backend.py:668] - I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a) - W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) + $ sky launch -c a100-8 --gpus A100:8 + + Considered resources (1 node): + ---------------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + ---------------------------------------------------------------------------------------------------- + Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20 βœ” + GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39 + AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 + ---------------------------------------------------------------------------------------------------- + Launching a new cluster 'a100-8'. Proceed? [Y/n]: + + ... + βš™οΈ Launching on Azure eastus. + E 10-11 18:24:59 instance.py:457] Failed to create instances: [azure.core.exceptions.HttpResponseError] (InvalidTemplateDeployment) + sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in eastus ... - I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c) - W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2: - W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. + + βš™οΈ Launching on GCP us-central1 (us-central1-a). + W 10-11 18:25:57 instance_utils.py:112] Got return codes 'VM_MIN_COUNT_NOT_REACHED', 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS' in us-central1-a: 'Requested minimum count of 1 VMs could not be created'; "The zone 'projects/xxxxxx/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'" ... - I 02-23 16:42:26 cloud_vm_ray_backend.py:668] - I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d) - I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed. + + βš™οΈ Launching on GCP us-central1 (us-central1-b). + Instance is up. + βœ“ Cluster launched: a100-8. View logs at: ~/sky_logs/sky-2024-10-11-18-24-14-357884/provision.log Multiple Candidate GPUs @@ -125,13 +134,13 @@ A10, L4, and A10g GPUs, using :code:`sky launch task.yaml`. $ sky launch task.yaml ... 
- I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- - I 11-19 08:07:45 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- - I 11-19 08:07:45 optimizer.py:910] Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 βœ” - I 11-19 08:07:45 optimizer.py:910] GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 - I 11-19 08:07:45 optimizer.py:910] AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 - I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + ----------------------------------------------------------------------------------------------------- + Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 βœ” + GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 + AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 + ----------------------------------------------------------------------------------------------------- @@ -212,15 +221,15 @@ This will generate the following output: $ sky launch -c mycluster task.yaml ... - I 12-20 23:55:56 optimizer.py:717] - I 12-20 23:55:56 optimizer.py:840] Considered resources (1 node): - I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- - I 12-20 23:55:56 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- - I 12-20 23:55:56 optimizer.py:910] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 βœ” - I 12-20 23:55:56 optimizer.py:910] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 - I 12-20 23:55:56 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39 - I 12-20 23:55:56 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 - I 12-20 23:55:56 optimizer.py:910] --------------------------------------------------------------------------------------------- - I 12-20 23:55:56 optimizer.py:910] + + Considered resources (1 node): + --------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + --------------------------------------------------------------------------------------------- + GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 βœ” + AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 + GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39 + AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 + --------------------------------------------------------------------------------------------- + Launching a new cluster 'mycluster'. Proceed? [Y/n]: diff --git a/sky/adaptors/azure.py b/sky/adaptors/azure.py index 3da4febca3e..61d8d14352e 100644 --- a/sky/adaptors/azure.py +++ b/sky/adaptors/azure.py @@ -20,7 +20,9 @@ azure = common.LazyImport( 'azure', import_error_message=('Failed to import dependencies for Azure.' - 'Try pip install "skypilot[azure]"')) + 'Try pip install "skypilot[azure]"'), + set_loggers=lambda: logging.getLogger('azure.identity').setLevel(logging. 
+ ERROR)) Client = Any sky_logger = sky_logging.init_logger(__name__) diff --git a/sky/adaptors/common.py b/sky/adaptors/common.py index 5bcdaab9a4b..0cfb91cb587 100644 --- a/sky/adaptors/common.py +++ b/sky/adaptors/common.py @@ -1,7 +1,7 @@ """Lazy import for modules to avoid import error when not used.""" import functools import importlib -from typing import Any, Optional, Tuple +from typing import Any, Callable, Optional, Tuple class LazyImport: @@ -18,15 +18,19 @@ class LazyImport: def __init__(self, module_name: str, - import_error_message: Optional[str] = None): + import_error_message: Optional[str] = None, + set_loggers: Optional[Callable] = None): self._module_name = module_name self._module = None self._import_error_message = import_error_message + self._set_loggers = set_loggers def load_module(self): if self._module is None: try: self._module = importlib.import_module(self._module_name) + if self._set_loggers is not None: + self._set_loggers() except ImportError as e: if self._import_error_message is not None: raise ImportError(self._import_error_message) from e diff --git a/sky/backends/backend.py b/sky/backends/backend.py index 10389cf691e..10b51b06038 100644 --- a/sky/backends/backend.py +++ b/sky/backends/backend.py @@ -4,7 +4,9 @@ import sky from sky.usage import usage_lib +from sky.utils import rich_utils from sky.utils import timeline +from sky.utils import ux_utils if typing.TYPE_CHECKING: from sky import resources @@ -54,8 +56,9 @@ def provision( cluster_name = sky.backends.backend_utils.generate_cluster_name() usage_lib.record_cluster_name_for_current_operation(cluster_name) usage_lib.messages.usage.update_actual_task(task) - return self._provision(task, to_provision, dryrun, stream_logs, - cluster_name, retry_until_up) + with rich_utils.safe_status(ux_utils.spinner_message('Launching')): + return self._provision(task, to_provision, dryrun, stream_logs, + cluster_name, retry_until_up) @timeline.event @usage_lib.messages.usage.update_runtime('sync_workdir') @@ -76,7 +79,8 @@ def sync_file_mounts( @usage_lib.messages.usage.update_runtime('setup') def setup(self, handle: _ResourceHandleType, task: 'task_lib.Task', detach_setup: bool) -> None: - return self._setup(handle, task, detach_setup) + with rich_utils.safe_status(ux_utils.spinner_message('Running setup')): + return self._setup(handle, task, detach_setup) def add_storage_objects(self, task: 'task_lib.Task') -> None: raise NotImplementedError @@ -96,7 +100,8 @@ def execute(self, usage_lib.record_cluster_name_for_current_operation( handle.get_cluster_name()) usage_lib.messages.usage.update_actual_task(task) - return self._execute(handle, task, detach_run, dryrun) + with rich_utils.safe_status(ux_utils.spinner_message('Submitting job')): + return self._execute(handle, task, detach_run, dryrun) @timeline.event def post_execute(self, handle: _ResourceHandleType, down: bool) -> None: diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py index 24f638a12b9..1f213f5c614 100644 --- a/sky/backends/backend_utils.py +++ b/sky/backends/backend_utils.py @@ -70,9 +70,6 @@ SKY_REMOTE_PATH = '~/.sky/wheels' SKY_USER_FILE_PATH = '~/.sky/generated' -BOLD = '\033[1m' -RESET_BOLD = '\033[0m' - # Do not use /tmp because it gets cleared on VM restart. 
_SKY_REMOTE_FILE_MOUNTS_DIR = '~/.sky/file_mounts/' @@ -1171,7 +1168,8 @@ def wait_until_ray_cluster_ready( runner = command_runner.SSHCommandRunner(node=(head_ip, 22), **ssh_credentials) with rich_utils.safe_status( - '[bold cyan]Waiting for workers...') as worker_status: + ux_utils.spinner_message('Waiting for workers', + log_path=log_path)) as worker_status: while True: rc, output, stderr = runner.run( instance_setup.RAY_STATUS_WITH_SKY_RAY_PORT_COMMAND, @@ -1187,9 +1185,11 @@ def wait_until_ray_cluster_ready( ready_head, ready_workers = _count_healthy_nodes_from_ray( output, is_local_cloud=is_local_cloud) - worker_status.update('[bold cyan]' - f'{ready_workers} out of {num_nodes - 1} ' - 'workers ready') + worker_status.update( + ux_utils.spinner_message( + f'{ready_workers} out of {num_nodes - 1} ' + 'workers ready', + log_path=log_path)) # In the local case, ready_head=0 and ready_workers=num_nodes. This # is because there is no matching regex for _LAUNCHED_HEAD_PATTERN. @@ -1304,7 +1304,6 @@ def parallel_data_transfer_to_nodes( stream_logs: bool; Whether to stream logs to stdout source_bashrc: bool; Source bashrc before running the command. """ - fore = colorama.Fore style = colorama.Style origin_source = source @@ -1341,12 +1340,10 @@ def _sync_node(runner: 'command_runner.CommandRunner') -> None: num_nodes = len(runners) plural = 's' if num_nodes > 1 else '' - message = (f'{fore.CYAN}{action_message} (to {num_nodes} node{plural})' - f': {style.BRIGHT}{origin_source}{style.RESET_ALL} -> ' - f'{style.BRIGHT}{target}{style.RESET_ALL}') + message = (f' {style.DIM}{action_message} (to {num_nodes} node{plural})' + f': {origin_source} -> {target}{style.RESET_ALL}') logger.info(message) - with rich_utils.safe_status(f'[bold cyan]{action_message}[/]'): - subprocess_utils.run_in_parallel(_sync_node, runners) + subprocess_utils.run_in_parallel(_sync_node, runners) def check_local_gpus() -> bool: @@ -2488,9 +2485,9 @@ def get_clusters( progress = rich_progress.Progress(transient=True, redirect_stdout=False, redirect_stderr=False) - task = progress.add_task( - f'[bold cyan]Refreshing status for {len(records)} cluster{plural}[/]', - total=len(records)) + task = progress.add_task(ux_utils.spinner_message( + f'Refreshing status for {len(records)} cluster{plural}'), + total=len(records)) def _refresh_cluster(cluster_name): try: diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 714e4fc14eb..aceac8951b0 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -122,9 +122,6 @@ _TPU_NOT_FOUND_ERROR = 'ERROR: (gcloud.compute.tpus.delete) NOT_FOUND' -_CTRL_C_TIP_MESSAGE = ('INFO: Tip: use Ctrl-C to exit log streaming ' - '(task will not be killed).') - _MAX_RAY_UP_RETRY = 5 # Number of retries for getting zones. @@ -405,22 +402,35 @@ def add_gang_scheduling_placement_group_and_setup( **gpu_dict, }) + streaming_message = ( + f'{ux_utils.INDENT_LAST_SYMBOL}Job started. Streaming logs... ' + f'{colorama.Style.DIM}(Ctrl-C to exit log streaming; job will not ' + f'be killed){colorama.Style.RESET_ALL}') self._code += [ textwrap.dedent(f"""\ pg = ray_util.placement_group({json.dumps(bundles)}, 'STRICT_SPREAD') plural = 's' if {num_nodes} > 1 else '' node_str = f'{num_nodes} node{{plural}}' - message = {_CTRL_C_TIP_MESSAGE!r} + '\\n' - message += f'INFO: Waiting for task resources on {{node_str}}. This will block if the cluster is full.' 
- print(message, - flush=True) + # We have this `INFO: Tip:` message only for backward + # compatibility, because if a cluster has the old SkyPilot version, + # it relies on this message to start log streaming. + # This message will be skipped for new clusters, because we use + # start_streaming_at for the `Waiting for task resources on` + # message. + # TODO: Remove this message in v0.9.0. + message = ('{ux_utils.INDENT_SYMBOL}{colorama.Style.DIM}INFO: ' + 'Tip: use Ctrl-C to exit log streaming, not kill ' + 'the job.{colorama.Style.RESET_ALL}\\n') + message += ('{ux_utils.INDENT_SYMBOL}{colorama.Style.DIM}' + 'Waiting for task resources on ' + f'{{node_str}}.{colorama.Style.RESET_ALL}') + print(message, flush=True) # FIXME: This will print the error message from autoscaler if # it is waiting for other task to finish. We should hide the # error message. ray.get(pg.ready()) - print('INFO: All task resources reserved.', - flush=True) + print({streaming_message!r}, flush=True) """) ] @@ -496,7 +506,6 @@ def check_ip(): )).remote() for i in range(pg.bundle_count) ]) - print('INFO: Reserved IPs:', gang_scheduling_id_to_ip) cluster_ips_to_node_id = {{ip: i for i, ip in enumerate({stable_cluster_internal_ips!r})}} job_ip_rank_list = sorted(gang_scheduling_id_to_ip, key=cluster_ips_to_node_id.get) @@ -743,15 +752,14 @@ def _lambda_handler(blocked_resources: Set['resources_lib.Resources'], region: 'clouds.Region', zones: Optional[List['clouds.Zone']], stdout: str, stderr: str): - del zones # Unused. + del region, zones # Unused. errors = FailoverCloudErrorHandlerV1._handle_errors( stdout, stderr, is_error_str_known=lambda x: 'LambdaCloudError:' in x.strip()) - logger.warning(f'Got error(s) in {region.name}:') - messages = '\n\t'.join(errors) + messages = '\n '.join(errors) style = colorama.Style - logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}') + logger.warning(f' {style.DIM}{messages}{style.RESET_ALL}') _add_to_blocked_resources(blocked_resources, launchable_resources.copy(zone=None)) @@ -926,6 +934,10 @@ def _azure_handler(blocked_resources: Set['resources_lib.Resources'], _add_to_blocked_resources( blocked_resources, resources_lib.Resources(cloud=clouds.Azure())) + elif 'ClientAuthenticationError' in str(err): + _add_to_blocked_resources( + blocked_resources, + resources_lib.Resources(cloud=clouds.Azure())) else: _add_to_blocked_resources(blocked_resources, launchable_resources.copy(zone=None)) @@ -1224,9 +1236,10 @@ def _get_previously_launched_zones() -> Optional[List[clouds.Zone]]: if prev_cluster_status != status_lib.ClusterStatus.UP: logger.info( - f'Cluster {cluster_name!r} (status: ' - f'{prev_cluster_status.value}) was previously launched ' - f'in {cloud} {region.name}. Relaunching in that region.') + f'{colorama.Style.DIM}Cluster {cluster_name!r} (status: ' + f'{prev_cluster_status.value}) was previously in ' + f'{cloud} ({region.name}). Restarting.' 
+ f'{colorama.Style.RESET_ALL}') yield zones # If it reaches here: the cluster status in the database gets @@ -1303,17 +1316,14 @@ def _retry_zones( prev_cluster_ever_up: bool, ) -> Dict[str, Any]: """The provision retry loop.""" - style = colorama.Style - fore = colorama.Fore # Get log_path name log_path = os.path.join(self.log_dir, 'provision.log') log_abs_path = os.path.abspath(log_path) if not dryrun: os.makedirs(os.path.expanduser(self.log_dir), exist_ok=True) os.system(f'touch {log_path}') - tail_cmd = f'tail -n100 -f {log_path}' - logger.info('To view detailed progress: ' - f'{style.BRIGHT}{tail_cmd}{style.RESET_ALL}') + rich_utils.force_update_status( + ux_utils.spinner_message('Launching', log_path)) # Get previous cluster status cluster_exists = prev_cluster_status is not None @@ -1481,6 +1491,23 @@ def _retry_zones( if to_provision.cloud.OPEN_PORTS_VERSION <= clouds.OpenPortsVersion.LAUNCH_ONLY else None) try: + controller = controller_utils.Controllers.from_name( + cluster_name) + controller_str = ('' if controller is None else + f' {controller.value.name}') + if isinstance(to_provision.cloud, clouds.Kubernetes): + # Omit the region name for Kubernetes. + logger.info( + ux_utils.starting_message( + f'Launching{controller_str} on ' + f'{to_provision.cloud}.')) + else: + logger.info( + ux_utils.starting_message( + f'Launching{controller_str} on ' + f'{to_provision.cloud} ' + f'{region.name}{colorama.Style.RESET_ALL}' + f'{zone_str}.')) provision_record = provisioner.bulk_provision( to_provision.cloud, region, @@ -1528,6 +1555,7 @@ def _retry_zones( 'region_name': region.name, 'zone_str': zone_str, } + status, stdout, stderr, head_internal_ip, head_external_ip = ( self._gang_schedule_ray_up(to_provision.cloud, cluster_config_file, handle, @@ -1566,9 +1594,9 @@ def _retry_zones( self._ensure_cluster_ray_started(handle, log_abs_path) config_dict['handle'] = handle - plural = '' if num_nodes == 1 else 's' - logger.info(f'{fore.GREEN}Successfully provisioned or found' - f' existing VM{plural}.{style.RESET_ALL}') + logger.info( + ux_utils.finishing_message( + f'Cluster launched: {cluster_name!r}.', log_path)) return config_dict # The cluster is not ready. We must perform error recording and/or @@ -1633,17 +1661,15 @@ def _retry_zones( if to_provision.zone is not None: message = ( - f'Failed to acquire resources in {to_provision.zone}. ' - 'Try changing resource requirements or use another zone.') + f'Failed to acquire resources in {to_provision.zone} for ' + f'{requested_resources}. ') elif to_provision.region is not None: # For public clouds, provision.region is always set. message = ('Failed to acquire resources in all zones in ' - f'{to_provision.region}. Try changing resource ' - 'requirements or use another region.') + f'{to_provision.region} for {requested_resources}. ') else: - message = (f'Failed to acquire resources in {to_provision.cloud}. ' - 'Try changing resource requirements or use another ' - 'cloud provider.') + message = (f'Failed to acquire resources in {to_provision.cloud} ' + f'for {requested_resources}. ') # Do not failover to other locations if the cluster was ever up, since # the user can have some data on the cluster. 
raise exceptions.ResourcesUnavailableError( @@ -1694,7 +1720,7 @@ def ray_up(): log_abs_path, stream_logs=False, start_streaming_at='Shared connection to', - line_processor=log_utils.RayUpLineProcessor(), + line_processor=log_utils.RayUpLineProcessor(log_abs_path), # Reduce BOTO_MAX_RETRIES from 12 to 5 to avoid long hanging # time during 'ray up' if insufficient capacity occurs. env=dict( @@ -1714,13 +1740,14 @@ def ray_up(): region_name = logging_info['region_name'] zone_str = logging_info['zone_str'] - style = colorama.Style if isinstance(to_provision_cloud, clouds.Kubernetes): - logger.info(f'{style.BRIGHT}Launching on {to_provision_cloud} ' - f'{style.RESET_ALL}') + logger.info( + ux_utils.starting_message( + f'Launching on {to_provision_cloud}.')) else: - logger.info(f'{style.BRIGHT}Launching on {to_provision_cloud} ' - f'{region_name}{style.RESET_ALL}{zone_str}') + logger.info( + ux_utils.starting_message(f'Launching on {to_provision_cloud} ' + f'{region_name}{zone_str}.')) start = time.time() # Edge case: /tmp/ray does not exist, so autoscaler can't create/store @@ -1822,11 +1849,6 @@ def need_ray_up( head_internal_ip, head_external_ip) # All code below is handling num_nodes > 1. - provision_str = ('Successfully provisioned or found existing head ' - 'instance.') - logger.info(f'{style.BRIGHT}{provision_str} ' - f'Waiting for workers.{style.RESET_ALL}') - # FIXME(zongheng): the below requires ray processes are up on head. To # repro it failing: launch a 2-node cluster, log into head and ray # stop, then launch again. @@ -2006,13 +2028,6 @@ def provision_with_retries( # Provisioning succeeded. break - if to_provision.zone is None: - region_or_zone_str = str(to_provision.region) - else: - region_or_zone_str = str(to_provision.zone) - logger.warning(f'\n{style.BRIGHT}Provision failed for {num_nodes}x ' - f'{to_provision} in {region_or_zone_str}. ' - f'Trying other locations (if any).{style.RESET_ALL}') if prev_cluster_status is None: # Add failed resources to the blocklist, only when it # is in fallback mode. @@ -2027,8 +2042,10 @@ def provision_with_retries( ), prev_cluster_status assert global_user_state.get_handle_from_cluster_name( cluster_name) is None, cluster_name - logger.info('Retrying provisioning with requested resources ' - f'{task.num_nodes}x {task.resources}') + logger.info( + ux_utils.retry_message( + f'Retrying provisioning with requested resources: ' + f'{task.num_nodes}x {task.resources}')) # Retry with the current, potentially "smaller" resources: # to_provision == the current new resources (e.g., V100:1), # which may be "smaller" than the original (V100:8). @@ -2038,6 +2055,12 @@ def provision_with_retries( prev_cluster_status = None prev_handle = None + retry_message = ux_utils.retry_message( + 'Trying other potential resources.') + logger.warning(f'\n{retry_message}') + log_path = os.path.join(self.log_dir, 'provision.log') + rich_utils.force_update_status( + ux_utils.spinner_message('Looking for resources', log_path)) # Set to None so that sky.optimize() will assign a new one # (otherwise will skip re-optimizing this task). # TODO: set all remaining tasks' best_resources to None. 
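The failover behavior these hunks refine boils down to a blocklist loop: each launch failure blocks that location, the optimizer picks the next-cheapest remaining candidate, and the loop ends when a launch succeeds or no candidates remain. The sketch below is illustrative only; `choose_cheapest` and `try_provision` are hypothetical stand-ins for the optimizer and `_retry_zones`, not SkyPilot APIs.

```python
from typing import List, Optional, Set


class ResourcesUnavailable(Exception):
    """Stand-in for sky.exceptions.ResourcesUnavailableError."""


def choose_cheapest(candidates: List[str],
                    blocked: Set[str]) -> Optional[str]:
    """Optimizer stand-in: return the cheapest location not yet blocked."""
    for candidate in candidates:  # assumed pre-sorted by cost
        if candidate not in blocked:
            return candidate
    return None


def try_provision(location: str) -> str:
    """_retry_zones() stand-in: pretend only us-west1 has capacity."""
    if location != 'gcp/us-west1':
        raise ResourcesUnavailable(f'no capacity in {location}')
    return location


def provision_with_failover(candidates: List[str]) -> str:
    blocked: Set[str] = set()
    while True:
        choice = choose_cheapest(candidates, blocked)
        if choice is None:
            # Corresponds to the 'Failed to provision all possible
            # launchable resources' error path in the hunks above.
            raise ResourcesUnavailable('all locations exhausted')
        try:
            return try_provision(choice)
        except ResourcesUnavailable:
            blocked.add(choice)  # fail over to the next location


print(provision_with_failover(
    ['gcp/us-central1', 'gcp/us-east1', 'gcp/us-west1']))
# gcp/us-west1
```

In the real loop, a retry additionally resets `best_resources` to None so that `sky.optimize()` re-plans the remaining tasks, as the hunk above notes.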
@@ -2781,6 +2804,9 @@ def _provision( local_wheel_path, wheel_hash, blocked_resources=task.blocked_resources) + log_path = os.path.join(self.log_dir, 'provision.log') + rich_utils.force_update_status( + ux_utils.spinner_message('Launching', log_path)) config_dict = retry_provisioner.provision_with_retries( task, to_provision_config, dryrun, stream_logs) break @@ -2796,27 +2822,34 @@ def _provision( usage_lib.messages.usage.update_final_cluster_status( None) error_message = ( - 'Failed to provision all possible launchable ' - 'resources.' - f' Relax the task\'s resource requirements: ' + f'{colorama.Fore.RED}Failed to provision all ' + f'possible launchable resources.' + f'{colorama.Style.RESET_ALL}' + ' Relax the task\'s resource requirements: ' f'{task.num_nodes}x {list(task.resources)[0]}') + + log_path = retry_provisioner.log_dir + '/provision.log' if retry_until_up: logger.error(error_message) # Sleep and retry. gap_seconds = backoff.current_backoff() plural = 's' if attempt_cnt > 1 else '' - logger.info( - f'{colorama.Style.BRIGHT}=== Retry until up ===' - f'{colorama.Style.RESET_ALL}\n' - f'Retrying provisioning after {gap_seconds:.0f}s ' - '(backoff with random jittering). ' - f'Already tried {attempt_cnt} attempt{plural}.') + retry_message = ux_utils.retry_message( + f'Retry after {gap_seconds:.0f}s ' + f'({attempt_cnt} attempt{plural}). ') + logger.info(f'\n{retry_message} ' + f'{ux_utils.log_path_hint(log_path)}' + f'{colorama.Style.RESET_ALL}') attempt_cnt += 1 time.sleep(gap_seconds) continue + logger.error( + f'{colorama.Fore.RED}β¨―{colorama.Style.RESET_ALL} ' + 'Failed to provision resources. ' + f'{ux_utils.log_path_hint(log_path)}') error_message += ( - '\nTo keep retrying until the cluster is up, use the ' - '`--retry-until-up` flag.') + '\nTo keep retrying until the cluster is up, use ' + 'the `--retry-until-up` flag.') with ux_utils.print_exception_no_traceback(): raise exceptions.ResourcesUnavailableError( error_message, @@ -2927,7 +2960,7 @@ def _get_zone(runner): # and restarted if necessary. logger.debug('Checking if skylet is running on the head node.') with rich_utils.safe_status( - '[bold cyan]Preparing SkyPilot runtime'): + ux_utils.spinner_message('Preparing SkyPilot runtime')): # We need to source bashrc for skylet to make sure the autostop # event can access the path to the cloud CLIs. self.run_on_head(handle, @@ -2970,7 +3003,7 @@ def _update_after_cluster_provisioned( cmd = job_lib.JobLibCodeGen.update_status() logger.debug('Update job queue on remote cluster.') with rich_utils.safe_status( - '[bold cyan]Preparing SkyPilot runtime'): + ux_utils.spinner_message('Preparing SkyPilot runtime')): returncode, _, stderr = self.run_on_head(handle, cmd, require_outputs=True) @@ -3005,7 +3038,8 @@ def _update_after_cluster_provisioned( if not (cloud.OPEN_PORTS_VERSION <= clouds.OpenPortsVersion.LAUNCH_ONLY): with rich_utils.safe_status( - '[bold cyan]Launching - Opening new ports'): + ux_utils.spinner_message( + 'Launching - Opening new ports')): self._open_ports(handle) with timeline.Event('backend.provision.post_process'): @@ -3054,7 +3088,7 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle, dir_size = backend_utils.path_size_megabytes(full_workdir) if dir_size >= _PATH_SIZE_MEGABYTES_WARN_THRESHOLD: logger.warning( - f'{fore.YELLOW}The size of workdir {workdir!r} ' + f' {fore.YELLOW}The size of workdir {workdir!r} ' f'is {dir_size} MB. 
Try to keep workdir small or use ' '.skyignore to exclude large files, as large sizes will slow ' f'down rsync.{style.RESET_ALL}') @@ -3076,17 +3110,14 @@ def _sync_workdir_node(runner: command_runner.CommandRunner) -> None: num_nodes = handle.launched_nodes plural = 's' if num_nodes > 1 else '' logger.info( - f'{fore.CYAN}Syncing workdir (to {num_nodes} node{plural}): ' - f'{style.BRIGHT}{workdir}{style.RESET_ALL}' - f' -> ' - f'{style.BRIGHT}{SKY_REMOTE_WORKDIR}{style.RESET_ALL}') + f' {style.DIM}Syncing workdir (to {num_nodes} node{plural}): ' + f'{workdir} -> {SKY_REMOTE_WORKDIR}{style.RESET_ALL}') os.makedirs(os.path.expanduser(self.log_dir), exist_ok=True) os.system(f'touch {log_path}') - tail_cmd = f'tail -n100 -f {log_path}' - logger.info('To view detailed progress: ' - f'{style.BRIGHT}{tail_cmd}{style.RESET_ALL}') - with rich_utils.safe_status('[bold cyan]Syncing[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Syncing workdir', log_path)): subprocess_utils.run_in_parallel(_sync_workdir_node, runners) + logger.info(ux_utils.finishing_message('Workdir synced.', log_path)) def _sync_file_mounts( self, @@ -3095,17 +3126,17 @@ def _sync_file_mounts( storage_mounts: Optional[Dict[Path, storage_lib.Storage]], ) -> None: """Mounts all user files to the remote nodes.""" - controller_utils.replace_skypilot_config_path_in_file_mounts( - handle.launched_resources.cloud, all_file_mounts) - self._execute_file_mounts(handle, all_file_mounts) - self._execute_storage_mounts(handle, storage_mounts) - self._set_storage_mounts_metadata(handle.cluster_name, storage_mounts) + with rich_utils.safe_status(ux_utils.spinner_message('Syncing files')): + controller_utils.replace_skypilot_config_path_in_file_mounts( + handle.launched_resources.cloud, all_file_mounts) + self._execute_file_mounts(handle, all_file_mounts) + self._execute_storage_mounts(handle, storage_mounts) + self._set_storage_mounts_metadata(handle.cluster_name, + storage_mounts) def _setup(self, handle: CloudVmRayResourceHandle, task: task_lib.Task, detach_setup: bool) -> None: start = time.time() - style = colorama.Style - fore = colorama.Fore if task.setup is None: return @@ -3161,7 +3192,7 @@ def _run_setup(setup_cmd: str) -> int: # and source ~/.bashrc in the setup_cmd. # bash: cannot set terminal process group (7398): Inappropriate ioctl for device # pylint: disable=line-too-long # bash: no job control in this shell - skip_lines=3) + skip_num_lines=3) return returncode returncode = _run_setup(f'{create_script_code} && {setup_cmd}',) @@ -3212,23 +3243,33 @@ def error_message() -> str: num_nodes = len(runners) plural = 's' if num_nodes > 1 else '' + node_str = f'{num_nodes} VM{plural}' + if isinstance(handle.launched_resources.cloud, clouds.Kubernetes): + node_str = f'{num_nodes} pod{plural}' + controller = controller_utils.Controllers.from_name(handle.cluster_name) + if controller is not None: + node_str = controller.value.name if not detach_setup: - logger.info(f'{fore.CYAN}Running setup on {num_nodes} node{plural}.' - f'{style.RESET_ALL}') + logger.info( + ux_utils.starting_message(f'Running setup on {node_str}.')) # TODO(zhwu): run_in_parallel uses multi-thread to run the commands, # which can cause the program waiting for all the threads to finish, # even if some of them raise exceptions. We should replace it with # multi-process. 
+ rich_utils.stop_safe_status() subprocess_utils.run_in_parallel(_setup_node, range(num_nodes)) if detach_setup: # Only set this when setup needs to be run outside the self._setup() # as part of a job (--detach-setup). self._setup_cmd = setup_cmd + logger.info(ux_utils.finishing_message('Setup completed.')) return - logger.info(f'{fore.GREEN}Setup completed.{style.RESET_ALL}') end = time.time() logger.debug(f'Setup took {end - start} seconds.') + setup_log_path = os.path.join(self.log_dir, 'setup-*.log') + logger.info( + ux_utils.finishing_message('Setup completed.', setup_log_path)) def _exec_code_on_head( self, @@ -3240,7 +3281,6 @@ def _exec_code_on_head( ) -> None: """Executes generated code on the head node.""" style = colorama.Style - fore = colorama.Fore script_path = os.path.join(SKY_REMOTE_APP_DIR, f'sky_job_{job_id}') remote_log_dir = self.log_dir @@ -3330,9 +3370,13 @@ def _dump_code_to_file(codegen: str) -> None: f'Failed to submit job {job_id}.', stderr=stdout + stderr) - logger.info('Job submitted with Job ID: ' - f'{style.BRIGHT}{job_id}{style.RESET_ALL}') - + controller = controller_utils.Controllers.from_name(handle.cluster_name) + if controller == controller_utils.Controllers.SKY_SERVE_CONTROLLER: + logger.info(ux_utils.starting_message('Service registered.')) + else: + logger.info( + ux_utils.starting_message(f'Job submitted, ID: {job_id}')) + rich_utils.stop_safe_status() try: if not detach_run: if (handle.cluster_name in controller_utils.Controllers. @@ -3347,35 +3391,37 @@ def _dump_code_to_file(codegen: str) -> None: controller = controller_utils.Controllers.from_name(name) if controller == controller_utils.Controllers.JOBS_CONTROLLER: logger.info( - f'{fore.CYAN}Managed Job ID: ' + f'\nπŸ“‹ Useful Commands' + f'\nManaged Job ID: ' f'{style.BRIGHT}{job_id}{style.RESET_ALL}' - '\nTo cancel the job:\t\t' - f'{backend_utils.BOLD}sky jobs cancel {job_id}' - f'{backend_utils.RESET_BOLD}' - '\nTo stream job logs:\t\t' - f'{backend_utils.BOLD}sky jobs logs {job_id}' - f'{backend_utils.RESET_BOLD}' - f'\nTo stream controller logs:\t' - f'{backend_utils.BOLD}sky jobs logs --controller {job_id}' - f'{backend_utils.RESET_BOLD}' - '\nTo view all managed jobs:\t' - f'{backend_utils.BOLD}sky jobs queue' - f'{backend_utils.RESET_BOLD}' - '\nTo view managed job dashboard:\t' - f'{backend_utils.BOLD}sky jobs dashboard' - f'{backend_utils.RESET_BOLD}') + f'\n{ux_utils.INDENT_SYMBOL}To cancel the job:\t\t\t' + f'{ux_utils.BOLD}sky jobs cancel {job_id}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To stream job logs:\t\t\t' + f'{ux_utils.BOLD}sky jobs logs {job_id}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To stream controller logs:\t\t' + f'{ux_utils.BOLD}sky jobs logs --controller {job_id}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To view all managed jobs:\t\t' + f'{ux_utils.BOLD}sky jobs queue' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_LAST_SYMBOL}To view managed job ' + f'dashboard:\t{ux_utils.BOLD}sky jobs dashboard' + f'{ux_utils.RESET_BOLD}') elif controller is None: - logger.info(f'{fore.CYAN}Job ID: ' - f'{style.BRIGHT}{job_id}{style.RESET_ALL}' - '\nTo cancel the job:\t' - f'{backend_utils.BOLD}sky cancel {name} {job_id}' - f'{backend_utils.RESET_BOLD}' - '\nTo stream job logs:\t' - f'{backend_utils.BOLD}sky logs {name} {job_id}' - f'{backend_utils.RESET_BOLD}' - '\nTo view the job queue:\t' - f'{backend_utils.BOLD}sky queue {name}' - f'{backend_utils.RESET_BOLD}') + logger.info(f'\nπŸ“‹ Useful Commands' + f'\nJob ID: 
{job_id}' + f'\n{ux_utils.INDENT_SYMBOL}To cancel the job:\t\t' + f'{ux_utils.BOLD}sky cancel {name} {job_id}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To stream job logs:\t\t' + f'{ux_utils.BOLD}sky logs {name} {job_id}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_LAST_SYMBOL}To view job ' + 'queue:\t\t' + f'{ux_utils.BOLD}sky queue {name}' + f'{ux_utils.RESET_BOLD}') def _add_job(self, handle: CloudVmRayResourceHandle, job_name: Optional[str], resources_str: str) -> int: @@ -3452,27 +3498,23 @@ def _execute( def _post_execute(self, handle: CloudVmRayResourceHandle, down: bool) -> None: - fore = colorama.Fore - style = colorama.Style name = handle.cluster_name controller = controller_utils.Controllers.from_name(name) - if controller is not None or down: + if controller is not None: return - stop_str = ('\nTo stop the cluster:' - f'\t{backend_utils.BOLD}sky stop {name}' - f'{backend_utils.RESET_BOLD}') - logger.info(f'\n{fore.CYAN}Cluster name: ' - f'{style.BRIGHT}{name}{style.RESET_ALL}' - '\nTo log into the head VM:\t' - f'{backend_utils.BOLD}ssh {name}' - f'{backend_utils.RESET_BOLD}' - '\nTo submit a job:' - f'\t\t{backend_utils.BOLD}sky exec {name} yaml_file' - f'{backend_utils.RESET_BOLD}' - f'{stop_str}' - '\nTo teardown the cluster:' - f'\t{backend_utils.BOLD}sky down {name}' - f'{backend_utils.RESET_BOLD}') + logger.info(f'\nCluster name: {name}' + f'\n{ux_utils.INDENT_SYMBOL}To log into the head VM:\t' + f'{ux_utils.BOLD}ssh {name}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To submit a job:' + f'\t\t{ux_utils.BOLD}sky exec {name} yaml_file' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To stop the cluster:' + f'\t{ux_utils.BOLD}sky stop {name}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_LAST_SYMBOL}To teardown the cluster:' + f'\t{ux_utils.BOLD}sky down {name}' + f'{ux_utils.RESET_BOLD}') if (gcp_utils.is_tpu(handle.launched_resources) and not gcp_utils.is_tpu_vm(handle.launched_resources)): logger.info('Tip: `sky down` will delete launched TPU(s) too.') @@ -3808,11 +3850,20 @@ def teardown_no_lock(self, Raises: RuntimeError: If the cluster fails to be terminated/stopped. """ + cluster_status_fetched = False if refresh_cluster_status: - prev_cluster_status, _ = ( - backend_utils.refresh_cluster_status_handle( - handle.cluster_name, acquire_per_cluster_status_lock=False)) - else: + try: + prev_cluster_status, _ = ( + backend_utils.refresh_cluster_status_handle( + handle.cluster_name, + acquire_per_cluster_status_lock=False)) + cluster_status_fetched = True + except exceptions.ClusterStatusFetchingError: + logger.warning( + 'Failed to fetch cluster status for ' + f'{handle.cluster_name!r}. Assuming the cluster is still ' + 'up.') + if not cluster_status_fetched: record = global_user_state.get_cluster_from_name( handle.cluster_name) prev_cluster_status = record[ @@ -3972,8 +4023,9 @@ def teardown_no_lock(self, f.flush() teardown_verb = 'Terminating' if terminate else 'Stopping' - with rich_utils.safe_status(f'[bold cyan]{teardown_verb} ' - f'[green]{cluster_name}'): + with rich_utils.safe_status( + ux_utils.spinner_message( + f'{teardown_verb}: {cluster_name}', log_path)): # FIXME(zongheng): support retries. This call can fail for # example due to GCP returning list requests per limit # exceeded. 
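The "πŸ“‹ Useful Commands" blocks above are built from `ux_utils.INDENT_SYMBOL` and `ux_utils.INDENT_LAST_SYMBOL`, i.e. tree-style list rendering where only the final entry gets the closing connector. A self-contained sketch of that pattern follows; the glyph values and the `format_hints` helper are assumptions for illustration, not the actual ux_utils implementation.

```python
from typing import List, Tuple

# Assumed tree-drawing glyphs; the real constants live in sky/utils/ux_utils.py.
INDENT_SYMBOL = 'β”œβ”€β”€ '
INDENT_LAST_SYMBOL = '└── '


def format_hints(title: str, hints: List[Tuple[str, str]]) -> str:
    """Render (label, command) pairs as a tree, marking the last entry."""
    lines = [title]
    for i, (label, command) in enumerate(hints):
        prefix = INDENT_LAST_SYMBOL if i == len(hints) - 1 else INDENT_SYMBOL
        lines.append(f'{prefix}{label}:\t{command}')
    return '\n'.join(lines)


print(format_hints('πŸ“‹ Useful Commands', [
    ('To cancel the job', 'sky cancel mycluster 1'),
    ('To stream job logs', 'sky logs mycluster 1'),
    ('To view the job queue', 'sky queue mycluster'),
]))
```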
@@ -4053,7 +4105,8 @@ def post_teardown_cleanup(self, config = common_utils.read_yaml(handle.cluster_yaml) tpu_node_config = config['provider'].get('tpu_node') if tpu_node_config is None: - with rich_utils.safe_status('[bold cyan]Terminating TPU...'): + with rich_utils.safe_status( + ux_utils.spinner_message('Terminating TPU')): tpu_rc, tpu_stdout, tpu_stderr = log_lib.run_with_log( ['bash', handle.tpu_delete_script], log_abs_path, @@ -4425,13 +4478,6 @@ def _check_existing_cluster( to_provision = handle_before_refresh.launched_resources self.check_resources_fit_cluster(handle_before_refresh, task) - logger.info( - f'{colorama.Fore.CYAN}Creating a new cluster: {cluster_name!r} ' - f'[{task.num_nodes}x {to_provision}].' - f'{colorama.Style.RESET_ALL}\n' - 'Tip: to reuse an existing cluster, ' - 'specify --cluster (-c). ' - 'Run `sky status` to see existing clusters.') return RetryingVmProvisioner.ToProvisionConfig( cluster_name, to_provision, @@ -4454,7 +4500,6 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle, symlink_commands = [] fore = colorama.Fore style = colorama.Style - logger.info(f'{fore.CYAN}Processing file mounts.{style.RESET_ALL}') start = time.time() runners = handle.get_command_runners() log_path = os.path.join(self.log_dir, 'file_mounts.log') @@ -4468,20 +4513,20 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle, src_size = backend_utils.path_size_megabytes(full_src) if src_size >= _PATH_SIZE_MEGABYTES_WARN_THRESHOLD: logger.warning( - f'{fore.YELLOW}The size of file mount src {src!r} ' + f' {fore.YELLOW}The size of file mount src {src!r} ' f'is {src_size} MB. Try to keep src small or use ' '.skyignore to exclude large files, as large sizes ' f'will slow down rsync. {style.RESET_ALL}') if os.path.islink(full_src): logger.warning( - f'{fore.YELLOW}Source path {src!r} is a symlink. ' + f' {fore.YELLOW}Source path {src!r} is a symlink. ' f'Symlink contents are not uploaded.{style.RESET_ALL}') os.makedirs(os.path.expanduser(self.log_dir), exist_ok=True) os.system(f'touch {log_path}') - tail_cmd = f'tail -n100 -f {log_path}' - logger.info('To view detailed progress: ' - f'{style.BRIGHT}{tail_cmd}{style.RESET_ALL}') + + rich_utils.force_update_status( + ux_utils.spinner_message('Syncing file mounts', log_path)) for dst, src in file_mounts.items(): # TODO: room for improvement. Here there are many moving parts @@ -4576,6 +4621,7 @@ def _symlink_node(runner: command_runner.CommandRunner): subprocess_utils.run_in_parallel(_symlink_node, runners) end = time.time() logger.debug(f'File mount sync took {end - start} seconds.') + logger.info(ux_utils.finishing_message('Files synced.', log_path)) def _execute_storage_mounts( self, handle: CloudVmRayResourceHandle, @@ -4599,16 +4645,15 @@ def _execute_storage_mounts( # Handle cases when there aren't any Storages with MOUNT mode. 
if not storage_mounts: return - - fore = colorama.Fore - style = colorama.Style - plural = 's' if len(storage_mounts) > 1 else '' - logger.info(f'{fore.CYAN}Processing {len(storage_mounts)} ' - f'storage mount{plural}.{style.RESET_ALL}') start = time.time() runners = handle.get_command_runners() log_path = os.path.join(self.log_dir, 'storage_mounts.log') + plural = 's' if len(storage_mounts) > 1 else '' + rich_utils.force_update_status( + ux_utils.spinner_message( + f'Mounting {len(storage_mounts)} storage{plural}', log_path)) + for dst, storage_obj in storage_mounts.items(): if not os.path.isabs(dst) and not dst.startswith('~/'): dst = f'{SKY_REMOTE_WORKDIR}/{dst}' @@ -4662,6 +4707,7 @@ def _execute_storage_mounts( end = time.time() logger.debug(f'Storage mount sync took {end - start} seconds.') + logger.info(ux_utils.finishing_message('Storage mounted.', log_path)) def _set_storage_mounts_metadata( self, cluster_name: str, diff --git a/sky/backends/local_docker_backend.py b/sky/backends/local_docker_backend.py index 78619943e8c..2cc3f3347a5 100644 --- a/sky/backends/local_docker_backend.py +++ b/sky/backends/local_docker_backend.py @@ -14,6 +14,7 @@ from sky.backends import docker_utils from sky.data import storage as storage_lib from sky.utils import rich_utils +from sky.utils import ux_utils if typing.TYPE_CHECKING: from sky import resources @@ -159,7 +160,8 @@ def _provision( handle = LocalDockerResourceHandle(cluster_name) logger.info(f'Building docker image for task {task.name}. ' 'This might take some time.') - with rich_utils.safe_status('[bold cyan]Building Docker image[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Building Docker image')): image_tag, metadata = docker_utils.build_dockerimage_from_task(task) self.images[handle] = (image_tag, metadata) logger.info(f'Image {image_tag} built.') diff --git a/sky/benchmark/benchmark_utils.py b/sky/benchmark/benchmark_utils.py index 11160332209..c9c17f00944 100644 --- a/sky/benchmark/benchmark_utils.py +++ b/sky/benchmark/benchmark_utils.py @@ -595,7 +595,8 @@ def update_benchmark_state(benchmark: str) -> None: remote_dir = os.path.join(bucket_name, benchmark) local_dir = os.path.join(_SKY_LOCAL_BENCHMARK_DIR, benchmark) os.makedirs(local_dir, exist_ok=True) - with rich_utils.safe_status('[bold cyan]Downloading benchmark logs[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Downloading benchmark logs')): _download_remote_dir(remote_dir, local_dir, bucket_type) # Update the benchmark results in parallel. 
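The mount and benchmark hunks above all follow one lifecycle for long-running phases: open a spinner with `rich_utils.safe_status(...)`, push sub-step updates through `rich_utils.force_update_status(...)`, and emit a persistent finishing line once the phase is done. A self-contained toy illustrating that lifecycle; `safe_status` and `force_update_status` below are stand-in stubs, not SkyPilot's rich-backed versions, and the bucket names and log path are made up:

```python
# Toy illustration of the spinner lifecycle adopted in this patch:
# start -> force-update per sub-step -> persistent finishing log line.
import contextlib
import time
from typing import Iterator, List


@contextlib.contextmanager
def safe_status(message: str) -> Iterator[None]:
    print(f'[spinner] {message}')  # a real spinner would animate in place
    try:
        yield
    finally:
        pass  # a real spinner would be cleared from the terminal here


def force_update_status(message: str) -> None:
    print(f'[spinner] {message}')  # stand-in for updating the live text


def mount_storages(storages: List[str], log_path: str) -> None:
    plural = 's' if len(storages) > 1 else ''
    with safe_status(f'Mounting {len(storages)} storage{plural}'):
        for i, name in enumerate(storages, start=1):
            force_update_status(f'Mounting {name} ({i}/{len(storages)})')
            time.sleep(0.1)  # placeholder for the actual mount work
    # The finishing line goes to the logger so it outlives the spinner.
    print(f'βœ“ Storage mounted. View logs at: {log_path}')


if __name__ == '__main__':
    mount_storages(['s3://my-bucket', 'gs://my-data'],
                   log_path='~/sky_logs/storage_mounts.log')
```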
@@ -604,9 +605,9 @@ def update_benchmark_state(benchmark: str) -> None: progress = rich_progress.Progress(transient=True, redirect_stdout=False, redirect_stderr=False) - task = progress.add_task( - f'[bold cyan]Processing {num_candidates} benchmark result{plural}[/]', - total=num_candidates) + task = progress.add_task(ux_utils.spinner_message( + f'Processing {num_candidates} benchmark result{plural}'), + total=num_candidates) def _update_with_progress_bar(arg: Any) -> None: message = _update_benchmark_result(arg) diff --git a/sky/cli.py b/sky/cli.py index 70c4a13704f..87d35f58d1c 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -1814,7 +1814,8 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: if show_managed_jobs: click.echo(f'\n{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' f'Managed jobs{colorama.Style.RESET_ALL}') - with rich_utils.safe_status('[cyan]Checking managed jobs[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Checking managed jobs')): managed_jobs_query_interrupted, result = _try_get_future_result( managed_jobs_future) if managed_jobs_query_interrupted: @@ -1855,7 +1856,8 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: # The pool is terminated, so we cannot run the service query. msg = 'KeyboardInterrupt' else: - with rich_utils.safe_status('[cyan]Checking services[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Checking services')): interrupted, result = _try_get_future_result( services_future) if interrupted: @@ -2551,8 +2553,8 @@ def start( 'is currently not supported.\n' 'Please start the former independently.') if controllers: - bold = backend_utils.BOLD - reset_bold = backend_utils.RESET_BOLD + bold = ux_utils.BOLD + reset_bold = ux_utils.RESET_BOLD if len(controllers) != 1: raise click.UsageError( 'Starting multiple controllers is currently not supported.\n' @@ -2673,7 +2675,7 @@ def _hint_or_raise_for_down_jobs_controller(controller_name: str): assert controller is not None, controller_name with rich_utils.safe_status( - '[bold cyan]Checking for in-progress managed jobs[/]'): + ux_utils.spinner_message('Checking for in-progress managed jobs')): try: managed_jobs_ = managed_jobs.queue(refresh=False, skip_finished=True) @@ -2725,7 +2727,8 @@ def _hint_or_raise_for_down_sky_serve_controller(controller_name: str): """ controller = controller_utils.Controllers.from_name(controller_name) assert controller is not None, controller_name - with rich_utils.safe_status('[bold cyan]Checking for live services[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Checking for live services')): try: services = serve_lib.status() except exceptions.ClusterNotUpError as e: @@ -2909,9 +2912,9 @@ def _down_or_stop_clusters( progress = rich_progress.Progress(transient=True, redirect_stdout=False, redirect_stderr=False) - task = progress.add_task( - f'[bold cyan]{operation} {len(clusters)} cluster{plural}[/]', - total=len(clusters)) + task = progress.add_task(ux_utils.spinner_message( + f'{operation} {len(clusters)} cluster{plural}'), + total=len(clusters)) def _down_or_stop(name: str): success_progress = False @@ -3680,7 +3683,7 @@ def jobs_launch( dag_utils.fill_default_config_in_dag_for_job_launch(dag) click.secho(f'Managed job {dag.name!r} will be launched on (estimated):', - fg='yellow') + fg='cyan') dag = sky.optimize(dag) if not yes: @@ -3774,7 +3777,8 @@ def jobs_queue(all: bool, refresh: bool, skip_finished: bool): """ click.secho('Fetching managed job statuses...', fg='yellow') - with rich_utils.safe_status('[cyan]Checking 
managed jobs[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message('Checking managed jobs')): _, msg = _get_managed_jobs(refresh=refresh, skip_finished=skip_finished, show_all=all, @@ -3825,10 +3829,12 @@ def jobs_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): # Cancel managed jobs with IDs 1, 2, 3 $ sky jobs cancel 1 2 3 """ - backend_utils.is_controller_accessible( - controller=controller_utils.Controllers.JOBS_CONTROLLER, - stopped_message='All managed jobs should have finished.', - exit_if_not_accessible=True) + with rich_utils.safe_status( + ux_utils.spinner_message('Checking managed jobs')): + backend_utils.is_controller_accessible( + controller=controller_utils.Controllers.JOBS_CONTROLLER, + stopped_message='All managed jobs should have finished.', + exit_if_not_accessible=True) job_id_str = ','.join(map(str, job_ids)) if sum([len(job_ids) > 0, name is not None, all]) != 1: @@ -4390,7 +4396,7 @@ def serve_status(all: bool, endpoint: bool, service_names: List[str]): sky serve status my-service """ # This won't pollute the output of --endpoint. - with rich_utils.safe_status('[cyan]Checking services[/]'): + with rich_utils.safe_status(ux_utils.spinner_message('Checking services')): _, msg = _get_services(service_names, show_all=all, show_endpoint=endpoint, @@ -4814,11 +4820,11 @@ def benchmark_launch( f'\n{colorama.Fore.CYAN}Benchmark name: ' f'{colorama.Style.BRIGHT}{benchmark}{colorama.Style.RESET_ALL}' '\nTo see the benchmark results: ' - f'{backend_utils.BOLD}sky bench show ' - f'{benchmark}{backend_utils.RESET_BOLD}' + f'{ux_utils.BOLD}sky bench show ' + f'{benchmark}{ux_utils.RESET_BOLD}' '\nTo teardown the clusters: ' - f'{backend_utils.BOLD}sky bench down ' - f'{benchmark}{backend_utils.RESET_BOLD}') + f'{ux_utils.BOLD}sky bench down ' + f'{benchmark}{ux_utils.RESET_BOLD}') subprocess_utils.run('sky bench ls') else: logger.error('No benchmarking clusters are created.') @@ -5109,9 +5115,9 @@ def benchmark_delete(benchmarks: Tuple[str], all: Optional[bool], progress = rich_progress.Progress(transient=True, redirect_stdout=False, redirect_stderr=False) - task = progress.add_task( - f'[bold cyan]Deleting {len(to_delete)} benchmark{plural}: ', - total=len(to_delete)) + task = progress.add_task(ux_utils.spinner_message( + f'Deleting {len(to_delete)} benchmark{plural}'), + total=len(to_delete)) def _delete_benchmark(benchmark: str) -> None: clusters = benchmark_state.get_benchmark_clusters(benchmark) @@ -5126,8 +5132,8 @@ def _delete_benchmark(benchmark: str) -> None: message = (f'{colorama.Fore.YELLOW}Benchmark {benchmark} ' f'has {num_clusters} un-terminated cluster{plural}. 
' f'Terminate the cluster{plural} with ' - f'{backend_utils.BOLD} sky bench down {benchmark} ' - f'{backend_utils.RESET_BOLD} ' + f'{ux_utils.BOLD} sky bench down {benchmark} ' + f'{ux_utils.RESET_BOLD} ' 'before deleting the benchmark report.') success = False else: @@ -5228,7 +5234,7 @@ def _deploy_local_cluster(gpus: bool): f'Full log: {log_path}' f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') # Run sky check - with rich_utils.safe_status('[bold cyan]Running sky check...'): + with rich_utils.safe_status(ux_utils.spinner_message('Running sky check')): sky_check.check(clouds=['kubernetes'], quiet=True) if cluster_created: # Prepare completion message which shows CPU and GPU count @@ -5425,7 +5431,8 @@ def local_down(): 'local_down.log') tail_cmd = 'tail -n100 -f ' + log_path - with rich_utils.safe_status('[bold cyan]Removing local cluster...'): + with rich_utils.safe_status( + ux_utils.spinner_message('Removing local cluster')): style = colorama.Style click.echo('To view detailed progress: ' f'{style.BRIGHT}{tail_cmd}{style.RESET_ALL}') @@ -5448,7 +5455,8 @@ def local_down(): f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') if cluster_removed: # Run sky check - with rich_utils.safe_status('[bold cyan]Running sky check...'): + with rich_utils.safe_status( + ux_utils.spinner_message('Running sky check')): sky_check.check(clouds=['kubernetes'], quiet=True) click.echo( f'{colorama.Fore.GREEN}Local cluster removed.{style.RESET_ALL}') diff --git a/sky/clouds/service_catalog/aws_catalog.py b/sky/clouds/service_catalog/aws_catalog.py index 77d080c8999..a44750c4ec4 100644 --- a/sky/clouds/service_catalog/aws_catalog.py +++ b/sky/clouds/service_catalog/aws_catalog.py @@ -10,8 +10,6 @@ import typing from typing import Dict, List, Optional, Tuple -import colorama - from sky import exceptions from sky import sky_logging from sky.adaptors import common as adaptors_common @@ -21,6 +19,8 @@ from sky.clouds.service_catalog.data_fetchers import fetch_aws from sky.utils import common_utils from sky.utils import resources_utils +from sky.utils import rich_utils +from sky.utils import ux_utils if typing.TYPE_CHECKING: import pandas as pd @@ -82,11 +82,10 @@ def _get_az_mappings(aws_user_hash: str) -> Optional['pd.DataFrame']: az_mappings = None if aws_user_hash != 'default': # Fetch az mapping from AWS. 
- print( - f'\r{colorama.Style.DIM}AWS: Fetching availability zones ' - f'mapping...{colorama.Style.RESET_ALL}', - end='') - az_mappings = fetch_aws.fetch_availability_zone_mappings() + with rich_utils.safe_status( + ux_utils.spinner_message('AWS: Fetching availability ' + 'zones mapping')): + az_mappings = fetch_aws.fetch_availability_zone_mappings() else: return None az_mappings.to_csv(az_mapping_path, index=False) diff --git a/sky/clouds/service_catalog/common.py b/sky/clouds/service_catalog/common.py index 1b5fec9e8e8..4df72824027 100644 --- a/sky/clouds/service_catalog/common.py +++ b/sky/clouds/service_catalog/common.py @@ -198,9 +198,10 @@ def _update_catalog(): if pull_frequency_hours is not None: update_frequency_str = ( f' (every {pull_frequency_hours} hours)') - with rich_utils.safe_status((f'Updating {cloud} catalog: ' - f'{filename}' - f'{update_frequency_str}')): + with rich_utils.safe_status( + ux_utils.spinner_message( + f'Updating {cloud} catalog: {filename}') + + f'{update_frequency_str}'): try: r = requests.get(url) r.raise_for_status() diff --git a/sky/clouds/service_catalog/cudo_catalog.py b/sky/clouds/service_catalog/cudo_catalog.py index a3ccdab88e3..62832cba5bf 100644 --- a/sky/clouds/service_catalog/cudo_catalog.py +++ b/sky/clouds/service_catalog/cudo_catalog.py @@ -14,6 +14,9 @@ _df = common.read_catalog(cudo_mt.VMS_CSV, pull_frequency_hours=_PULL_FREQUENCY_HOURS) +_DEFAULT_NUM_VCPUS = 8 +_DEFAULT_MEMORY_CPU_RATIO = 2 + def instance_type_exists(instance_type: str) -> bool: return common.instance_type_exists_impl(_df, instance_type) @@ -52,7 +55,14 @@ def get_default_instance_type(cpus: Optional[str] = None, del disk_tier # NOTE: After expanding catalog to multiple entries, you may # want to specify a default instance type or family. - return common.get_instance_type_for_cpus_mem_impl(_df, cpus, memory) + if cpus is None and memory is None: + cpus = f'{_DEFAULT_NUM_VCPUS}+' + + memory_gb_or_ratio = memory + if memory is None: + memory_gb_or_ratio = f'{_DEFAULT_MEMORY_CPU_RATIO}x' + return common.get_instance_type_for_cpus_mem_impl(_df, cpus, + memory_gb_or_ratio) def get_accelerators_from_instance_type( diff --git a/sky/core.py b/sky/core.py index 85f81ac6c7a..fa695bda687 100644 --- a/sky/core.py +++ b/sky/core.py @@ -21,6 +21,7 @@ from sky.utils import controller_utils from sky.utils import rich_utils from sky.utils import subprocess_utils +from sky.utils import ux_utils if typing.TYPE_CHECKING: from sky import resources as resources_lib @@ -127,8 +128,9 @@ def endpoints(cluster: str, RuntimeError: if the cluster has no ports to be exposed or no endpoints are exposed yet. 
""" - with rich_utils.safe_status('[bold cyan]Fetching endpoints for cluster ' - f'{cluster}...[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message( + f'Fetching endpoints for cluster {cluster}')): return backend_utils.get_endpoints(cluster=cluster, port=port) diff --git a/sky/data/storage.py b/sky/data/storage.py index 78174ad1ed5..6fbb95a8c56 100644 --- a/sky/data/storage.py +++ b/sky/data/storage.py @@ -1317,8 +1317,8 @@ def get_dir_sync_command(src_dir_path, dest_dir_name): source_message = source_path_list[0] with rich_utils.safe_status( - f'[bold cyan]Syncing ' - f'[green]{source_message}[/] to [green]s3://{self.name}/[/]'): + ux_utils.spinner_message(f'Syncing {source_message} -> ' + f's3://{self.name}/')): data_utils.parallel_upload( source_path_list, get_file_sync_command, @@ -1445,7 +1445,8 @@ def _create_s3_bucket(self, } s3_client.create_bucket(**create_bucket_config) logger.info( - f'Created S3 bucket {bucket_name!r} in {region or "us-east-1"}') + f' {colorama.Style.DIM}Created S3 bucket {bucket_name!r} in ' + f'{region or "us-east-1"}{colorama.Style.RESET_ALL}') # Add AWS tags configured in config.yaml to the bucket. # This is useful for cost tracking and external cleanup. @@ -1486,7 +1487,8 @@ def _delete_s3_bucket(self, bucket_name: str) -> bool: remove_command = f'aws s3 rb s3://{bucket_name} --force' try: with rich_utils.safe_status( - f'[bold cyan]Deleting S3 bucket {bucket_name}[/]'): + ux_utils.spinner_message( + f'Deleting S3 bucket [green]{bucket_name}')): subprocess.check_output(remove_command.split(' '), stderr=subprocess.STDOUT) except subprocess.CalledProcessError as e: @@ -1726,8 +1728,8 @@ def batch_gsutil_cp(self, f'cp -e -n -r -I gs://{self.name}') with rich_utils.safe_status( - f'[bold cyan]Syncing ' - f'[green]{source_message}[/] to [green]gs://{self.name}/[/]'): + ux_utils.spinner_message(f'Syncing {source_message} -> ' + f'gs://{self.name}/')): data_utils.run_upload_cli(sync_command, self._ACCESS_DENIED_MESSAGE, bucket_name=self.name) @@ -1781,8 +1783,8 @@ def get_dir_sync_command(src_dir_path, dest_dir_name): source_message = source_path_list[0] with rich_utils.safe_status( - f'[bold cyan]Syncing ' - f'[green]{source_message}[/] to [green]gs://{self.name}/[/]'): + ux_utils.spinner_message(f'Syncing {source_message} -> ' + f'gs://{self.name}/')): data_utils.parallel_upload( source_path_list, get_file_sync_command, @@ -1904,8 +1906,9 @@ def _create_gcs_bucket(self, f'Attempted to create a bucket {self.name} but failed.' 
) from e logger.info( - f'Created GCS bucket {new_bucket.name} in {new_bucket.location} ' - f'with storage class {new_bucket.storage_class}') + f' {colorama.Style.DIM}Created GCS bucket {new_bucket.name!r} in ' + f'{new_bucket.location} with storage class ' + f'{new_bucket.storage_class}{colorama.Style.RESET_ALL}') return new_bucket def _delete_gcs_bucket(self, bucket_name: str) -> bool: @@ -1919,7 +1922,8 @@ def _delete_gcs_bucket(self, bucket_name: str) -> bool: """ with rich_utils.safe_status( - f'[bold cyan]Deleting GCS bucket {bucket_name}[/]'): + ux_utils.spinner_message( + f'Deleting GCS bucket [green]{bucket_name}')): try: self.client.get_bucket(bucket_name) except gcp.forbidden_exception() as e: @@ -2306,11 +2310,12 @@ def _get_storage_account_and_resource_group( resource_group_name) except azure.exceptions().ResourceNotFoundError: with rich_utils.safe_status( - '[bold cyan]Setting up resource group: ' - f'{resource_group_name}'): + ux_utils.spinner_message( + f'Setting up resource group: ' + f'{resource_group_name}')): self.resource_client.resource_groups.create_or_update( resource_group_name, {'location': self.region}) - logger.info('Created Azure resource group ' + logger.info(' Created Azure resource group ' f'{resource_group_name!r}.') # check if the storage account name already exists under the # given resource group name. @@ -2319,13 +2324,14 @@ def _get_storage_account_and_resource_group( resource_group_name, storage_account_name) except azure.exceptions().ResourceNotFoundError: with rich_utils.safe_status( - '[bold cyan]Setting up storage account: ' - f'{storage_account_name}'): + ux_utils.spinner_message( + f'Setting up storage account: ' + f'{storage_account_name}')): self._create_storage_account(resource_group_name, storage_account_name) # wait until new resource creation propagates to Azure. time.sleep(1) - logger.info('Created Azure storage account ' + logger.info(' Created Azure storage account ' f'{storage_account_name!r}.') return storage_account_name, resource_group_name @@ -2514,9 +2520,9 @@ def get_dir_sync_command(src_dir_path, dest_dir_name) -> str: container_endpoint = data_utils.AZURE_CONTAINER_URL.format( storage_account_name=self.storage_account_name, container_name=self.name) - with rich_utils.safe_status(f'[bold cyan]Syncing ' - f'[green]{source_message}[/] to ' - f'[green]{container_endpoint}/[/]'): + with rich_utils.safe_status( + ux_utils.spinner_message( + f'Syncing {source_message} -> {container_endpoint}/')): data_utils.parallel_upload( source_path_list, get_file_sync_command, @@ -2665,9 +2671,10 @@ def _create_az_bucket(self, container_name: str) -> StorageHandle: self.storage_account_name, container_name, blob_container={}) - logger.info('Created AZ Container ' + logger.info(f' {colorama.Style.DIM}Created AZ Container ' f'{container_name!r} in {self.region!r} under storage ' - f'account {self.storage_account_name!r}.') + f'account {self.storage_account_name!r}.' + f'{colorama.Style.RESET_ALL}') except azure.exceptions().ResourceExistsError as e: if 'container is being deleted' in e.error.message: with ux_utils.print_exception_no_traceback(): @@ -2700,7 +2707,8 @@ def _delete_az_bucket(self, container_name: str) -> bool: """ try: with rich_utils.safe_status( - f'[bold cyan]Deleting Azure container {container_name}[/]'): + ux_utils.spinner_message( + f'Deleting Azure container {container_name}')): # Check for the existance of the container before deletion. 
self.storage_client.blob_containers.get( self.resource_group_name, @@ -2916,8 +2924,8 @@ def get_dir_sync_command(src_dir_path, dest_dir_name): source_message = source_path_list[0] with rich_utils.safe_status( - f'[bold cyan]Syncing ' - f'[green]{source_message}[/] to [green]r2://{self.name}/[/]'): + ux_utils.spinner_message( + f'Syncing {source_message} -> r2://{self.name}/')): data_utils.parallel_upload( source_path_list, get_file_sync_command, @@ -3055,7 +3063,9 @@ def _create_r2_bucket(self, location = {'LocationConstraint': region} r2_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location) - logger.info(f'Created R2 bucket {bucket_name} in {region}') + logger.info(f' {colorama.Style.DIM}Created R2 bucket ' + f'{bucket_name!r} in {region}' + f'{colorama.Style.RESET_ALL}') except aws.botocore_exceptions().ClientError as e: with ux_utils.print_exception_no_traceback(): raise exceptions.StorageBucketCreateError( @@ -3087,7 +3097,8 @@ def _delete_r2_bucket(self, bucket_name: str) -> bool: f'--profile={cloudflare.R2_PROFILE_NAME}') try: with rich_utils.safe_status( - f'[bold cyan]Deleting R2 bucket {bucket_name}[/]'): + ux_utils.spinner_message( + f'Deleting R2 bucket {bucket_name}')): subprocess.check_output(remove_command, stderr=subprocess.STDOUT, shell=True) @@ -3354,9 +3365,8 @@ def get_file_sync_command(base_dir_path, file_names) -> str: source_message = source_path_list[0] with rich_utils.safe_status( - f'[bold cyan]Syncing ' - f'[green]{source_message}[/] to ' - f'[green]cos://{self.region}/{self.name}/[/]'): + ux_utils.spinner_message(f'Syncing {source_message} -> ' + f'cos://{self.region}/{self.name}/')): data_utils.parallel_upload( source_path_list, get_file_sync_command, @@ -3490,8 +3500,10 @@ def _create_cos_bucket(self, CreateBucketConfiguration={ 'LocationConstraint': f'{region}-smart' }) - logger.info(f'Created IBM COS bucket {bucket_name} in {region} ' - f'with storage class smart tier') + logger.info(f' {colorama.Style.DIM}Created IBM COS bucket ' + f'{bucket_name!r} in {region} ' + 'with storage class smart tier' + f'{colorama.Style.RESET_ALL}') self.bucket = self.s3_resource.Bucket(bucket_name) except ibm.ibm_botocore.exceptions.ClientError as e: # type: ignore[union-attr] # pylint: disable=line-too-long diff --git a/sky/data/storage_utils.py b/sky/data/storage_utils.py index 7b5bf48d5db..f4db244d917 100644 --- a/sky/data/storage_utils.py +++ b/sky/data/storage_utils.py @@ -213,9 +213,13 @@ def get_excluded_files(src_dir_path: str) -> List[str]: skyignore_path = os.path.join(expand_src_dir_path, constants.SKY_IGNORE_FILE) if os.path.exists(skyignore_path): - logger.info(f'Exclude files to sync to cluster based on ' - f'{constants.SKY_IGNORE_FILE}.') + logger.info(f' {colorama.Style.DIM}' + f'Excluded files to sync to cluster based on ' + f'{constants.SKY_IGNORE_FILE}.' + f'{colorama.Style.RESET_ALL}') return get_excluded_files_from_skyignore(src_dir_path) - logger.info(f'Exclude files to sync to cluster based on ' - f'{constants.GIT_IGNORE_FILE}.') + logger.info(f' {colorama.Style.DIM}' + f'Excluded files to sync to cluster based on ' + f'{constants.GIT_IGNORE_FILE}.' 
+ f'{colorama.Style.RESET_ALL}') return get_excluded_files_from_gitignore(src_dir_path) diff --git a/sky/exceptions.py b/sky/exceptions.py index 04c50ad4e08..066d36c3cf3 100644 --- a/sky/exceptions.py +++ b/sky/exceptions.py @@ -291,3 +291,8 @@ class PortDoesNotExistError(Exception): class UserRequestRejectedByPolicy(Exception): """Raised when a user request is rejected by an admin policy.""" pass + + +class NoClusterLaunchedError(Exception): + """No cluster launched, so cleanup can be skipped during failover.""" + pass diff --git a/sky/execution.py b/sky/execution.py index 792ca5fffc0..d9a346a99cf 100644 --- a/sky/execution.py +++ b/sky/execution.py @@ -3,7 +3,6 @@ See `Stage` for a Task's life cycle. """ import enum -import os from typing import List, Optional, Tuple, Union import colorama @@ -20,10 +19,8 @@ from sky.utils import admin_policy_utils from sky.utils import controller_utils from sky.utils import dag_utils -from sky.utils import env_options from sky.utils import resources_utils from sky.utils import rich_utils -from sky.utils import subprocess_utils from sky.utils import timeline from sky.utils import ux_utils @@ -293,11 +290,17 @@ def _execute( logger.info('Dryrun finished.') return None, None - if Stage.SYNC_WORKDIR in stages and not dryrun: - if task.workdir is not None: - backend.sync_workdir(handle, task.workdir) + do_workdir = (Stage.SYNC_WORKDIR in stages and not dryrun and + task.workdir is not None) + do_file_mounts = (Stage.SYNC_FILE_MOUNTS in stages and not dryrun and + task.file_mounts is not None) + if do_workdir or do_file_mounts: + logger.info(ux_utils.starting_message('Mounting files.')) - if Stage.SYNC_FILE_MOUNTS in stages and not dryrun: + if do_workdir: + backend.sync_workdir(handle, task.workdir) + + if do_file_mounts: backend.sync_file_mounts(handle, task.file_mounts, task.storage_mounts) @@ -330,23 +333,6 @@ def _execute( backend.teardown_ephemeral_storage(task) backend.teardown(handle, terminate=True) finally: - controller = controller_utils.Controllers.from_name(cluster_name) - if controller is None and not _is_launched_by_sky_serve_controller: - # UX: print live clusters to make users aware (to save costs). - # - # Don't print if this job is launched by the jobs controller, - # because managed jobs are serverless, there can be many of them, - # and users tend to continuously monitor managed jobs using `sky - # job queue`. Also don't print if this job is a skyserve controller - # job or launched by a skyserve controller job, because the - # redirect for this subprocess.run won't success and it will - # pollute the controller logs. - # - # Disable the usage collection for this status command. - env = dict(os.environ, - **{env_options.Options.DISABLE_LOGGING.value: '1'}) - subprocess_utils.run( - 'sky status --no-show-managed-jobs --no-show-services', env=env) print() print('\x1b[?25h', end='') # Show cursor. 
return job_id, handle diff --git a/sky/jobs/core.py b/sky/jobs/core.py index 2cfc2783b4b..6c1ac42d192 100644 --- a/sky/jobs/core.py +++ b/sky/jobs/core.py @@ -79,9 +79,11 @@ def launch( dag_utils.fill_default_config_in_dag_for_job_launch(dag) - for task_ in dag.tasks: - controller_utils.maybe_translate_local_file_mounts_and_sync_up( - task_, path='jobs') + with rich_utils.safe_status( + ux_utils.spinner_message('Initializing managed job')): + for task_ in dag.tasks: + controller_utils.maybe_translate_local_file_mounts_and_sync_up( + task_, path='jobs') with tempfile.NamedTemporaryFile(prefix=f'managed-dag-{dag.name}-', mode='w') as f: @@ -129,7 +131,6 @@ def launch( f'{colorama.Fore.YELLOW}' f'Launching managed job {dag.name!r} from jobs controller...' f'{colorama.Style.RESET_ALL}') - sky_logging.print('Launching jobs controller...') sky.launch(task=controller_task, stream_logs=stream_logs, cluster_name=controller_name, @@ -262,11 +263,12 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]: f'{colorama.Style.RESET_ALL}') rich_utils.force_update_status( - '[cyan] Checking managed jobs - restarting ' - 'controller[/]') + ux_utils.spinner_message('Checking managed jobs - restarting ' + 'controller')) handle = sky.start(jobs_controller_type.value.cluster_name) controller_status = status_lib.ClusterStatus.UP - rich_utils.force_update_status('[cyan] Checking managed jobs[/]') + rich_utils.force_update_status( + ux_utils.spinner_message('Checking managed jobs')) assert handle is not None, (controller_status, refresh) diff --git a/sky/jobs/utils.py b/sky/jobs/utils.py index d46404bd4fd..0d2eed9af9c 100644 --- a/sky/jobs/utils.py +++ b/sky/jobs/utils.py @@ -34,6 +34,7 @@ from sky.utils import log_utils from sky.utils import rich_utils from sky.utils import subprocess_utils +from sky.utils import ux_utils if typing.TYPE_CHECKING: import sky @@ -57,11 +58,13 @@ _LOG_STREAM_CHECK_CONTROLLER_GAP_SECONDS = 5 -_JOB_WAITING_STATUS_MESSAGE = ('[bold cyan]Waiting for the task to start' - '{status_str}.[/] It may take a few minutes.') +_JOB_WAITING_STATUS_MESSAGE = ux_utils.spinner_message( + 'Waiting for task to start[/]' + '{status_str}. It may take a few minutes.\n' + ' [dim]View controller logs: sky jobs logs --controller {job_id}') _JOB_CANCELLED_MESSAGE = ( - '[bold cyan]Waiting for the task status to be updated.' - '[/] It may take a minute.') + ux_utils.spinner_message('Waiting for task status to be updated.') + + ' It may take a minute.') # The maximum time to wait for the managed job status to transition to terminal # state, after the job finished. 
This is a safeguard to avoid the case where @@ -290,8 +293,8 @@ def cancel_job_by_name(job_name: str) -> str: def stream_logs_by_id(job_id: int, follow: bool = True) -> str: """Stream logs by job id.""" controller_status = job_lib.get_status(job_id) - status_msg = ('[bold cyan]Waiting for controller process to be RUNNING' - '{status_str}[/].') + status_msg = ux_utils.spinner_message( + 'Waiting for controller process to be RUNNING') + '{status_str}' status_display = rich_utils.safe_status(status_msg.format(status_str='')) num_tasks = managed_job_state.get_num_tasks(job_id) @@ -310,7 +313,7 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str: time.sleep(_LOG_STREAM_CHECK_CONTROLLER_GAP_SECONDS) controller_status = job_lib.get_status(job_id) - msg = _JOB_WAITING_STATUS_MESSAGE.format(status_str='') + msg = _JOB_WAITING_STATUS_MESSAGE.format(status_str='', job_id=job_id) status_display.update(msg) prev_msg = msg managed_job_status = managed_job_state.get_status(job_id) @@ -356,7 +359,8 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str: logger.debug( f'INFO: The log is not ready yet{status_str}. ' f'Waiting for {JOB_STATUS_CHECK_GAP_SECONDS} seconds.') - msg = _JOB_WAITING_STATUS_MESSAGE.format(status_str=status_str) + msg = _JOB_WAITING_STATUS_MESSAGE.format(status_str=status_str, + job_id=job_id) if msg != prev_msg: status_display.update(msg) prev_msg = msg @@ -444,8 +448,9 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str: managed_job_status = managed_job_state.get_status(job_id) assert managed_job_status is not None, job_id - logger.info(f'Logs finished for job {job_id} ' - f'(status: {managed_job_status.value}).') + logger.info( + ux_utils.finishing_message(f'Managed job finished: {job_id} ' + f'(status: {managed_job_status.value}).')) return '' diff --git a/sky/optimizer.py b/sky/optimizer.py index 4326329579d..bcab7836ee9 100644 --- a/sky/optimizer.py +++ b/sky/optimizer.py @@ -123,22 +123,23 @@ def optimize(dag: 'dag_lib.Dag', for a task. exceptions.NoCloudAccessError: if no public clouds are enabled. """ - _check_specified_clouds(dag) - - # This function is effectful: mutates every node in 'dag' by setting - # node.best_resources if it is None. - Optimizer._add_dummy_source_sink_nodes(dag) - try: - unused_best_plan = Optimizer._optimize_dag( - dag=dag, - minimize_cost=minimize == OptimizeTarget.COST, - blocked_resources=blocked_resources, - quiet=quiet) - finally: - # Make sure to remove the dummy source/sink nodes, even if the - # optimization fails. - Optimizer._remove_dummy_source_sink_nodes(dag) - return dag + with rich_utils.safe_status(ux_utils.spinner_message('Optimizing')): + _check_specified_clouds(dag) + + # This function is effectful: mutates every node in 'dag' by setting + # node.best_resources if it is None. + Optimizer._add_dummy_source_sink_nodes(dag) + try: + unused_best_plan = Optimizer._optimize_dag( + dag=dag, + minimize_cost=minimize == OptimizeTarget.COST, + blocked_resources=blocked_resources, + quiet=quiet) + finally: + # Make sure to remove the dummy source/sink nodes, even if the + # optimization fails. 
+ Optimizer._remove_dummy_source_sink_nodes(dag) + return dag @staticmethod def _add_dummy_source_sink_nodes(dag: 'dag_lib.Dag'): @@ -259,6 +260,9 @@ def get_available_reservations( launchable_resources: Dict[resources_lib.Resources, List[resources_lib.Resources]] ) -> Dict[resources_lib.Resources, int]: + if not resources_utils.need_to_query_reservations(): + return {} + num_available_reserved_nodes_per_resource = {} def get_reservations_available_resources( @@ -269,7 +273,7 @@ def get_reservations_available_resources( launchable_resources_list: List[resources_lib.Resources] = sum( launchable_resources.values(), []) with rich_utils.safe_status( - '[cyan]Checking reserved resources...[/]'): + ux_utils.spinner_message('Checking reserved resources')): subprocess_utils.run_in_parallel( get_reservations_available_resources, launchable_resources_list) @@ -337,8 +341,8 @@ def get_reservations_available_resources( if minimize_cost: cost_per_node = resources.get_cost(estimated_runtime) num_available_reserved_nodes = ( - num_available_reserved_nodes_per_resource[resources] - ) + num_available_reserved_nodes_per_resource.get( + resources, 0)) # We consider the cost of the unused reservation # resources to be 0 since we are already paying for @@ -384,10 +388,14 @@ def get_reservations_available_resources( fuzzy_candidates_str = ( f'\nTry one of these offered accelerators: {cyan}' f'{fuzzy_candidates}{reset}') + node_resources_reprs = ', '.join(f'{node.num_nodes}x ' + + r.repr_with_region_zone + for r in node.resources) error_msg = ( f'{source_hint.capitalize()} does not contain any ' - f'instances satisfying the request:\n{node}.' - f'\n\nTo fix: relax or change the ' + f'instances satisfying the request: ' + f'{node_resources_reprs}.' + f'\nTo fix: relax or change the ' f'resource requirements.{fuzzy_candidates_str}\n\n' f'Hint: {bold}sky show-gpus{reset} ' 'to list available accelerators.\n' @@ -716,7 +724,6 @@ def print_optimized_plan( node_to_cost_map: _TaskToCostMap, minimize_cost: bool, ): - logger.info('== Optimizer ==') ordered_node_to_cost_map = collections.OrderedDict() ordered_best_plan = collections.OrderedDict() for node in topo_order: @@ -738,15 +745,18 @@ def print_optimized_plan( node.get_inputs() is None and node.get_outputs() is None): print_hourly_cost = True - if print_hourly_cost: - logger.info(f'{colorama.Style.BRIGHT}Estimated cost: ' - f'{colorama.Style.RESET_ALL}${total_cost:.1f} / hour\n') - else: - logger.info(f'{colorama.Style.BRIGHT}Estimated total runtime: ' - f'{colorama.Style.RESET_ALL}{total_time / 3600:.1f} ' - 'hours\n' - f'{colorama.Style.BRIGHT}Estimated total cost: ' - f'{colorama.Style.RESET_ALL}${total_cost:.1f}\n') + if not env_options.Options.MINIMIZE_LOGGING.get(): + if print_hourly_cost: + logger.info( + f'{colorama.Style.BRIGHT}Estimated cost: ' + f'{colorama.Style.RESET_ALL}${total_cost:.1f} / hour\n') + else: + logger.info( + f'{colorama.Style.BRIGHT}Estimated total runtime: ' + f'{colorama.Style.RESET_ALL}{total_time / 3600:.1f} ' + 'hours\n' + f'{colorama.Style.BRIGHT}Estimated total cost: ' + f'{colorama.Style.RESET_ALL}${total_cost:.1f}\n') def _get_resources_element_list( resources: 'resources_lib.Resources') -> List[str]: @@ -845,7 +855,7 @@ def _get_resource_group_hash(resources: 'resources_lib.Resources'): best_plan_table = _create_table(['TASK', '#NODES'] + resource_fields) best_plan_table.add_rows(best_plan_rows) - logger.info(f'{best_plan_table}\n') + logger.info(f'{best_plan_table}') # Print the egress plan if any data egress is scheduled. 
Optimizer._print_egress_plan(graph, best_plan, minimize_cost) @@ -864,6 +874,10 @@ def _get_resource_group_hash(resources: 'resources_lib.Resources'): } task_str = (f'for task {task.name!r} ' if num_tasks > 1 else '') plural = 's' if task.num_nodes > 1 else '' + if num_tasks > 1: + # Add a new line for better readability, when there are multiple + # tasks. + logger.info('') logger.info( f'{colorama.Style.BRIGHT}Considered resources {task_str}' f'({task.num_nodes} node{plural}):' @@ -934,7 +948,7 @@ def sort_key(row, accelerator_spot_list=accelerator_spot_list): table = _create_table(field_names) table.add_rows(rows) - logger.info(f'{table}\n') + logger.info(f'{table}') # Warning message for using disk_tier=ultra # TODO(yi): Consider price of disks in optimizer and @@ -965,10 +979,10 @@ def _print_candidates(node_to_candidate_map: _TaskToPerCloudCandidates): f'Multiple {cloud} instances satisfy ' f'{acc_name}:{int(acc_count)}. ' f'The cheapest {candidate_list[0]!r} is considered ' - f'among:\n{instance_list}.\n') + f'among:\n{instance_list}.') if is_multi_instances: logger.info( - f'To list more details, run \'sky show-gpus {acc_name}\'.') + f'To list more details, run: sky show-gpus {acc_name}\n') @staticmethod def _optimize_dag( @@ -1101,8 +1115,7 @@ def ordinal_number(n): Optimizer.print_optimized_plan(graph, topo_order, best_plan, total_time, total_cost, node_to_cost_map, minimize_cost) - if not env_options.Options.MINIMIZE_LOGGING.get(): - Optimizer._print_candidates(local_node_to_candidate_map) + Optimizer._print_candidates(local_node_to_candidate_map) return best_plan diff --git a/sky/provision/aws/config.py b/sky/provision/aws/config.py index c83732d60c4..d61e72ae7ae 100644 --- a/sky/provision/aws/config.py +++ b/sky/provision/aws/config.py @@ -16,10 +16,12 @@ import colorama +from sky import exceptions from sky import sky_logging from sky.adaptors import aws from sky.provision import common from sky.provision.aws import utils +from sky.utils import common_utils logger = sky_logging.init_logger(__name__) @@ -535,12 +537,19 @@ def _get_or_create_vpc_security_group(ec2, vpc_id: str, if vpc_id in vpc_to_existing_sg: return vpc_to_existing_sg[vpc_id] - # create a new security group - ec2.meta.client.create_security_group( - Description='Auto-created security group for Ray workers', - GroupName=expected_sg_name, - VpcId=vpc_id, - ) + try: + # create a new security group + ec2.meta.client.create_security_group( + Description='Auto-created security group for Ray workers', + GroupName=expected_sg_name, + VpcId=vpc_id, + ) + except ec2.meta.client.exceptions.ClientError as e: + message = ('Failed to create security group. 
Error: ' + f'{common_utils.format_exception(e)}') + logger.warning(message) + raise exceptions.NoClusterLaunchedError(message) from e + security_group = _get_security_groups_from_vpc_ids(ec2, [vpc_id], [expected_sg_name]) diff --git a/sky/provision/azure/config.py b/sky/provision/azure/config.py index 7b50c3d8c0f..b3cb357512a 100644 --- a/sky/provision/azure/config.py +++ b/sky/provision/azure/config.py @@ -5,16 +5,18 @@ """ import hashlib import json -import logging from pathlib import Path import random import time from typing import Any, Callable +from sky import exceptions +from sky import sky_logging from sky.adaptors import azure from sky.provision import common +from sky.utils import common_utils -logger = logging.getLogger(__name__) +logger = sky_logging.init_logger(__name__) UNIQUE_ID_LEN = 4 _DEPLOYMENT_NAME = 'skypilot-config' @@ -92,10 +94,19 @@ def bootstrap_instances( retry += 1 continue raise + except azure.exceptions().ClientAuthenticationError as e: + message = ( + 'Failed to authenticate with Azure. Please check your Azure ' + f'credentials. Error: {common_utils.format_exception(e)}' + ).replace('\n', ' ') + logger.error(message) + raise exceptions.NoClusterLaunchedError(message) from e else: - raise TimeoutError( + message = ( f'Timed out waiting for resource group {resource_group} to be ' 'deleted.') + logger.error(message) + raise TimeoutError(message) # load the template file current_path = Path(__file__).parent diff --git a/sky/provision/azure/instance.py b/sky/provision/azure/instance.py index 009fb889848..3c5ed8801a4 100644 --- a/sky/provision/azure/instance.py +++ b/sky/provision/azure/instance.py @@ -441,15 +441,21 @@ def _create_instance_tag(target_instance, is_head: bool = True) -> str: if to_start_count > 0: resource_client = azure.get_client('resource', subscription_id) logger.debug(f'run_instances: Creating {to_start_count} instances.') - created_instances = _create_instances( - compute_client=compute_client, - resource_client=resource_client, - cluster_name_on_cloud=cluster_name_on_cloud, - resource_group=resource_group, - provider_config=provider_config, - node_config=config.node_config, - tags=tags, - count=to_start_count) + try: + created_instances = _create_instances( + compute_client=compute_client, + resource_client=resource_client, + cluster_name_on_cloud=cluster_name_on_cloud, + resource_group=resource_group, + provider_config=provider_config, + node_config=config.node_config, + tags=tags, + count=to_start_count) + except Exception as e: + err_message = common_utils.format_exception( + e, use_bracket=True).replace('\n', ' ') + logger.error(f'Failed to create instances: {err_message}') + raise created_instance_ids = [inst.name for inst in created_instances] non_running_instance_statuses = list( diff --git a/sky/provision/kubernetes/instance.py b/sky/provision/kubernetes/instance.py index 8da13d5ad0f..6663ed3f657 100644 --- a/sky/provision/kubernetes/instance.py +++ b/sky/provision/kubernetes/instance.py @@ -632,7 +632,9 @@ def run_instances(region: str, cluster_name_on_cloud: str, try: return _create_pods(region, cluster_name_on_cloud, config) except (kubernetes.api_exception(), config_lib.KubernetesError) as e: - logger.warning(f'run_instances: Error occurred when creating pods: {e}') + e_msg = common_utils.format_exception(e).replace('\n', ' ') + logger.warning('run_instances: Error occurred when creating pods: ' + f'{e_msg}') raise diff --git a/sky/provision/provisioner.py b/sky/provision/provisioner.py index 0c188599ae6..b2ac6d6660f 100644 --- 
a/sky/provision/provisioner.py +++ b/sky/provision/provisioner.py @@ -14,6 +14,7 @@ import sky from sky import clouds +from sky import exceptions from sky import provision from sky import sky_logging from sky import status_lib @@ -42,76 +43,50 @@ def _bulk_provision( cloud: clouds.Cloud, region: clouds.Region, - zones: Optional[List[clouds.Zone]], cluster_name: resources_utils.ClusterName, bootstrap_config: provision_common.ProvisionConfig, ) -> provision_common.ProvisionRecord: provider_name = repr(cloud) region_name = region.name - style = colorama.Style - - if not zones: - # For Azure, zones is always an empty list. - zone_str = 'all zones' - else: - zone_str = ','.join(z.name for z in zones) - - if isinstance(cloud, clouds.Kubernetes): - # Omit the region name for Kubernetes. - logger.info(f'{style.BRIGHT}Launching on {cloud}{style.RESET_ALL} ' - f'{cluster_name!r}.') - else: - logger.info(f'{style.BRIGHT}Launching on {cloud} ' - f'{region_name}{style.RESET_ALL} ({zone_str})') - start = time.time() - with rich_utils.safe_status('[bold cyan]Launching[/]') as status: + # TODO(suquark): Should we cache the bootstrapped result? + # Currently it is not necessary as bootstrapping takes + # only ~3s, caching it seems over-engineering and could + # cause other issues like the cache is not synced + # with the cloud configuration. + config = provision.bootstrap_instances(provider_name, region_name, + cluster_name.name_on_cloud, + bootstrap_config) + + provision_record = provision.run_instances(provider_name, + region_name, + cluster_name.name_on_cloud, + config=config) + + backoff = common_utils.Backoff(initial_backoff=1, max_backoff_factor=3) + logger.debug(f'\nWaiting for instances of {cluster_name!r} to be ready...') + rich_utils.force_update_status( + ux_utils.spinner_message('Launching - Checking instance status', + str(provision_logging.config.log_path))) + # AWS would take a very short time (<<1s) updating the state of the + # instance. + time.sleep(1) + for retry_cnt in range(_MAX_RETRY): try: - # TODO(suquark): Should we cache the bootstrapped result? - # Currently it is not necessary as bootstrapping takes - # only ~3s, caching it seems over-engineering and could - # cause other issues like the cache is not synced - # with the cloud configuration. - config = provision.bootstrap_instances(provider_name, region_name, - cluster_name.name_on_cloud, - bootstrap_config) - except Exception as e: - logger.error(f'{colorama.Fore.YELLOW}Failed to configure ' - f'{cluster_name!r} on {cloud} {region} ({zone_str}) ' - 'with the following error:' - f'{colorama.Style.RESET_ALL}\n' - f'{common_utils.format_exception(e)}') - raise - - provision_record = provision.run_instances(provider_name, - region_name, - cluster_name.name_on_cloud, - config=config) - - backoff = common_utils.Backoff(initial_backoff=1, max_backoff_factor=3) - logger.debug( - f'\nWaiting for instances of {cluster_name!r} to be ready...') - status.update('[bold cyan]Launching - Checking instance status[/]') - # AWS would take a very short time (<<1s) updating the state of the - # instance. 
- time.sleep(1) - for retry_cnt in range(_MAX_RETRY): - try: - provision.wait_instances(provider_name, - region_name, - cluster_name.name_on_cloud, - state=status_lib.ClusterStatus.UP) - break - except (aws.botocore_exceptions().WaiterError, RuntimeError): - time.sleep(backoff.current_backoff()) - else: - raise RuntimeError( - f'Failed to wait for instances of {cluster_name!r} to be ' - f'ready on the cloud provider after max retries {_MAX_RETRY}.') - logger.debug( - f'Instances of {cluster_name!r} are ready after {retry_cnt} ' - 'retries.') + provision.wait_instances(provider_name, + region_name, + cluster_name.name_on_cloud, + state=status_lib.ClusterStatus.UP) + break + except (aws.botocore_exceptions().WaiterError, RuntimeError): + time.sleep(backoff.current_backoff()) + else: + raise RuntimeError( + f'Failed to wait for instances of {cluster_name!r} to be ' + f'ready on the cloud provider after max retries {_MAX_RETRY}.') + logger.debug(f'Instances of {cluster_name!r} are ready after {retry_cnt} ' + 'retries.') logger.debug( f'\nProvisioning {cluster_name!r} took {time.time() - start:.2f} ' @@ -162,8 +137,11 @@ def bulk_provision( logger.debug( 'Provision config:\n' f'{json.dumps(dataclasses.asdict(bootstrap_config), indent=2)}') - return _bulk_provision(cloud, region, zones, cluster_name, + return _bulk_provision(cloud, region, cluster_name, bootstrap_config) + except exceptions.NoClusterLaunchedError: + # Skip the teardown if the cluster was never launched. + raise except Exception: # pylint: disable=broad-except zone_str = 'all zones' if zones: @@ -440,23 +418,30 @@ def _post_provision_setup( # We don't set docker_user here, as we are configuring the VM itself. ssh_credentials = backend_utils.ssh_credential_from_yaml( cluster_yaml, ssh_user=cluster_info.ssh_user) + docker_config = config_from_yaml.get('docker', {}) with rich_utils.safe_status( - '[bold cyan]Launching - Waiting for SSH access[/]') as status: + ux_utils.spinner_message( + 'Launching - Waiting for SSH access', + provision_logging.config.log_path)) as status: logger.debug( f'\nWaiting for SSH to be available for {cluster_name!r} ...') wait_for_ssh(cluster_info, ssh_credentials) - logger.debug(f'SSH Conection ready for {cluster_name!r}') + logger.debug(f'SSH Connection ready for {cluster_name!r}') + vm_str = 'Instance' if cloud_name.lower() != 'kubernetes' else 'Pod' plural = '' if len(cluster_info.instances) == 1 else 's' - logger.info(f'{colorama.Fore.GREEN}Successfully provisioned ' - f'or found existing instance{plural}.' - f'{colorama.Style.RESET_ALL}') + verb = 'is' if len(cluster_info.instances) == 1 else 'are' + indent_str = (ux_utils.INDENT_SYMBOL + if docker_config else ux_utils.INDENT_LAST_SYMBOL) + logger.info(f'{indent_str}{colorama.Style.DIM}{vm_str}{plural} {verb} ' + f'up.{colorama.Style.RESET_ALL}') - docker_config = config_from_yaml.get('docker', {}) if docker_config: status.update( - '[bold cyan]Launching - Initializing docker container[/]') + ux_utils.spinner_message( + 'Launching - Initializing docker container', + provision_logging.config.log_path)) docker_user = instance_setup.initialize_docker( cluster_name.name_on_cloud, docker_config=docker_config, @@ -470,6 +455,8 @@ def _post_provision_setup( cluster_info.docker_user = docker_user ssh_credentials['docker_user'] = docker_user logger.debug(f'Docker user: {docker_user}') + logger.info(f'{ux_utils.INDENT_LAST_SYMBOL}{colorama.Style.DIM}' + f'Docker container is up.{colorama.Style.RESET_ALL}') # We mount the metadata with sky wheel for speedup. 
# NOTE: currently we mount all credentials for all nodes, because @@ -482,8 +469,9 @@ def _post_provision_setup( # for later. file_mounts = config_from_yaml.get('file_mounts', {}) - runtime_preparation_str = ('[bold cyan]Preparing SkyPilot ' - 'runtime ({step}/3 - {step_name})') + runtime_preparation_str = (ux_utils.spinner_message( + 'Preparing SkyPilot runtime ({step}/3 - {step_name})', + provision_logging.config.log_path)) status.update( runtime_preparation_str.format(step=1, step_name='initializing')) instance_setup.internal_file_mounts(cluster_name.name_on_cloud, @@ -551,8 +539,9 @@ def _post_provision_setup( instance_setup.start_skylet_on_head_node(cluster_name.name_on_cloud, cluster_info, ssh_credentials) - logger.info(f'{colorama.Fore.GREEN}Successfully provisioned cluster: ' - f'{cluster_name}{colorama.Style.RESET_ALL}') + logger.info( + ux_utils.finishing_message(f'Cluster launched: {cluster_name}.', + provision_logging.config.log_path)) return cluster_info diff --git a/sky/serve/core.py b/sky/serve/core.py index 2bb6e1384ee..3ad260213f1 100644 --- a/sky/serve/core.py +++ b/sky/serve/core.py @@ -129,8 +129,10 @@ def up( task, use_mutated_config_in_current_request=False) task = dag.tasks[0] - controller_utils.maybe_translate_local_file_mounts_and_sync_up(task, - path='serve') + with rich_utils.safe_status( + ux_utils.spinner_message('Initializing service')): + controller_utils.maybe_translate_local_file_mounts_and_sync_up( + task, path='serve') with tempfile.NamedTemporaryFile( prefix=f'service-task-{service_name}-', @@ -215,7 +217,8 @@ def up( # TODO(tian): Cache endpoint locally to speedup. Endpoint won't # change after the first time, so there is no consistency issue. with rich_utils.safe_status( - '[cyan]Waiting for the service to register[/]'): + ux_utils.spinner_message( + 'Waiting for the service to register')): # This function will check the controller job id in the database # and return the endpoint if the job id matches. Otherwise it will # return None. @@ -274,34 +277,31 @@ def up( f'{style.BRIGHT}{service_name}{style.RESET_ALL}' f'\n{fore.CYAN}Endpoint URL: ' f'{style.BRIGHT}{endpoint}{style.RESET_ALL}' - '\nTo see detailed info:\t\t' - f'{backend_utils.BOLD}sky serve status {service_name} ' - f'[--endpoint]{backend_utils.RESET_BOLD}' - '\nTo teardown the service:\t' - f'{backend_utils.BOLD}sky serve down {service_name}' - f'{backend_utils.RESET_BOLD}' - '\n' - '\nTo see logs of a replica:\t' - f'{backend_utils.BOLD}sky serve logs {service_name} [REPLICA_ID]' - f'{backend_utils.RESET_BOLD}' - '\nTo see logs of load balancer:\t' - f'{backend_utils.BOLD}sky serve logs --load-balancer {service_name}' - f'{backend_utils.RESET_BOLD}' - '\nTo see logs of controller:\t' - f'{backend_utils.BOLD}sky serve logs --controller {service_name}' - f'{backend_utils.RESET_BOLD}' - '\n' - '\nTo monitor replica status:\t' - f'{backend_utils.BOLD}watch -n10 sky serve status {service_name}' - f'{backend_utils.RESET_BOLD}' - '\nTo send a test request:\t\t' - f'{backend_utils.BOLD}curl {endpoint}' - f'{backend_utils.RESET_BOLD}' - '\n' - f'\n{fore.GREEN}SkyServe is spinning up your service now.' 
- f'{style.RESET_ALL}' - f'\n{fore.GREEN}The replicas should be ready within a ' - f'short time.{style.RESET_ALL}') + f'\nπŸ“‹ Useful Commands' + f'\n{ux_utils.INDENT_SYMBOL}To check service status:\t' + f'{ux_utils.BOLD}sky serve status {service_name} ' + f'[--endpoint]{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To teardown the service:\t' + f'{ux_utils.BOLD}sky serve down {service_name}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To see replica logs:\t' + f'{ux_utils.BOLD}sky serve logs {service_name} [REPLICA_ID]' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To see load balancer logs:\t' + f'{ux_utils.BOLD}sky serve logs --load-balancer {service_name}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To see controller logs:\t' + f'{ux_utils.BOLD}sky serve logs --controller {service_name}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_SYMBOL}To monitor the status:\t' + f'{ux_utils.BOLD}watch -n10 sky serve status {service_name}' + f'{ux_utils.RESET_BOLD}' + f'\n{ux_utils.INDENT_LAST_SYMBOL}To send a test request:\t' + f'{ux_utils.BOLD}curl {endpoint}' + f'{ux_utils.RESET_BOLD}' + '\n\n' + + ux_utils.finishing_message('Service is spinning up and replicas ' + 'will be ready shortly.')) return service_name, endpoint @@ -323,11 +323,11 @@ def update( controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, stopped_message= 'Service controller is stopped. There is no service to update. ' - f'To spin up a new service, use {backend_utils.BOLD}' - f'sky serve up{backend_utils.RESET_BOLD}', + f'To spin up a new service, use {ux_utils.BOLD}' + f'sky serve up{ux_utils.RESET_BOLD}', non_existent_message='Service does not exist. ' 'To spin up a new service, ' - f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}', + f'use {ux_utils.BOLD}sky serve up{ux_utils.RESET_BOLD}', ) backend = backend_utils.get_backend_from_handle(handle) @@ -353,8 +353,8 @@ def update( if len(service_statuses) == 0: with ux_utils.print_exception_no_traceback(): raise RuntimeError(f'Cannot find service {service_name!r}.' - f'To spin up a service, use {backend_utils.BOLD}' - f'sky serve up{backend_utils.RESET_BOLD}') + f'To spin up a service, use {ux_utils.BOLD}' + f'sky serve up{ux_utils.RESET_BOLD}') if len(service_statuses) > 1: with ux_utils.print_exception_no_traceback(): @@ -374,8 +374,10 @@ def update( with ux_utils.print_exception_no_traceback(): raise RuntimeError(prompt) - controller_utils.maybe_translate_local_file_mounts_and_sync_up(task, - path='serve') + with rich_utils.safe_status( + ux_utils.spinner_message('Initializing service')): + controller_utils.maybe_translate_local_file_mounts_and_sync_up( + task, path='serve') code = serve_utils.ServeCodeGen.add_version(service_name) returncode, version_string_payload, stderr = backend.run_on_head( @@ -433,8 +435,8 @@ def update( print(f'{colorama.Fore.GREEN}Service {service_name!r} update scheduled.' 
f'{colorama.Style.RESET_ALL}\n' - f'Please use {backend_utils.BOLD}sky serve status {service_name} ' - f'{backend_utils.RESET_BOLD}to check the latest status.') + f'Please use {ux_utils.BOLD}sky serve status {service_name} ' + f'{ux_utils.RESET_BOLD}to check the latest status.') @usage_lib.entrypoint diff --git a/sky/sky_logging.py b/sky/sky_logging.py index c8a243c72cf..75dc836a49e 100644 --- a/sky/sky_logging.py +++ b/sky/sky_logging.py @@ -10,10 +10,10 @@ from sky.utils import env_options from sky.utils import rich_utils -# If the SKYPILOT_MINIMIZE_LOGGING environment variable is set to True, -# remove logging prefixes and unnecessary information in optimizer -_FORMAT = (None if env_options.Options.MINIMIZE_LOGGING.get() else - '%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s') +# UX: Should we show logging prefixes and some extra information in optimizer? +_show_logging_prefix = (env_options.Options.SHOW_DEBUG_INFO.get() or + not env_options.Options.MINIMIZE_LOGGING.get()) +_FORMAT = '%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s' _DATE_FORMAT = '%m-%d %H:%M:%S' @@ -45,6 +45,7 @@ def emit(self, record: logging.LogRecord) -> None: _default_handler = None _logging_config = threading.local() +NO_PREFIX_FORMATTER = NewLineFormatter(None, datefmt=_DATE_FORMAT) FORMATTER = NewLineFormatter(_FORMAT, datefmt=_DATE_FORMAT) DIM_FORMATTER = NewLineFormatter(_FORMAT, datefmt=_DATE_FORMAT, dim=True) @@ -67,7 +68,10 @@ def _setup_logger(): else: _default_handler.setLevel(logging.INFO) _root_logger.addHandler(_default_handler) - _default_handler.setFormatter(FORMATTER) + if _show_logging_prefix: + _default_handler.setFormatter(FORMATTER) + else: + _default_handler.setFormatter(NO_PREFIX_FORMATTER) # Setting this will avoid the message # being propagated to the parent logger. _root_logger.propagate = False diff --git a/sky/skylet/log_lib.py b/sky/skylet/log_lib.py index 9615e5af27f..9f1483b2b48 100644 --- a/sky/skylet/log_lib.py +++ b/sky/skylet/log_lib.py @@ -21,6 +21,7 @@ from sky.skylet import job_lib from sky.utils import log_utils from sky.utils import subprocess_utils +from sky.utils import ux_utils _SKY_LOG_WAITING_GAP_SECONDS = 1 _SKY_LOG_WAITING_MAX_RETRY = 5 @@ -377,7 +378,9 @@ def _follow_job_logs(file, wait_last_logs = False continue status_str = status.value if status is not None else 'None' - print(f'INFO: Job finished (status: {status_str}).') + print( + ux_utils.finishing_message( + f'Job finished (status: {status_str}).')) return time.sleep(_SKY_LOG_TAILING_GAP_SECONDS) @@ -412,8 +415,6 @@ def tail_logs(job_id: Optional[int], return logger.debug(f'Tailing logs for job, real job_id {job_id}, managed_job_id ' f'{managed_job_id}.') - logger.info(f'{colorama.Fore.YELLOW}Start streaming logs for {job_str}.' 
- f'{colorama.Style.RESET_ALL}') log_path = os.path.join(log_dir, 'run.log') log_path = os.path.expanduser(log_path) @@ -437,7 +438,7 @@ def tail_logs(job_id: Optional[int], time.sleep(_SKY_LOG_WAITING_GAP_SECONDS) status = job_lib.update_job_status([job_id], silent=True)[0] - start_stream_at = 'INFO: Tip: use Ctrl-C to exit log' + start_stream_at = 'Waiting for task resources on ' if follow and status in [ job_lib.JobStatus.SETTING_UP, job_lib.JobStatus.PENDING, diff --git a/sky/skylet/providers/lambda_cloud/node_provider.py b/sky/skylet/providers/lambda_cloud/node_provider.py index bb8d40da62e..557afe75568 100644 --- a/sky/skylet/providers/lambda_cloud/node_provider.py +++ b/sky/skylet/providers/lambda_cloud/node_provider.py @@ -25,7 +25,7 @@ _REMOTE_SSH_KEY_NAME = '~/.lambda_cloud/ssh_key_name' _REMOTE_RAY_SSH_KEY = '~/ray_bootstrap_key.pem' _REMOTE_RAY_YAML = '~/ray_bootstrap_config.yaml' -_GET_INTERNAL_IP_CMD = 'ip -4 -br addr show | grep UP | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"' +_GET_INTERNAL_IP_CMD = 's=$(ip -4 -br addr show | grep UP); echo "$s"; echo "$s" | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1])|104\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"' logger = logging.getLogger(__name__) diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index c94970ce764..be6e8346e3d 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -171,7 +171,7 @@ def _get_command_to_run( cmd: Union[str, List[str]], process_stream: bool, separate_stderr: bool, - skip_lines: int, + skip_num_lines: int, source_bashrc: bool = False, ) -> str: """Returns the command to run.""" @@ -203,12 +203,12 @@ def _get_command_to_run( ] if not separate_stderr: command.append('2>&1') - if not process_stream and skip_lines: + if not process_stream and skip_num_lines: command += [ # A hack to remove the following bash warnings (twice): # bash: cannot set terminal process group # bash: no job control in this shell - f'| stdbuf -o0 tail -n +{skip_lines}', + f'| stdbuf -o0 tail -n +{skip_num_lines}', # This is required to make sure the executor of command can get # correct returncode, since linux pipe is used. '; exit ${PIPESTATUS[0]}' @@ -320,7 +320,7 @@ def run( separate_stderr: bool = False, connect_timeout: Optional[int] = None, source_bashrc: bool = False, - skip_lines: int = 0, + skip_num_lines: int = 0, **kwargs) -> Union[int, Tuple[int, str, str]]: """Runs the command on the cluster. @@ -335,7 +335,7 @@ def run( connect_timeout: timeout in seconds for the ssh connection. source_bashrc: Whether to source the ~/.bashrc before running the command. - skip_lines: The number of lines to skip at the beginning of the + skip_num_lines: The number of lines to skip at the beginning of the output. This is used when the output is not processed by SkyPilot but we still want to get rid of some warning messages, such as SSH warnings. @@ -529,7 +529,7 @@ def run( separate_stderr: bool = False, connect_timeout: Optional[int] = None, source_bashrc: bool = False, - skip_lines: int = 0, + skip_num_lines: int = 0, **kwargs) -> Union[int, Tuple[int, str, str]]: """Uses 'ssh' to run 'cmd' on a node with ip. @@ -550,7 +550,7 @@ def run( connect_timeout: timeout in seconds for the ssh connection. 
source_bashrc: Whether to source the bashrc before running the
                command.
-            skip_lines: The number of lines to skip at the beginning of the
+            skip_num_lines: The number of lines to skip at the beginning of the
                 output. This is used when the output is not processed by
                 SkyPilot but we still want to get rid of some warning messages,
                 such as SSH warnings.
@@ -573,7 +573,7 @@ def run(
 
         command_str = self._get_command_to_run(cmd,
                                                process_stream,
                                                separate_stderr,
-                                               skip_lines=skip_lines,
+                                               skip_num_lines=skip_num_lines,
                                                source_bashrc=source_bashrc)
         command = base_ssh_command + [shlex.quote(command_str)]
 
@@ -693,7 +693,7 @@ def run(
             separate_stderr: bool = False,
             connect_timeout: Optional[int] = None,
             source_bashrc: bool = False,
-            skip_lines: int = 0,
+            skip_num_lines: int = 0,
             **kwargs) -> Union[int, Tuple[int, str, str]]:
         """Uses 'kubectl exec' to run 'cmd' on a pod by its name and namespace.
 
@@ -713,7 +713,7 @@ def run(
             connect_timeout: timeout in seconds for the pod connection.
             source_bashrc: Whether to source the bashrc before running the
                 command.
-            skip_lines: The number of lines to skip at the beginning of the
+            skip_num_lines: The number of lines to skip at the beginning of the
                 output. This is used when the output is not processed by
                 SkyPilot but we still want to get rid of some warning messages,
                 such as SSH warnings.
@@ -751,7 +751,7 @@ def run(
 
         command_str = self._get_command_to_run(cmd,
                                                process_stream,
                                                separate_stderr,
-                                               skip_lines=skip_lines,
+                                               skip_num_lines=skip_num_lines,
                                                source_bashrc=source_bashrc)
         command = kubectl_base_command + [
             # It is important to use /bin/bash -c here to make sure we quote the
diff --git a/sky/utils/common_utils.py b/sky/utils/common_utils.py
index 4a8e6aa37d6..6383ee8af0d 100644
--- a/sky/utils/common_utils.py
+++ b/sky/utils/common_utils.py
@@ -16,7 +16,6 @@
 from typing import Any, Callable, Dict, List, Optional, Union
 import uuid
 
-import colorama
 import jinja2
 import jsonschema
 import yaml
@@ -479,11 +478,9 @@ def format_exception(e: Union[Exception, SystemExit, KeyboardInterrupt],
     Returns:
         A string that represents the exception.
     """
-    bright = colorama.Style.BRIGHT
-    reset = colorama.Style.RESET_ALL
     if use_bracket:
-        return f'{bright}[{class_fullname(e.__class__)}]{reset} {e}'
-    return f'{bright}{class_fullname(e.__class__)}:{reset} {e}'
+        return f'[{class_fullname(e.__class__)}] {e}'
+    return f'{class_fullname(e.__class__)}: {e}'
 
 
 def remove_color(s: str):
diff --git a/sky/utils/controller_utils.py b/sky/utils/controller_utils.py
index 39045962a78..0c71357c856 100644
--- a/sky/utils/controller_utils.py
+++ b/sky/utils/controller_utils.py
@@ -28,6 +28,7 @@
 from sky.skylet import constants
 from sky.utils import common_utils
 from sky.utils import env_options
+from sky.utils import rich_utils
 from sky.utils import ux_utils
 
 if typing.TYPE_CHECKING:
@@ -192,7 +193,11 @@ def _get_cloud_dependencies_installation_commands(
     # TODO(tian): Make dependency installation command a method of cloud
    # class and get all installation command for enabled clouds.
    commands = []
-    prefix_str = 'Check & install cloud dependencies on controller: '
+    # We use <step>/<total> instead of string formatting, as we need to update
+    # the <total> at the end of the for loop, and python does not support
+    # partial string formatting.
+    prefix_str = ('[<step>/<total>] Check & install cloud dependencies '
+                  'on controller: ')
    # This is to make sure the shorter checking message does not have junk
    # characters from the previous message.
empty_str = ' ' * 10
@@ -203,6 +208,7 @@ def _get_cloud_dependencies_installation_commands(
         # other clouds will install boto3 but not awscli.
         'pip list | grep awscli> /dev/null 2>&1 || pip install "urllib3<2" '
         'awscli>=1.27.10 "colorama<0.4.5" > /dev/null 2>&1')
+    setup_clouds: List[str] = []
     for cloud in sky_check.get_cached_enabled_clouds_or_refresh():
         if isinstance(
                 clouds,
@@ -211,11 +217,16 @@ def _get_cloud_dependencies_installation_commands(
             # fluidstack and paperspace
             continue
         if isinstance(cloud, clouds.AWS):
-            commands.append(f'echo -en "\\r{prefix_str}AWS{empty_str}" && ' +
+            step_prefix = prefix_str.replace('<step>',
+                                             str(len(setup_clouds) + 1))
+            commands.append(f'echo -en "\\r{step_prefix}AWS{empty_str}" && ' +
                             aws_dependencies_installation)
+            setup_clouds.append(str(cloud))
         elif isinstance(cloud, clouds.Azure):
+            step_prefix = prefix_str.replace('<step>',
+                                             str(len(setup_clouds) + 1))
             commands.append(
-                f'echo -en "\\r{prefix_str}Azure{empty_str}" && '
+                f'echo -en "\\r{step_prefix}Azure{empty_str}" && '
                 'pip list | grep azure-cli > /dev/null 2>&1 || '
                 'pip install "azure-cli>=2.31.0" azure-core '
                 '"azure-identity>=1.13.0" azure-mgmt-network > /dev/null 2>&1')
@@ -225,9 +236,12 @@ def _get_cloud_dependencies_installation_commands(
             commands.append(
                 'pip list | grep azure-storage-blob > /dev/null 2>&1 || '
                 'pip install azure-storage-blob msgraph-sdk > /dev/null 2>&1')
+            setup_clouds.append(str(cloud))
         elif isinstance(cloud, clouds.GCP):
+            step_prefix = prefix_str.replace('<step>',
+                                             str(len(setup_clouds) + 1))
             commands.append(
-                f'echo -en "\\r{prefix_str}GCP{empty_str}" && '
+                f'echo -en "\\r{step_prefix}GCP{empty_str}" && '
                 'pip list | grep google-api-python-client > /dev/null 2>&1 || '
                 'pip install "google-api-python-client>=2.69.0" '
                 '> /dev/null 2>&1')
@@ -238,9 +252,12 @@ def _get_cloud_dependencies_installation_commands(
                 'pip list | grep google-cloud-storage > /dev/null 2>&1 || '
                 'pip install google-cloud-storage > /dev/null 2>&1')
             commands.append(f'{gcp.GOOGLE_SDK_INSTALLATION_COMMAND}')
+            setup_clouds.append(str(cloud))
         elif isinstance(cloud, clouds.Kubernetes):
+            step_prefix = prefix_str.replace('<step>',
+                                             str(len(setup_clouds) + 1))
             commands.append(
-                f'echo -en "\\r{prefix_str}Kubernetes{empty_str}" && '
+                f'echo -en "\\r{step_prefix}Kubernetes{empty_str}" && '
                 'pip list | grep kubernetes > /dev/null 2>&1 || '
                 'pip install "kubernetes>=20.0.0" > /dev/null 2>&1 &&'
                 # Install k8s + skypilot dependencies
                 'sudo bash -c "if '
                 '! command -v curl &> /dev/null || '
                 '! command -v socat &> /dev/null || '
                 '! command -v netcat &> /dev/null; '
-                'then apt update && apt install curl socat netcat -y '
-                '&> /dev/null; '
+                'then apt update &> /dev/null && '
+                'apt install curl socat netcat -y &> /dev/null; '
                 'fi" && '
                 # Install kubectl
                 '(command -v kubectl &>/dev/null || '
                 '(curl -s -LO "https://dl.k8s.io/release/$(curl -L -s '
                 'https://dl.k8s.io/release/stable.txt)'
                 '/bin/linux/amd64/kubectl" && '
                 'sudo install -o root -g root -m 0755 '
                 'kubectl /usr/local/bin/kubectl))')
+            setup_clouds.append(str(cloud))
         elif isinstance(cloud, clouds.Cudo):
+            step_prefix = prefix_str.replace('<step>',
+                                             str(len(setup_clouds) + 1))
             commands.append(
-                f'echo -en "\\r{prefix_str}Cudo{empty_str}" && '
+                f'echo -en "\\r{step_prefix}Cudo{empty_str}" && '
                 'pip list | grep cudo-compute > /dev/null 2>&1 || '
                 'pip install "cudo-compute>=0.1.10" > /dev/null 2>&1 && '
                 'wget https://download.cudo.org/compute/cudoctl-0.3.2-amd64.deb -O ~/cudoctl.deb > /dev/null 2>&1 && '  # pylint: disable=line-too-long
                 'sudo dpkg -i ~/cudoctl.deb > /dev/null 2>&1')
+            setup_clouds.append(str(cloud))
         elif isinstance(cloud, clouds.RunPod):
-            commands.append(f'echo -en "\\r{prefix_str}RunPod{empty_str}" && '
+            step_prefix = prefix_str.replace('<step>',
+                                             str(len(setup_clouds) + 1))
+            commands.append(f'echo -en "\\r{step_prefix}RunPod{empty_str}" && '
                             'pip list | grep runpod > /dev/null 2>&1 || '
                             'pip install "runpod>=1.5.1" > /dev/null 2>&1')
+            setup_clouds.append(str(cloud))
         if controller == Controllers.JOBS_CONTROLLER:
             if isinstance(cloud, clouds.IBM):
+                step_prefix = prefix_str.replace('<step>',
+                                                 str(len(setup_clouds) + 1))
                 commands.append(
-                    f'echo -en "\\r{prefix_str}IBM{empty_str}" '
+                    f'echo -en "\\r{step_prefix}IBM{empty_str}" '
                     '&& pip list | grep ibm-cloud-sdk-core > /dev/null 2>&1 || '
                     'pip install ibm-cloud-sdk-core ibm-vpc '
                     'ibm-platform-services ibm-cos-sdk > /dev/null 2>&1')
+                setup_clouds.append(str(cloud))
             elif isinstance(cloud, clouds.OCI):
+                step_prefix = prefix_str.replace('<step>',
+                                                 str(len(setup_clouds) + 1))
-                commands.append(f'echo -en "\\r{prefix_str}OCI{empty_str}" && '
+                commands.append(f'echo -en "\\r{step_prefix}OCI{empty_str}" && '
                                 'pip list | grep oci > /dev/null 2>&1 || '
                                 'pip install oci > /dev/null 2>&1')
+                setup_clouds.append(str(cloud))
     if (cloudflare.NAME
             in storage_lib.get_cached_enabled_storage_clouds_or_refresh()):
-        commands.append(f'echo -en "\\r{prefix_str}Cloudflare{empty_str}" && ' +
-                        aws_dependencies_installation)
-    commands.append(f'echo -e "\\r{prefix_str}Done for {len(commands)} '
-                    'clouds."')
+        step_prefix = prefix_str.replace('<step>', str(len(setup_clouds) + 1))
+        commands.append(
+            f'echo -en "\\r{step_prefix}Cloudflare{empty_str}" && ' +
+            aws_dependencies_installation)
+        setup_clouds.append(cloudflare.NAME)
+
+    finish_prefix = prefix_str.replace('[<step>/<total>] ', '  ')
+    commands.append(f'echo -e "\\r{finish_prefix}done.{empty_str}"')
+    commands = [
+        command.replace('<total>', str(len(setup_clouds)))
+        for command in commands
+    ]
     return commands
 
 
@@ -388,7 +426,7 @@ def shared_controller_vars_to_fill(
         'local_user_config_path': local_user_config_path,
     }
     env_vars: Dict[str, str] = {
-        env.value: '1' for env in env_options.Options if env.get()
+        env.env_key: str(int(env.get())) for env in env_options.Options
     }
     env_vars.update({
         # Should not use $USER here, as that env var can be empty when
@@ -396,7 +434,9 @@ def shared_controller_vars_to_fill(
         constants.USER_ENV_VAR: getpass.getuser(),
         constants.USER_ID_ENV_VAR: common_utils.get_user_hash(),
         # Skip cloud identity check to avoid the overhead.
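        # (The check is toggled via Options.SKIP_CLOUD_IDENTITY_CHECK,
        # defined in sky/utils/env_options.py.)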
- env_options.Options.SKIP_CLOUD_IDENTITY_CHECK.value: '1', + env_options.Options.SKIP_CLOUD_IDENTITY_CHECK.env_key: '1', + # Disable minimize logging to get more details on the controller. + env_options.Options.MINIMIZE_LOGGING.env_key: '0', }) if skypilot_config.loaded(): # Only set the SKYPILOT_CONFIG env var if the user has a config file. @@ -599,6 +639,7 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', # ================================================================ # Translate the workdir and local file mounts to cloud file mounts. # ================================================================ + run_id = common_utils.get_usage_run_id()[:8] original_file_mounts = task.file_mounts if task.file_mounts else {} original_storage_mounts = task.storage_mounts if task.storage_mounts else {} @@ -618,8 +659,12 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', elif has_local_source_paths_workdir: msg = 'workdir' if msg: - logger.info(f'{colorama.Fore.YELLOW}Translating {msg} to SkyPilot ' - f'Storage...{colorama.Style.RESET_ALL}') + logger.info( + ux_utils.starting_message(f'Translating {msg} to ' + 'SkyPilot Storage...')) + rich_utils.force_update_status( + ux_utils.spinner_message( + f'Translating {msg} to SkyPilot Storage...')) # Step 1: Translate the workdir to SkyPilot storage. new_storage_mounts = {} @@ -643,8 +688,8 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', }) # Check of the existence of the workdir in file_mounts is done in # the task construction. - logger.info(f'Workdir {workdir!r} will be synced to cloud storage ' - f'{bucket_name!r}.') + logger.info(f' {colorama.Style.DIM}Workdir: {workdir!r} ' + f'-> storage: {bucket_name!r}.{colorama.Style.RESET_ALL}') # Step 2: Translate the local file mounts with folder in src to SkyPilot # storage. @@ -668,9 +713,8 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', 'persistent': False, 'mode': 'COPY', }) - logger.info( - f'Folder in local file mount {src!r} will be synced to SkyPilot ' - f'storage {bucket_name}.') + logger.info(f' {colorama.Style.DIM}Folder : {src!r} ' + f'-> storage: {bucket_name!r}.{colorama.Style.RESET_ALL}') # Step 3: Translate local file mounts with file in src to SkyPilot storage. # Hard link the files in src to a temporary directory, and upload folder. @@ -703,10 +747,12 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', f'destination {file_mount_remote_tmp_dir} ' 'being taken.') sources = list(src_to_file_id.keys()) - sources_str = '\n\t'.join(sources) - logger.info('Source files in file_mounts will be synced to ' - f'cloud storage {file_bucket_name}:' - f'\n\t{sources_str}') + sources_str = '\n '.join(sources) + logger.info(f' {colorama.Style.DIM}Files (listed below) ' + f' -> storage: {file_bucket_name}:' + f'\n {sources_str}{colorama.Style.RESET_ALL}') + rich_utils.force_update_status( + ux_utils.spinner_message('Uploading translated local files/folders')) task.update_storage_mounts(new_storage_mounts) # Step 4: Upload storage from sources @@ -716,8 +762,9 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', if task.storage_mounts: # There may be existing (non-translated) storage mounts, so log this # whenever task.storage_mounts is non-empty. - logger.info(f'{colorama.Fore.YELLOW}Uploading sources to cloud storage.' 
- f'{colorama.Style.RESET_ALL} See: sky storage ls') + rich_utils.force_update_status( + ux_utils.spinner_message('Uploading local sources to storage[/] ' + '[dim]View storages: sky storage ls')) try: task.sync_storage_mounts() except ValueError as e: @@ -800,3 +847,5 @@ def maybe_translate_local_file_mounts_and_sync_up(task: 'task_lib.Task', }) updated_mount_storages[storage_path] = new_storage task.update_storage_mounts(updated_mount_storages) + if msg: + logger.info(ux_utils.finishing_message('Uploaded local files/folders.')) diff --git a/sky/utils/env_options.py b/sky/utils/env_options.py index 166bf42ce80..ebec8eeb90d 100644 --- a/sky/utils/env_options.py +++ b/sky/utils/env_options.py @@ -5,17 +5,32 @@ class Options(enum.Enum): """Environment variables for SkyPilot.""" - IS_DEVELOPER = 'SKYPILOT_DEV' - SHOW_DEBUG_INFO = 'SKYPILOT_DEBUG' - DISABLE_LOGGING = 'SKYPILOT_DISABLE_USAGE_COLLECTION' - MINIMIZE_LOGGING = 'SKYPILOT_MINIMIZE_LOGGING' + + # (env var name, default value) + IS_DEVELOPER = ('SKYPILOT_DEV', False) + SHOW_DEBUG_INFO = ('SKYPILOT_DEBUG', False) + DISABLE_LOGGING = ('SKYPILOT_DISABLE_USAGE_COLLECTION', False) + MINIMIZE_LOGGING = ('SKYPILOT_MINIMIZE_LOGGING', True) # Internal: this is used to skip the cloud user identity check, which is # used to protect cluster operations in a multi-identity scenario. # Currently, this is only used in the job and serve controller, as there # will not be multiple identities, and skipping the check can increase # robustness. - SKIP_CLOUD_IDENTITY_CHECK = 'SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK' + SKIP_CLOUD_IDENTITY_CHECK = ('SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK', False) + + def __init__(self, env_var: str, default: bool) -> None: + self.env_var = env_var + self.default = default - def get(self): + def __repr__(self) -> str: + return self.env_var + + def get(self) -> bool: """Check if an environment variable is set to True.""" - return os.getenv(self.value, 'False').lower() in ('true', '1') + return os.getenv(self.env_var, + str(self.default)).lower() in ('true', '1') + + @property + def env_key(self) -> str: + """The environment variable key name.""" + return self.value[0] diff --git a/sky/utils/log_utils.py b/sky/utils/log_utils.py index 8f7a152392e..e116f36819e 100644 --- a/sky/utils/log_utils.py +++ b/sky/utils/log_utils.py @@ -9,6 +9,7 @@ from sky import sky_logging from sky.utils import rich_utils +from sky.utils import ux_utils logger = sky_logging.init_logger(__name__) @@ -37,30 +38,34 @@ class ProvisionStatus(enum.Enum): RUNTIME_SETUP = 1 PULLING_DOCKER_IMAGES = 2 + def __init__(self, log_path: str): + self.log_path = log_path + def __enter__(self) -> None: self.state = self.ProvisionStatus.LAUNCH - self.status_display = rich_utils.safe_status('[bold cyan]Launching') + self.status_display = rich_utils.safe_status( + ux_utils.spinner_message('Launching', self.log_path)) self.status_display.start() def process_line(self, log_line: str) -> None: if ('Success.' in log_line and self.state == self.ProvisionStatus.LAUNCH): - logger.info(f'{colorama.Fore.GREEN}Head node is up.' 
- f'{colorama.Style.RESET_ALL}') + logger.info(' Head VM is up.') self.status_display.update( - '[bold cyan]Launching - Preparing SkyPilot runtime') + ux_utils.spinner_message( + 'Launching - Preparing SkyPilot runtime', self.log_path)) self.state = self.ProvisionStatus.RUNTIME_SETUP if ('Pulling from' in log_line and self.state == self.ProvisionStatus.RUNTIME_SETUP): self.status_display.update( - '[bold cyan]Launching - Pulling docker images') + ux_utils.spinner_message( + 'Launching - Initializing docker container', self.log_path)) self.state = self.ProvisionStatus.PULLING_DOCKER_IMAGES if ('Status: Downloaded newer image' in log_line and self.state == self.ProvisionStatus.PULLING_DOCKER_IMAGES): - logger.info(f'{colorama.Fore.GREEN}Docker image is downloaded.' - f'{colorama.Style.RESET_ALL}') self.status_display.update( - '[bold cyan]Launching - Preparing SkyPilot runtime') + ux_utils.spinner_message( + 'Launching - Preparing SkyPilot runtime', self.log_path)) self.state = self.ProvisionStatus.RUNTIME_SETUP def __exit__(self, except_type: Optional[Type[BaseException]], @@ -73,9 +78,10 @@ def __exit__(self, except_type: Optional[Type[BaseException]], class SkyLocalUpLineProcessor(LineProcessor): """A processor for `sky local up` log lines.""" - def __enter__(self) -> None: - status = rich_utils.safe_status('[bold cyan]Creating local cluster - ' - 'initializing Kubernetes') + def __enter__(self): + status = rich_utils.safe_status( + ux_utils.spinner_message('Creating local cluster - ' + 'initializing Kubernetes')) self.status_display = status self.status_display.start() @@ -84,31 +90,37 @@ def process_line(self, log_line: str) -> None: logger.info(f'{colorama.Fore.GREEN}Kubernetes is running.' f'{colorama.Style.RESET_ALL}') if 'Installing NVIDIA GPU operator...' in log_line: - self.status_display.update('[bold cyan]Creating local cluster - ' - 'Installing NVIDIA GPU operator') + self.status_display.update( + ux_utils.spinner_message('Creating local cluster - ' + 'Installing NVIDIA GPU operator')) if 'Starting wait for GPU operator installation...' in log_line: self.status_display.update( - '[bold cyan]Creating local cluster - ' - 'waiting for NVIDIA GPU operator installation to complete') + ux_utils.spinner_message( + 'Creating local cluster - ' + 'waiting for NVIDIA GPU operator installation to complete')) logger.info('To check NVIDIA GPU operator status, ' 'see pods: kubectl get pods -n gpu-operator') if 'GPU operator installed' in log_line: logger.info(f'{colorama.Fore.GREEN}NVIDIA GPU Operator installed.' f'{colorama.Style.RESET_ALL}') if 'Pulling SkyPilot GPU image...' in log_line: - self.status_display.update('[bold cyan]Creating local cluster - ' - 'pulling and loading SkyPilot GPU image') + self.status_display.update( + ux_utils.spinner_message( + 'Creating local cluster - ' + 'pulling and loading SkyPilot GPU image')) if 'SkyPilot GPU image loaded into kind cluster' in log_line: logger.info(f'{colorama.Fore.GREEN}SkyPilot GPU image pulled.' f'{colorama.Style.RESET_ALL}') if 'Labelling nodes with GPUs...' 
in log_line: - self.status_display.update('[bold cyan]Creating local cluster - ' - 'launching GPU labelling jobs') + self.status_display.update( + ux_utils.spinner_message('Creating local cluster - ' + 'launching GPU labelling jobs')) if ('Starting wait for SkyPilot GPU labeling jobs to complete' in log_line): self.status_display.update( - '[bold cyan]Creating local cluster - ' - 'waiting for GPU labelling jobs to complete') + ux_utils.spinner_message( + 'Creating local cluster - ' + 'waiting for GPU labelling jobs to complete')) logger.info( 'To check GPU labelling status, see jobs: ' 'kubectl get jobs -n kube-system -l job=sky-gpu-labeler') @@ -116,14 +128,17 @@ def process_line(self, log_line: str) -> None: logger.info(f'{colorama.Fore.GREEN}GPU labelling complete.' f'{colorama.Style.RESET_ALL}') if 'Pulling SkyPilot CPU image...' in log_line: - self.status_display.update('[bold cyan]Creating local cluster - ' - 'pulling and loading SkyPilot CPU image') + self.status_display.update( + ux_utils.spinner_message( + 'Creating local cluster - ' + 'pulling and loading SkyPilot CPU image')) if 'SkyPilot CPU image loaded into kind cluster' in log_line: logger.info(f'{colorama.Fore.GREEN}SkyPilot CPU image pulled.' f'{colorama.Style.RESET_ALL}') if 'Starting installation of Nginx Ingress Controller...' in log_line: self.status_display.update( - '[bold cyan]Creating Nginx Ingress Controller') + ux_utils.spinner_message('Creating local cluster - ' + 'creating Nginx Ingress Controller')) if 'Nginx Ingress Controller installed' in log_line: logger.info( f'{colorama.Fore.GREEN}Nginx Ingress Controller installed.' diff --git a/sky/utils/resources_utils.py b/sky/utils/resources_utils.py index 6f5c07f7d25..72aa5ac05d3 100644 --- a/sky/utils/resources_utils.py +++ b/sky/utils/resources_utils.py @@ -6,6 +6,8 @@ import typing from typing import List, Optional, Set +from sky import skypilot_config +from sky.clouds import cloud_registry from sky.utils import ux_utils if typing.TYPE_CHECKING: @@ -177,3 +179,24 @@ class FeasibleResources: resources_list: List['resources_lib.Resources'] fuzzy_candidate_list: List[str] hint: Optional[str] + + +def need_to_query_reservations() -> bool: + """Checks if we need to query reservations from cloud APIs. + + We need to query reservations if: + - The cloud has specific reservations. + - The cloud prioritizes reservations over on-demand instances. + + This is useful to skip the potentially expensive reservation query for + clouds that do not use reservations. 
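+
+    An illustrative example (hypothetical values): with the following in
+    ~/.sky/config.yaml, this function returns True, since
+    `gcp.prioritize_reservations` is set:
+
+        gcp:
+            prioritize_reservations: true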
+    """
+    for cloud_str in cloud_registry.CLOUD_REGISTRY.keys():
+        cloud_specific_reservations = skypilot_config.get_nested(
+            (cloud_str, 'specific_reservations'), None)
+        cloud_prioritize_reservations = skypilot_config.get_nested(
+            (cloud_str, 'prioritize_reservations'), False)
+        if (cloud_specific_reservations is not None or
+                cloud_prioritize_reservations):
+            return True
+    return False
diff --git a/sky/utils/rich_utils.py b/sky/utils/rich_utils.py
index 4b3dd07257e..6badf621294 100644
--- a/sky/utils/rich_utils.py
+++ b/sky/utils/rich_utils.py
@@ -5,8 +5,9 @@
 
 import rich.console as rich_console
 
-console = rich_console.Console()
+console = rich_console.Console(soft_wrap=True)
 _status = None
+_status_nesting_level = 0
 
 _logging_lock = threading.RLock()
 
@@ -30,19 +31,68 @@ def start(self):
         pass
 
 
+class _RevertibleStatus:
+    """A wrapper for status that can revert to previous message after exit."""
+
+    def __init__(self, message: str):
+        if _status is not None:
+            self.previous_message = _status.status
+        else:
+            self.previous_message = None
+        self.message = message
+
+    def __enter__(self):
+        global _status_nesting_level
+        _status.update(self.message)
+        _status_nesting_level += 1
+        _status.__enter__()
+        return _status
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        global _status_nesting_level, _status
+        _status_nesting_level -= 1
+        if _status_nesting_level <= 0:
+            _status_nesting_level = 0
+            if _status is not None:
+                _status.__exit__(exc_type, exc_val, exc_tb)
+                _status = None
+        else:
+            _status.update(self.previous_message)
+
+    def update(self, *args, **kwargs):
+        _status.update(*args, **kwargs)
+
+    def stop(self):
+        _status.stop()
+
+    def start(self):
+        _status.start()
+
+
 def safe_status(msg: str) -> Union['rich_console.Status', _NoOpConsoleStatus]:
     """A wrapper for multi-threaded console.status."""
     from sky import sky_logging  # pylint: disable=import-outside-toplevel
+    global _status
     if (threading.current_thread() is threading.main_thread() and
             not sky_logging.is_silent()):
-        global _status
         if _status is None:
-            _status = console.status(msg)
-        _status.update(msg)
-        return _status
+            _status = console.status(msg, refresh_per_second=8)
+        return _RevertibleStatus(msg)
     return _NoOpConsoleStatus()
 
 
+def stop_safe_status():
+    """Stops all nested statuses.
+
+    This is useful when we need to stop all statuses, e.g., when we are going
+    to stream logs from the user program and do not want it to interfere with
+    the spinner display.
+    """
+    if (threading.current_thread() is threading.main_thread() and
+            _status is not None):
+        _status.stop()
+
+
 def force_update_status(msg: str):
     """Update the status message even if sky_logging.is_silent() is true."""
     if (threading.current_thread() is threading.main_thread() and
diff --git a/sky/utils/ux_utils.py b/sky/utils/ux_utils.py
index 6f5c551dc13..f6699f355f8 100644
--- a/sky/utils/ux_utils.py
+++ b/sky/utils/ux_utils.py
@@ -1,9 +1,12 @@
 """Utility functions for UX."""
 import contextlib
+import os
 import sys
 import traceback
-from typing import Callable
+import typing
+from typing import Callable, Optional, Union
 
+import colorama
 import rich.console as rich_console
 
 from sky import sky_logging
@@ -11,11 +14,25 @@
 from sky.utils import env_options
 from sky.utils import ux_utils
 
+if typing.TYPE_CHECKING:
+    import pathlib
+
 console = rich_console.Console()
 
+INDENT_SYMBOL = f'{colorama.Style.DIM}β”œβ”€β”€ {colorama.Style.RESET_ALL}'
+INDENT_LAST_SYMBOL = f'{colorama.Style.DIM}└── {colorama.Style.RESET_ALL}'
+
+# Console formatting constants
+BOLD = '\033[1m'
+RESET_BOLD = '\033[0m'
+
+# Log path hint in the spinner during launching
+_LOG_PATH_HINT = (f'{colorama.Style.DIM}View logs at: {{log_path}}'
+                  f'{colorama.Style.RESET_ALL}')
+
 
 def console_newline():
-    """Print a newline to the console using rich.
+    """Prints a newline to the console using rich.
 
     Useful when catching exceptions inside console.status()
     """
@@ -50,7 +67,7 @@ def print_exception_no_traceback():
 
 @contextlib.contextmanager
 def enable_traceback():
-    """Revert the effect of print_exception_no_traceback().
+    """Reverts the effect of print_exception_no_traceback().
 
     This is used for usage_lib to collect the full traceback.
     """
@@ -61,7 +78,7 @@ def enable_traceback():
 
 
 class RedirectOutputForProcess:
-    """Redirect stdout and stderr to a file.
+    """Redirects stdout and stderr to a file.
 
-    This class enabled output redirect for multiprocessing.Process.
+    This class enables output redirect for multiprocessing.Process.
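+    Any exception raised inside the wrapped function is logged together
+    with its traceback before being re-raised (see `run` below).
+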
Example usage: @@ -102,3 +119,45 @@ def run(self, *args, **kwargs): with ux_utils.enable_traceback(): logger.error(f' Traceback:\n{traceback.format_exc()}') raise + + +def starting_message(message: str) -> str: + """Gets the starting message for the given message.""" + return f'βš™οΈŽ {message}' + + +def log_path_hint(log_path: Union[str, 'pathlib.Path']) -> str: + """Gets the log path hint for the given log path.""" + log_path = str(log_path) + expanded_home = os.path.expanduser('~') + if log_path.startswith(expanded_home): + log_path = '~' + log_path[len(expanded_home):] + return _LOG_PATH_HINT.format(log_path=log_path) + + +def finishing_message( + message: str, + log_path: Optional[Union[str, 'pathlib.Path']] = None) -> str: + """Gets the finishing message for the given message.""" + success_prefix = (f'{colorama.Fore.GREEN}βœ“ {message}' + f'{colorama.Style.RESET_ALL}') + if log_path is None: + return success_prefix + path_hint = log_path_hint(log_path) + return f'{success_prefix} {path_hint}' + + +def retry_message(message: str) -> str: + """Gets the retry message for the given message.""" + return f'{colorama.Fore.YELLOW}β†Ί{colorama.Style.RESET_ALL} {message}' + + +def spinner_message( + message: str, + log_path: Optional[Union[str, 'pathlib.Path']] = None) -> str: + """Gets the spinner message for the given message and log path.""" + colored_spinner = f'[bold cyan]{message}[/]' + if log_path is None: + return colored_spinner + path_hint = log_path_hint(log_path) + return f'{colored_spinner} {path_hint}' diff --git a/tests/skyserve/http/aws.yaml b/tests/skyserve/http/aws.yaml index cd7217b3d61..c33d3624ef7 100644 --- a/tests/skyserve/http/aws.yaml +++ b/tests/skyserve/http/aws.yaml @@ -5,10 +5,10 @@ service: replicas: 2 resources: - ports: 8081 + ports: 8080 cloud: aws cpus: 2+ workdir: examples/serve/http_server -run: python3 server.py +run: python3 server.py --port 8080 diff --git a/tests/test_smoke.py b/tests/test_smoke.py index 4d81015f9cd..22084e9c368 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -282,34 +282,49 @@ def test_example_app(): _VALIDATE_LAUNCH_OUTPUT = ( # Validate the output of the job submission: - # I 05-23 07:52:47 cloud_vm_ray_backend.py:3217] Running setup on 1 node. + # βš™οΈ Launching on Kubernetes. + # Pod is up. + # βœ“ Cluster launched: test. View logs at: ~/sky_logs/sky-2024-10-07-19-44-18-177288/provision.log + # βš™οΈ Running setup on 1 pod. # running setup - # I 05-23 07:52:49 cloud_vm_ray_backend.py:3230] Setup completed. - # I 05-23 07:52:55 cloud_vm_ray_backend.py:3319] Job submitted with Job ID: 1 - # I 05-23 07:52:58 log_lib.py:408] Start streaming logs for job 1. - # INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed). - # INFO: Waiting for task resources on 1 node. This will block if the cluster is full. - # INFO: All task resources reserved. - # INFO: Reserved IPs: ['10.128.0.127'] - # (min, pid=4164) # conda environments: - # (min, pid=4164) # - # (min, pid=4164) base * /opt/conda - # (min, pid=4164) - # (min, pid=4164) task run finish - # INFO: Job finished (status: SUCCEEDED). + # βœ“ Setup completed. + # βš™οΈ Job submitted, ID: 1. + # β”œβ”€β”€ Waiting for task resources on 1 node. + # └── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed) + # (min, pid=1277) # conda environments: + # (min, pid=1277) # + # (min, pid=1277) base * /opt/conda + # (min, pid=1277) + # (min, pid=1277) task run finish + # βœ“ Job finished (status: SUCCEEDED). 
+
+    # πŸ“‹ Useful Commands
+    # Job ID: 1
+    # β”œβ”€β”€ To cancel the job: sky cancel test 1
+    # β”œβ”€β”€ To stream job logs: sky logs test 1
+    # └── To view job queue: sky queue test
+
+    # Cluster name: test
+    # β”œβ”€β”€ To log into the head VM: ssh test
+    # β”œβ”€β”€ To submit a job: sky exec test yaml_file
+    # β”œβ”€β”€ To stop the cluster: sky stop test
+    # └── To teardown the cluster: sky down test
+    'echo "$s" && echo "==Validating launching==" && '
+    'echo "$s" | grep -A 1 "Launching on" | grep "is up." && '
     'echo "$s" && echo "==Validating setup output==" && '
     'echo "$s" | grep -A 1 "Running setup on" | grep "running setup" && '
     'echo "==Validating running output hints==" && echo "$s" | '
-    'grep -A 1 "Job submitted with Job ID:" | '
-    'grep "Start streaming logs for job" && '
+    'grep -A 1 "Job submitted, ID:" | '
+    'grep "Waiting for task resources on " && '
     'echo "==Validating task output starting==" && echo "$s" | '
-    'grep -A 1 "INFO: Reserved IPs" | grep "(min, pid=" && '
+    'grep -A 1 "Job started. Streaming logs..." | grep "(min, pid=" && '
     'echo "==Validating task output ending==" && '
     'echo "$s" | grep -A 1 "task run finish" | '
-    'grep "INFO: Job finished (status: SUCCEEDED)" && '
+    'grep "Job finished (status: SUCCEEDED)" && '
     'echo "==Validating task output ending 2==" && '
-    'echo "$s" | grep -A 1 "INFO: Job finished (status: SUCCEEDED)" | '
-    'grep "Job ID:"')
+    'echo "$s" | grep -A 5 "Job finished (status: SUCCEEDED)" | '
+    'grep "Useful Commands" && '
+    'echo "$s" | grep -A 1 "Useful Commands" | grep "Job ID:"')
 
 
 # ---------- A minimal task ----------
@@ -2647,7 +2662,7 @@ def test_managed_jobs(generic_cloud: str):
             f'{_JOB_QUEUE_WAIT}| grep {name}-1 | head -n1 | grep CANCELLED',
             # Test the functionality for logging.
             f's=$(sky jobs logs -n {name}-2 --no-follow); echo "$s"; echo "$s" | grep "start counting"',
-            f's=$(sky jobs logs --controller -n {name}-2 --no-follow); echo "$s"; echo "$s" | grep "Successfully provisioned cluster:"',
+            f's=$(sky jobs logs --controller -n {name}-2 --no-follow); echo "$s"; echo "$s" | grep "Cluster launched:"',
             f'{_JOB_QUEUE_WAIT}| grep {name}-2 | head -n1 | grep "RUNNING\|SUCCEEDED"',
         ],
        # TODO(zhwu): Change to _JOB_CANCEL_WAIT.format(job_name=f'{name}-1 -n {name}-2') when

From e4b7df7b0f3e22c377f003830d41db5ac73dbe0e Mon Sep 17 00:00:00 2001
From: Hysun He
Date: Mon, 14 Oct 2024 03:37:32 +0800
Subject: [PATCH 45/93] [OCI]: Bug fix: 1. sky config file path resolution. 2.
 fill in image_id as ocid in task YAML (#4074)

* Bug fix for sky config file path resolution.

* format

* [OCI] Bug fix for image_id in Task YAML

---
 sky/clouds/oci.py | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py
index 56dd60f8044..f4ac4d577e3 100644
--- a/sky/clouds/oci.py
+++ b/sky/clouds/oci.py
@@ -4,6 +4,19 @@
  - Hysun He (hysun.he@oracle.com) @ Apr, 2023: Initial implementation
  - Hysun He (hysun.he@oracle.com) @ May 4, 2023: Support use the default
    image_id (configurable) if no image_id specified in the task yaml.
+ - Hysun He (hysun.he@oracle.com) @ Oct 12, 2024:
+   get_credential_file_mounts(): bug fix for sky config
+   file path resolution (by os.path.expanduser) when constructing the file
+   mounts. This bug will cause the created worker nodes to be located in a
+   different compartment and VCN than the head node if the user specifies
+   compartment_id in the sky config file, because the ~/.sky/config is not
+   sync-ed to the remote machine.
+   The workaround is to set the sky config file path using an ENV variable
+   before running the sky launch:
+   export SKYPILOT_CONFIG=/home/ubuntu/.sky/config.yaml
+ - Hysun He (hysun.he@oracle.com) @ Oct 12, 2024:
+   make_deploy_resources_variables(): Bug fix for specifying the image_id
+   as the ocid of the image in the task.yaml file; in this case, the
+   image_id for the node config should be set to the ocid instead of a
+   dict.
 """
 import json
 import logging
@@ -211,7 +224,9 @@ def make_deploy_resources_variables(
             listing_id = image_cols[1]
             res_ver = image_cols[2]
         else:
-            image_id = resources.image_id
+            # Oct.12,2024 by HysunHe: Bug fix - resources.image_id is a
+            # dict. The image_id here should be in the ocid format.
+            image_id = image_str
             listing_id = None
             res_ver = None
 
@@ -447,7 +462,7 @@ def get_credential_file_mounts(self) -> Dict[str, str]:
         credential_files = [oci_cfg_file, api_key_file]
 
         # Sky config file is optional
-        if os.path.exists(sky_cfg_file):
+        if os.path.exists(os.path.expanduser(sky_cfg_file)):
             credential_files.append(sky_cfg_file)
 
         file_mounts = {

From 340f38404fe5d3ebe35ea430a67cb3377241d1f3 Mon Sep 17 00:00:00 2001
From: Nayan
Date: Mon, 14 Oct 2024 06:52:45 +0530
Subject: [PATCH 46/93] [Core] Turn on WAL mode for cluster job table (#3923)

* fix: add retry wrapper around db operation

* fix: enable wal mode for job table

* fix: remove retries from db utils

* fix: add WAL to jobs table

* fix: tidy up formatting

---
 sky/skylet/job_lib.py | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/sky/skylet/job_lib.py b/sky/skylet/job_lib.py
index 93bbe99b3ce..5e7008e55d8 100644
--- a/sky/skylet/job_lib.py
+++ b/sky/skylet/job_lib.py
@@ -8,6 +8,7 @@
 import os
 import pathlib
 import shlex
+import sqlite3
 import subprocess
 import time
 import typing
@@ -55,6 +56,20 @@ class JobInfoLoc(enum.IntEnum):
 
 
 def create_table(cursor, conn):
+    # Enable WAL mode to avoid locking issues.
+    # See: issue #3863, #1441 and PR #1509
+    # https://github.com/microsoft/WSL/issues/2395
+    # TODO(romilb): We do not enable WAL for WSL because of a known issue in
+    # WSL. This may cause the database-locked problem from WSL issue #1441.
+    if not common_utils.is_wsl():
+        try:
+            cursor.execute('PRAGMA journal_mode=WAL')
+        except sqlite3.OperationalError as e:
+            if 'database is locked' not in str(e):
+                raise
+            # If the database is locked, it is OK to continue, as the WAL mode
+            # is not critical and is likely to be enabled by other processes.
+
     cursor.execute("""\
         CREATE TABLE IF NOT EXISTS jobs (
         job_id INTEGER PRIMARY KEY AUTOINCREMENT,

From 1ff843f17a7b78d8d87e12bea57d6325423b8c37 Mon Sep 17 00:00:00 2001
From: Romil Bhardwaj
Date: Mon, 14 Oct 2024 10:57:42 -0700
Subject: [PATCH 47/93] [docs] Unroll k8s internal load balancer docs (#4083)

unroll load balancer docs

---
 .../reference/kubernetes/kubernetes-ports.rst | 44 +++++--------------
 1 file changed, 11 insertions(+), 33 deletions(-)

diff --git a/docs/source/reference/kubernetes/kubernetes-ports.rst b/docs/source/reference/kubernetes/kubernetes-ports.rst
index 3824b651717..4f8476c1bbc 100644
--- a/docs/source/reference/kubernetes/kubernetes-ports.rst
+++ b/docs/source/reference/kubernetes/kubernetes-ports.rst
@@ -59,40 +59,18 @@ To restrict your services to be accessible only within the cluster, you can set
 
 Depending on your cloud, set the appropriate annotation in the SkyPilot config file (``~/.sky/config.yaml``):
 
-.. tab-set::
-
-    .. tab-item:: GCP
-        :sync: internal-lb-gke
-
-        .. 
code-block:: yaml - - # ~/.sky/config.yaml - kubernetes: - custom_metadata: - annotations: - networking.gke.io/load-balancer-type: "Internal" - - .. tab-item:: AWS - :sync: internal-lb-aws - - .. code-block:: yaml - - # ~/.sky/config.yaml - kubernetes: - custom_metadata: - annotations: - service.beta.kubernetes.io/aws-load-balancer-internal: "true" - - .. tab-item:: Azure - :sync: internal-lb-azure - - .. code-block:: yaml +.. code-block:: yaml - # ~/.sky/config.yaml - kubernetes: - custom_metadata: - annotations: - service.beta.kubernetes.io/azure-load-balancer-internal: "true" + # ~/.sky/config.yaml + kubernetes: + custom_metadata: + annotations: + # For GCP/GKE + networking.gke.io/load-balancer-type: "Internal" + # For AWS/EKS + service.beta.kubernetes.io/aws-load-balancer-internal: "true" + # For Azure/AKS + service.beta.kubernetes.io/azure-load-balancer-internal: "true" .. _kubernetes-ingress: From a0243e56484797c745f42c04260d34e6d280a384 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Mon, 14 Oct 2024 12:49:45 -0700 Subject: [PATCH 48/93] [docs] `sky status --kubernetes` docs (#4064) * observability docs * comments --- .../kubernetes/kubernetes-getting-started.rst | 51 +++++++++++++++++ .../reference/kubernetes/kubernetes-setup.rst | 57 +++++++++++++++++-- 2 files changed, 104 insertions(+), 4 deletions(-) diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst index 4f87c8a6ee7..d7313fba3e2 100644 --- a/docs/source/reference/kubernetes/kubernetes-getting-started.rst +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -119,6 +119,57 @@ Once your cluster administrator has :ref:`setup a Kubernetes cluster `_ for easily viewing and managing -SkyPilot tasks running on your cluster. +Below, we provide tips on how to monitor SkyPilot resources on your Kubernetes cluster. + +.. _kubernetes-observability-skystatus: + +List SkyPilot resources across all users +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +We provide a convenience command, :code:`sky status --k8s`, to view the status of all SkyPilot resources in the cluster. + +Unlike :code:`sky status` which lists only the SkyPilot resources launched by the current user, +:code:`sky status --k8s` lists all SkyPilot resources in the cluster across all users. + +.. code-block:: console + + $ sky status --k8s + Kubernetes cluster state (context: mycluster) + SkyPilot clusters + USER NAME LAUNCHED RESOURCES STATUS + alice infer-svc-1 23 hrs ago 1x Kubernetes(cpus=1, mem=1, {'L4': 1}) UP + alice sky-jobs-controller-80b50983 2 days ago 1x Kubernetes(cpus=4, mem=4) UP + alice sky-serve-controller-80b50983 23 hrs ago 1x Kubernetes(cpus=4, mem=4) UP + bob dev 1 day ago 1x Kubernetes(cpus=2, mem=8, {'H100': 1}) UP + bob multinode-dev 1 day ago 2x Kubernetes(cpus=2, mem=2) UP + bob sky-jobs-controller-2ea485ea 2 days ago 1x Kubernetes(cpus=4, mem=4) UP + + Managed jobs + In progress tasks: 1 STARTING + USER ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + alice 1 - eval 1x[CPU:1+] 2 days ago 49s 8s 0 SUCCEEDED + bob 4 - pretrain 1x[H100:4] 1 day ago 1h 1m 11s 1h 14s 0 SUCCEEDED + bob 3 - bigjob 1x[CPU:16] 1 day ago 1d 21h 11m 4s - 0 STARTING + bob 2 - failjob 1x[CPU:1+] 1 day ago 54s 9s 0 FAILED + bob 1 - shortjob 1x[CPU:1+] 2 days ago 1h 1m 19s 1h 16s 0 SUCCEEDED + + +.. 
_kubernetes-observability-dashboard: + +Kubernetes Dashboard +^^^^^^^^^^^^^^^^^^^^ +You can deploy tools such as the `Kubernetes dashboard `_ to easily view and manage +SkyPilot resources on your cluster. .. image:: ../../images/screenshots/kubernetes/kubernetes-dashboard.png :width: 80% From 92431134d2061f961e81106a8597c2b81c28453e Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Mon, 14 Oct 2024 15:48:02 -0700 Subject: [PATCH 49/93] [UX] Show log after failure and fix the color issue with narrow window (#4084) * fix narrow window and show log path during exception * format * format --- sky/backends/cloud_vm_ray_backend.py | 6 ++--- sky/provision/provisioner.py | 5 +++- sky/utils/ux_utils.py | 40 ++++++++++++++++++++++------ 3 files changed, 39 insertions(+), 12 deletions(-) diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index aceac8951b0..f0fb4d97ba1 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -2844,9 +2844,9 @@ def _provision( time.sleep(gap_seconds) continue logger.error( - f'{colorama.Fore.RED}β¨―{colorama.Style.RESET_ALL} ' - 'Failed to provision resources. ' - f'{ux_utils.log_path_hint(log_path)}') + ux_utils.error_message( + 'Failed to provision resources. ' + f'{ux_utils.log_path_hint(log_path)}')) error_message += ( '\nTo keep retrying until the cluster is up, use ' 'the `--retry-until-up` flag.') diff --git a/sky/provision/provisioner.py b/sky/provision/provisioner.py index b2ac6d6660f..7706a3d489b 100644 --- a/sky/provision/provisioner.py +++ b/sky/provision/provisioner.py @@ -571,7 +571,10 @@ def post_provision_runtime_setup( provision_record=provision_record, custom_resource=custom_resource) except Exception: # pylint: disable=broad-except - logger.error('*** Failed setting up cluster. ***') + logger.error( + ux_utils.error_message( + 'Failed to set up SkyPilot runtime on cluster.', + provision_logging.config.log_path)) logger.debug(f'Stacktrace:\n{traceback.format_exc()}') with ux_utils.print_exception_no_traceback(): raise diff --git a/sky/utils/ux_utils.py b/sky/utils/ux_utils.py index f6699f355f8..2fffa8a9df9 100644 --- a/sky/utils/ux_utils.py +++ b/sky/utils/ux_utils.py @@ -121,11 +121,6 @@ def run(self, *args, **kwargs): raise -def starting_message(message: str) -> str: - """Gets the starting message for the given message.""" - return f'βš™οΈŽ {message}' - - def log_path_hint(log_path: Union[str, 'pathlib.Path']) -> str: """Gets the log path hint for the given log path.""" log_path = str(log_path) @@ -135,21 +130,50 @@ def log_path_hint(log_path: Union[str, 'pathlib.Path']) -> str: return _LOG_PATH_HINT.format(log_path=log_path) +def starting_message(message: str) -> str: + """Gets the starting message for the given message.""" + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. + return f'{colorama.Style.RESET_ALL}βš™οΈŽ {message}' + + def finishing_message( message: str, log_path: Optional[Union[str, 'pathlib.Path']] = None) -> str: """Gets the finishing message for the given message.""" - success_prefix = (f'{colorama.Fore.GREEN}βœ“ {message}' - f'{colorama.Style.RESET_ALL}') + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. 
+ success_prefix = (f'{colorama.Style.RESET_ALL}{colorama.Fore.GREEN}βœ“ ' + f'{message}{colorama.Style.RESET_ALL}') if log_path is None: return success_prefix path_hint = log_path_hint(log_path) return f'{success_prefix} {path_hint}' +def error_message(message: str, + log_path: Optional[Union[str, 'pathlib.Path']] = None) -> str: + """Gets the error message for the given message.""" + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. + error_prefix = (f'{colorama.Style.RESET_ALL}{colorama.Fore.RED}β¨―' + f'{colorama.Style.RESET_ALL} {message}') + if log_path is None: + return error_prefix + path_hint = log_path_hint(log_path) + return f'{error_prefix} {path_hint}' + + def retry_message(message: str) -> str: """Gets the retry message for the given message.""" - return f'{colorama.Fore.YELLOW}β†Ί{colorama.Style.RESET_ALL} {message}' + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. + return (f'{colorama.Style.RESET_ALL}{colorama.Fore.YELLOW}β†Ί' + f'{colorama.Style.RESET_ALL} {message}') def spinner_message( From a4e2fcd438d70373377c85bcbec1b185ef04c99f Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Tue, 15 Oct 2024 00:26:39 -0700 Subject: [PATCH 50/93] [k8s] `sky status --k8s` refactor (#4079) * refactor * lint * refactor, dataclass * refactor, dataclass * refactor * lint --- sky/backends/backend_utils.py | 4 +- sky/cli.py | 52 +--------- sky/core.py | 77 +++++++++++++- sky/provision/kubernetes/utils.py | 113 +++++++++++++++++++++ sky/utils/cli_utils/status_utils.py | 149 +++------------------------- 5 files changed, 209 insertions(+), 186 deletions(-) diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py index 1f213f5c614..2521fcbcfe5 100644 --- a/sky/backends/backend_utils.py +++ b/sky/backends/backend_utils.py @@ -56,7 +56,7 @@ from sky.utils import ux_utils if typing.TYPE_CHECKING: - from sky import resources + from sky import resources as resources_lib from sky import task as task_lib from sky.backends import cloud_vm_ray_backend from sky.backends import local_docker_backend @@ -751,7 +751,7 @@ def _restore_block(new_block: Dict[str, Any], old_block: Dict[str, Any]): # TODO: too many things happening here - leaky abstraction. Refactor. @timeline.event def write_cluster_config( - to_provision: 'resources.Resources', + to_provision: 'resources_lib.Resources', num_nodes: int, cluster_config_template: str, cluster_name: str, diff --git a/sky/cli.py b/sky/cli.py index 87d35f58d1c..114c18c9256 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -1464,54 +1464,8 @@ def _status_kubernetes(show_all: bool): Args: show_all (bool): Show all job information (e.g., start time, failures). 
""" - context = kubernetes_utils.get_current_kube_config_context_name() - try: - pods = kubernetes_utils.get_skypilot_pods(context) - except exceptions.ResourcesUnavailableError as e: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Failed to get SkyPilot pods from ' - f'Kubernetes: {str(e)}') from e - all_clusters, jobs_controllers, serve_controllers = ( - status_utils.process_skypilot_pods(pods, context)) - all_jobs = [] - with rich_utils.safe_status( - '[bold cyan]Checking in-progress managed jobs[/]') as spinner: - for i, (_, job_controller_info) in enumerate(jobs_controllers.items()): - user = job_controller_info['user'] - pod = job_controller_info['pods'][0] - status_message = ('[bold cyan]Checking managed jobs controller') - if len(jobs_controllers) > 1: - status_message += f's ({i+1}/{len(jobs_controllers)})' - spinner.update(f'{status_message}[/]') - try: - job_list = managed_jobs.queue_from_kubernetes_pod( - pod.metadata.name) - except RuntimeError as e: - logger.warning('Failed to get managed jobs from controller ' - f'{pod.metadata.name}: {str(e)}') - job_list = [] - # Add user field to jobs - for job in job_list: - job['user'] = user - all_jobs.extend(job_list) - # Reconcile cluster state between managed jobs and clusters: - # To maintain a clear separation between regular SkyPilot clusters - # and those from managed jobs, we need to exclude the latter from - # the main cluster list. - # We do this by reconstructing managed job cluster names from each - # job's name and ID. We then use this set to filter out managed - # clusters from the main cluster list. This is necessary because there - # are no identifiers distinguishing clusters from managed jobs from - # regular clusters. - managed_job_cluster_names = set() - for job in all_jobs: - # Managed job cluster name is - - managed_cluster_name = f'{job["job_name"]}-{job["job_id"]}' - managed_job_cluster_names.add(managed_cluster_name) - unmanaged_clusters = [ - c for c in all_clusters - if c['cluster_name'] not in managed_job_cluster_names - ] + all_clusters, unmanaged_clusters, all_jobs, context = ( + core.status_kubernetes()) click.echo(f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' f'Kubernetes cluster state (context: {context})' f'{colorama.Style.RESET_ALL}') @@ -1523,7 +1477,7 @@ def _status_kubernetes(show_all: bool): f'{colorama.Style.RESET_ALL}') msg = managed_jobs.format_job_table(all_jobs, show_all=show_all) click.echo(msg) - if serve_controllers: + if any(['sky-serve-controller' in c.cluster_name for c in all_clusters]): # TODO: Parse serve controllers and show services separately. # Currently we show a hint that services are shown as clusters. 
click.echo(f'\n{colorama.Style.DIM}Hint: SkyServe replica pods are '
diff --git a/sky/core.py b/sky/core.py
index fa695bda687..496b8b8ad5e 100644
--- a/sky/core.py
+++ b/sky/core.py
@@ -1,7 +1,7 @@
 """SDK functions for cluster/job management."""
 import getpass
 import typing
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Tuple, Union
 
 import colorama
 
@@ -11,10 +11,12 @@
 from sky import data
 from sky import exceptions
 from sky import global_user_state
+from sky import jobs as managed_jobs
 from sky import sky_logging
 from sky import status_lib
 from sky import task
 from sky.backends import backend_utils
+from sky.provision.kubernetes import utils as kubernetes_utils
 from sky.skylet import constants
 from sky.skylet import job_lib
 from sky.usage import usage_lib
@@ -111,6 +113,79 @@ def status(cluster_names: Optional[Union[str, List[str]]] = None,
                                                 cluster_names=cluster_names)
 
 
+def status_kubernetes(
+) -> Tuple[List['kubernetes_utils.KubernetesSkyPilotClusterInfo'],
+           List['kubernetes_utils.KubernetesSkyPilotClusterInfo'], List[Dict[
+               str, Any]], Optional[str]]:
+    """Get all SkyPilot clusters and jobs in the Kubernetes cluster.
+
+    Managed jobs and services are also included in the clusters returned.
+    The caller must parse the controllers to identify which clusters are run
+    as managed jobs or services.
+
+    Returns:
+        A tuple containing:
+        - all_clusters: List of KubernetesSkyPilotClusterInfo with info for
+          all clusters, including managed jobs, services and controllers.
+        - unmanaged_clusters: List of KubernetesSkyPilotClusterInfo with info
+          for all clusters excluding managed jobs and services. Controllers
+          are included.
+        - all_jobs: List of managed jobs from all controllers. Each entry is a
+          dictionary of job info, see jobs.queue_from_kubernetes_pod for
+          details.
+        - context: Kubernetes context used to fetch the cluster information.
+    """
+    context = kubernetes_utils.get_current_kube_config_context_name()
+    try:
+        pods = kubernetes_utils.get_skypilot_pods(context)
+    except exceptions.ResourcesUnavailableError as e:
+        with ux_utils.print_exception_no_traceback():
+            raise ValueError('Failed to get SkyPilot pods from '
+                             f'Kubernetes: {str(e)}') from e
+    all_clusters, jobs_controllers, _ = (kubernetes_utils.process_skypilot_pods(
+        pods, context))
+    all_jobs = []
+    with rich_utils.safe_status(
+            ux_utils.spinner_message(
+                'Checking in-progress managed jobs')) as spinner:
+        for i, job_controller_info in enumerate(jobs_controllers):
+            user = job_controller_info.user
+            pod = job_controller_info.pods[0]
+            status_message = '[bold cyan]Checking managed jobs controller'
+            if len(jobs_controllers) > 1:
+                status_message += f's ({i + 1}/{len(jobs_controllers)})'
+            spinner.update(f'{status_message}[/]')
+            try:
+                job_list = managed_jobs.queue_from_kubernetes_pod(
+                    pod.metadata.name)
+            except RuntimeError as e:
+                logger.warning('Failed to get managed jobs from controller '
+                               f'{pod.metadata.name}: {str(e)}')
+                job_list = []
+            # Add user field to jobs
+            for job in job_list:
+                job['user'] = user
+            all_jobs.extend(job_list)
+        # Reconcile cluster state between managed jobs and clusters:
+        # To maintain a clear separation between regular SkyPilot clusters
+        # and those from managed jobs, we need to exclude the latter from
+        # the main cluster list.
+        # We do this by reconstructing managed job cluster names from each
+        # job's name and ID. 
We then use this set to filter out managed + # clusters from the main cluster list. This is necessary because there + # are no identifiers distinguishing clusters from managed jobs from + # regular clusters. + managed_job_cluster_names = set() + for job in all_jobs: + # Managed job cluster name is - + managed_cluster_name = f'{job["job_name"]}-{job["job_id"]}' + managed_job_cluster_names.add(managed_cluster_name) + unmanaged_clusters = [ + c for c in all_clusters + if c.cluster_name not in managed_job_cluster_names + ] + return all_clusters, unmanaged_clusters, all_jobs, context + + def endpoints(cluster: str, port: Optional[Union[int, str]] = None) -> Dict[int, str]: """Gets the endpoint for a given cluster and port number (endpoint). diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index 3924074838e..0156c4d1091 100644 --- a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -15,9 +15,11 @@ import yaml import sky +from sky import clouds from sky import exceptions from sky import sky_logging from sky import skypilot_config +from sky import status_lib from sky.adaptors import kubernetes from sky.provision import constants as provision_constants from sky.provision.kubernetes import network_utils @@ -30,6 +32,7 @@ if typing.TYPE_CHECKING: from sky import backends + from sky import resources as resources_lib # TODO(romilb): Move constants to constants.py DEFAULT_NAMESPACE = 'default' @@ -2023,3 +2026,113 @@ def get_skypilot_pods(context: Optional[str] = None) -> List[Any]: 'kubectl get pods --selector=skypilot-cluster --all-namespaces' ) from None return pods + + +@dataclasses.dataclass +class KubernetesSkyPilotClusterInfo: + cluster_name_on_cloud: str + cluster_name: str + user: str + status: status_lib.ClusterStatus + pods: List[Any] + launched_at: float + resources: 'resources_lib.Resources' + resources_str: str + + +def process_skypilot_pods( + pods: List[Any], + context: Optional[str] = None +) -> Tuple[List[KubernetesSkyPilotClusterInfo], + List[KubernetesSkyPilotClusterInfo], + List[KubernetesSkyPilotClusterInfo]]: + """Process SkyPilot pods on k8s to extract cluster and controller info. + + Args: + pods: List of Kubernetes pod objects. + context: Kubernetes context name, used to detect GPU label formatter. + + Returns: + A tuple containing: + - List of KubernetesSkyPilotClusterInfo with all cluster info. + - List of KubernetesSkyPilotClusterInfo with job controller info. + - List of KubernetesSkyPilotClusterInfo with serve controller info. 
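+
+    Note: pods in the 'Pending' phase are skipped, as they do not yet
+    correspond to a running cluster.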
+ """ + # pylint: disable=import-outside-toplevel + from sky import resources as resources_lib + clusters: Dict[str, KubernetesSkyPilotClusterInfo] = {} + jobs_controllers: List[KubernetesSkyPilotClusterInfo] = [] + serve_controllers: List[KubernetesSkyPilotClusterInfo] = [] + + for pod in pods: + cluster_name_on_cloud = pod.metadata.labels.get('skypilot-cluster') + cluster_name = cluster_name_on_cloud.rsplit( + '-', 1 + )[0] # Remove the user hash to get cluster name (e.g., mycluster-2ea4) + if cluster_name_on_cloud not in clusters: + # Parse the start time for the cluster + start_time = pod.status.start_time + if start_time is not None: + start_time = pod.status.start_time.timestamp() + + # Parse resources + cpu_request = parse_cpu_or_gpu_resource( + pod.spec.containers[0].resources.requests.get('cpu', '0')) + memory_request = parse_memory_resource( + pod.spec.containers[0].resources.requests.get('memory', '0'), + unit='G') + gpu_count = parse_cpu_or_gpu_resource( + pod.spec.containers[0].resources.requests.get( + 'nvidia.com/gpu', '0')) + gpu_name = None + if gpu_count > 0: + label_formatter, _ = (detect_gpu_label_formatter(context)) + assert label_formatter is not None, ( + 'GPU label formatter cannot be None if there are pods ' + f'requesting GPUs: {pod.metadata.name}') + gpu_label = label_formatter.get_label_key() + # Get GPU name from pod node selector + if pod.spec.node_selector is not None: + gpu_name = label_formatter.get_accelerator_from_label_value( + pod.spec.node_selector.get(gpu_label)) + + resources = resources_lib.Resources( + cloud=clouds.Kubernetes(), + cpus=int(cpu_request), + memory=int(memory_request), + accelerators=(f'{gpu_name}:{gpu_count}' + if gpu_count > 0 else None)) + if pod.status.phase == 'Pending': + # If pod is pending, do not show it in the status + continue + + cluster_info = KubernetesSkyPilotClusterInfo( + cluster_name_on_cloud=cluster_name_on_cloud, + cluster_name=cluster_name, + user=pod.metadata.labels.get('skypilot-user'), + status=status_lib.ClusterStatus.UP, + pods=[], + launched_at=start_time, + resources=resources, + resources_str='') + clusters[cluster_name_on_cloud] = cluster_info + # Check if cluster name is name of a controller + # Can't use controller_utils.Controllers.from_name(cluster_name) + # because hash is different across users + if 'sky-jobs-controller' in cluster_name_on_cloud: + jobs_controllers.append(cluster_info) + elif 'sky-serve-controller' in cluster_name_on_cloud: + serve_controllers.append(cluster_info) + else: + # Update start_time if this pod started earlier + pod_start_time = pod.status.start_time + if pod_start_time is not None: + pod_start_time = pod_start_time.timestamp() + if pod_start_time < clusters[cluster_name_on_cloud].launched_at: + clusters[cluster_name_on_cloud].launched_at = pod_start_time + clusters[cluster_name_on_cloud].pods.append(pod) + # Update resources_str in clusters: + for cluster in clusters.values(): + num_pods = len(cluster.pods) + cluster.resources_str = f'{num_pods}x {cluster.resources}' + return list(clusters.values()), jobs_controllers, serve_controllers diff --git a/sky/utils/cli_utils/status_utils.py b/sky/utils/cli_utils/status_utils.py index 09172f24814..96f9b5e9946 100644 --- a/sky/utils/cli_utils/status_utils.py +++ b/sky/utils/cli_utils/status_utils.py @@ -1,19 +1,20 @@ """Utilities for sky status.""" -from typing import Any, Callable, Dict, List, Optional, Tuple +import typing +from typing import Any, Callable, Dict, List, Optional import click import colorama from sky import 
backends -from sky import clouds as sky_clouds -from sky import resources as resources_lib from sky import status_lib -from sky.provision.kubernetes import utils as kubernetes_utils from sky.skylet import constants from sky.utils import common_utils from sky.utils import log_utils from sky.utils import resources_utils +if typing.TYPE_CHECKING: + from sky.provision.kubernetes import utils as kubernetes_utils + COMMAND_TRUNC_LENGTH = 25 NUM_COST_REPORT_LINES = 5 @@ -303,19 +304,19 @@ def _get_estimated_cost_for_cost_report( return f'$ {cost:.2f}' -def show_kubernetes_cluster_status_table(clusters: List[Any], - show_all: bool) -> None: +def show_kubernetes_cluster_status_table( + clusters: List['kubernetes_utils.KubernetesSkyPilotClusterInfo'], + show_all: bool) -> None: """Compute cluster table values and display for Kubernetes clusters.""" status_columns = [ - StatusColumn('USER', lambda c: c['user']), - StatusColumn('NAME', lambda c: c['cluster_name']), - StatusColumn( - 'LAUNCHED', - lambda c: log_utils.readable_time_duration(c['launched_at'])), + StatusColumn('USER', lambda c: c.user), + StatusColumn('NAME', lambda c: c.cluster_name), + StatusColumn('LAUNCHED', + lambda c: log_utils.readable_time_duration(c.launched_at)), StatusColumn('RESOURCES', - lambda c: c['resources_str'], + lambda c: c.resources_str, trunc_length=70 if not show_all else 0), - StatusColumn('STATUS', lambda c: c['status'].colored_str()), + StatusColumn('STATUS', lambda c: c.status.colored_str()), # TODO(romilb): We should consider adding POD_NAME field here when --all # is passed to help users fetch pod name programmatically. ] @@ -326,8 +327,7 @@ def show_kubernetes_cluster_status_table(clusters: List[Any], cluster_table = log_utils.create_table(columns) # Sort table by user, then by cluster name - sorted_clusters = sorted(clusters, - key=lambda c: (c['user'], c['cluster_name'])) + sorted_clusters = sorted(clusters, key=lambda c: (c.user, c.cluster_name)) for cluster in sorted_clusters: row = [] @@ -344,122 +344,3 @@ def show_kubernetes_cluster_status_table(clusters: List[Any], else: click.echo('No SkyPilot resources found in the ' 'active Kubernetes context.') - - -def process_skypilot_pods( - pods: List[Any], - context: Optional[str] = None -) -> Tuple[List[Dict[Any, Any]], Dict[str, Any], Dict[str, Any]]: - """Process SkyPilot pods on k8s to extract cluster and controller info. - - Args: - pods: List of Kubernetes pod objects. - context: Kubernetes context name, used to detect GPU label formatter. - - Returns: - A tuple containing: - - List of dictionaries with cluster information. - - Dictionary of job controller information. - - Dictionary of serve controller information. - - Each dictionary contains the following keys: - 'cluster_name_on_cloud': The cluster_name_on_cloud used by SkyPilot - 'cluster_name': The cluster name without the user hash - 'user': The user who created the cluster. 
Fetched from pod label - 'status': The cluster status (assumed UP if pod exists) - 'pods': List of pod objects in the cluster - 'launched_at': Timestamp of when the cluster was launched - 'resources': sky.Resources object for the cluster - """ - clusters: Dict[str, Dict] = {} - jobs_controllers: Dict[str, Dict] = {} - serve_controllers: Dict[str, Dict] = {} - - for pod in pods: - cluster_name_on_cloud = pod.metadata.labels.get('skypilot-cluster') - cluster_name = cluster_name_on_cloud.rsplit( - '-', 1 - )[0] # Remove the user hash to get cluster name (e.g., mycluster-2ea4) - - # Check if cluster name is name of a controller - # Can't use controller_utils.Controllers.from_name(cluster_name) - # because hash is different across users - if 'controller' in cluster_name_on_cloud: - start_time = pod.status.start_time.timestamp() - controller_info = { - 'cluster_name_on_cloud': cluster_name_on_cloud, - 'cluster_name': cluster_name, - 'user': pod.metadata.labels.get('skypilot-user'), - 'status': status_lib.ClusterStatus.UP, - # Assuming UP if pod exists - 'pods': [pod], - 'launched_at': start_time - } - if 'sky-jobs-controller' in cluster_name_on_cloud: - jobs_controllers[cluster_name_on_cloud] = controller_info - elif 'sky-serve-controller' in cluster_name_on_cloud: - serve_controllers[cluster_name_on_cloud] = controller_info - - if cluster_name_on_cloud not in clusters: - # Parse the start time for the cluster - start_time = pod.status.start_time - if start_time is not None: - start_time = pod.status.start_time.timestamp() - - # Parse resources - cpu_request = kubernetes_utils.parse_cpu_or_gpu_resource( - pod.spec.containers[0].resources.requests.get('cpu', '0')) - memory_request = kubernetes_utils.parse_memory_resource( - pod.spec.containers[0].resources.requests.get('memory', '0'), - unit='G') - gpu_count = kubernetes_utils.parse_cpu_or_gpu_resource( - pod.spec.containers[0].resources.requests.get( - 'nvidia.com/gpu', '0')) - if gpu_count > 0: - label_formatter, _ = ( - kubernetes_utils.detect_gpu_label_formatter(context)) - assert label_formatter is not None, ( - 'GPU label formatter cannot be None if there are pods ' - f'requesting GPUs: {pod.metadata.name}') - gpu_label = label_formatter.get_label_key() - # Get GPU name from pod node selector - if pod.spec.node_selector is not None: - gpu_name = label_formatter.get_accelerator_from_label_value( - pod.spec.node_selector.get(gpu_label)) - - resources = resources_lib.Resources( - cloud=sky_clouds.Kubernetes(), - cpus=int(cpu_request), - memory=int(memory_request), - accelerators=(f'{gpu_name}:{gpu_count}' - if gpu_count > 0 else None)) - if pod.status.phase == 'Pending': - # If pod is pending, do not show it in the status - continue - - clusters[cluster_name_on_cloud] = { - 'cluster_name_on_cloud': cluster_name_on_cloud, - 'cluster_name': cluster_name, - 'user': pod.metadata.labels.get('skypilot-user'), - 'status': status_lib.ClusterStatus.UP, - 'pods': [], - 'launched_at': start_time, - 'resources': resources, - } - else: - # Update start_time if this pod started earlier - pod_start_time = pod.status.start_time - if pod_start_time is not None: - pod_start_time = pod_start_time.timestamp() - if pod_start_time < clusters[cluster_name_on_cloud][ - 'launched_at']: - clusters[cluster_name_on_cloud][ - 'launched_at'] = pod_start_time - clusters[cluster_name_on_cloud]['pods'].append(pod) - # Update resources_str in clusters: - for cluster_name, cluster in clusters.items(): - resources = cluster['resources'] - num_pods = len(cluster['pods']) - 
resources_str = f'{num_pods}x {resources}' - cluster['resources_str'] = resources_str - return list(clusters.values()), jobs_controllers, serve_controllers From 53380e26f01452559012d57b333b17f40dd8a4d1 Mon Sep 17 00:00:00 2001 From: yika-luo Date: Tue, 15 Oct 2024 15:21:11 -0700 Subject: [PATCH 51/93] [Performance] Use new GCP custom images (#4027) * [Performance] Use new custom image to create GCP GPU VMs * update image tags for both CPU and GPU * always generate .sky/python_path --------- Co-authored-by: Yika Luo --- sky/clouds/gcp.py | 12 +++++++++--- sky/skylet/constants.py | 4 ++-- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index b1015c92979..bde6abcb48f 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -94,6 +94,12 @@ f'\nTo query common AI images: {colorama.Style.BRIGHT}gcloud compute images list --project deeplearning-platform-release | less{colorama.Style.RESET_ALL}' ) +# Image ID tags +_DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu-2204' +# For GPU-related package version, see sky/clouds/service_catalog/images/provisioners/cuda.sh +_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-2204' +_DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-debian-10' + def _run_output(cmd): proc = subprocess.run(cmd, @@ -422,7 +428,7 @@ def make_deploy_resources_variables( # --no-standard-images # We use the debian image, as the ubuntu image has some connectivity # issue when first booted. - image_id = 'skypilot:cpu-debian-11' + image_id = _DEFAULT_CPU_IMAGE_ID def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: if (r.disk_tier is not None and @@ -487,10 +493,10 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: # Though the image is called cu113, it actually has later # versions of CUDA as noted below. # CUDA driver version 470.57.02, CUDA Library 11.4 - image_id = 'skypilot:k80-debian-10' + image_id = _DEFAULT_GPU_K80_IMAGE_ID else: # CUDA driver version 535.86.10, CUDA Library 12.2 - image_id = 'skypilot:gpu-debian-11' + image_id = _DEFAULT_GPU_IMAGE_ID if (resources.image_id is not None and resources.extract_docker_image() is None): diff --git a/sky/skylet/constants.py b/sky/skylet/constants.py index 5729d75c968..032ad5d25b1 100644 --- a/sky/skylet/constants.py +++ b/sky/skylet/constants.py @@ -155,8 +155,8 @@ # We use --system-site-packages to reuse the system site packages to avoid # the overhead of installing the same packages in the new environment. 
f'[ -d {SKY_REMOTE_PYTHON_ENV} ] || ' - f'{{ {SKY_PYTHON_CMD} -m venv {SKY_REMOTE_PYTHON_ENV} --system-site-packages && ' - f'echo "$(echo {SKY_REMOTE_PYTHON_ENV})/bin/python" > {SKY_PYTHON_PATH_FILE}; }};' + f'{SKY_PYTHON_CMD} -m venv {SKY_REMOTE_PYTHON_ENV} --system-site-packages;' + f'echo "$(echo {SKY_REMOTE_PYTHON_ENV})/bin/python" > {SKY_PYTHON_PATH_FILE};' ) _sky_version = str(version.parse(sky.__version__)) From 724e97e54cbe61ec3457aa3d599e046d05372fff Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 16 Oct 2024 18:51:01 -0700 Subject: [PATCH 52/93] [GCP] Add H100 mega (#4099) * Add H100 mega support on GCP * fix for some other regions * format * fix resource type * fix catalog fetching --- sky/clouds/gcp.py | 2 +- .../data_fetchers/fetch_gcp.py | 23 ++++++++++++++----- sky/clouds/service_catalog/gcp_catalog.py | 3 +++ 3 files changed, 21 insertions(+), 7 deletions(-) diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index bde6abcb48f..bac297a92c3 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -483,7 +483,7 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: if acc in ('A100-80GB', 'L4'): # A100-80GB and L4 have a different name pattern. resources_vars['gpu'] = f'nvidia-{acc.lower()}' - elif acc == 'H100': + elif acc in ('H100', 'H100-MEGA'): resources_vars['gpu'] = f'nvidia-{acc.lower()}-80gb' else: resources_vars['gpu'] = 'nvidia-tesla-{}'.format( diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py index eb69695aa55..097efe74deb 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py @@ -419,6 +419,11 @@ def _get_gpus_for_zone(zone: str) -> 'pd.DataFrame': if count != 8: # H100 only has 8 cards. continue + if 'H100-MEGA-80GB' in gpu_name: + gpu_name = 'H100-MEGA' + if count != 8: + # H100-MEGA only has 8 cards. + continue if 'VWS' in gpu_name: continue if gpu_name.startswith('TPU-'): @@ -447,6 +452,7 @@ def _gpu_info_from_name(name: str) -> Optional[Dict[str, List[Dict[str, Any]]]]: 'A100-80GB': 80 * 1024, 'A100': 40 * 1024, 'H100': 80 * 1024, + 'H100-MEGA': 80 * 1024, 'P4': 8 * 1024, 'T4': 16 * 1024, 'V100': 16 * 1024, @@ -491,12 +497,17 @@ def get_gpu_price(row: pd.Series, spot: bool) -> Optional[float]: if sku['category']['usageType'] != ondemand_or_spot: continue - gpu_name = row['AcceleratorName'] - if gpu_name == 'A100-80GB': - gpu_name = 'A100 80GB' - if gpu_name == 'H100': - gpu_name = 'H100 80GB' - if f'{gpu_name} GPU' not in sku['description']: + gpu_names = [row['AcceleratorName']] + if gpu_names[0] == 'A100-80GB': + gpu_names = ['A100 80GB'] + if gpu_names[0] == 'H100': + gpu_names = ['H100 80GB'] + if gpu_names[0] == 'H100-MEGA': + # Seems that H100-MEGA has two different descriptions in SKUs in + # different regions: 'H100 80GB Mega' and 'H100 80GB Plus'. 
+ gpu_names = ['H100 80GB Mega', 'H100 80GB Plus'] + if not any(f'{gpu_name} GPU' in sku['description'] + for gpu_name in gpu_names): continue unit_price = _get_unit_price(sku) diff --git a/sky/clouds/service_catalog/gcp_catalog.py b/sky/clouds/service_catalog/gcp_catalog.py index f861b51920e..c9e15f602dc 100644 --- a/sky/clouds/service_catalog/gcp_catalog.py +++ b/sky/clouds/service_catalog/gcp_catalog.py @@ -98,6 +98,9 @@ }, 'H100': { 8: ['a3-highgpu-8g'], + }, + 'H100-MEGA': { + 8: ['a3-megagpu-8g'], } } From 3e98afe8f96531ffb6332679083918f7ad67481e Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Wed, 16 Oct 2024 20:32:38 -0700 Subject: [PATCH 53/93] [GCP] Add gVNIC support (#4095) * add gvnic support through config.yaml * lint * docs --- docs/source/reference/config.rst | 9 +++++++++ sky/clouds/gcp.py | 5 +++++ sky/provision/gcp/config.py | 6 +++++- sky/templates/gcp-ray.yml.j2 | 3 +++ sky/utils/schemas.py | 3 +++ 5 files changed, 25 insertions(+), 1 deletion(-) diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst index 5c52e7487b9..b8255b46402 100644 --- a/docs/source/reference/config.rst +++ b/docs/source/reference/config.rst @@ -419,6 +419,15 @@ Available fields and semantics: # Default: 'LOCAL_CREDENTIALS'. remote_identity: LOCAL_CREDENTIALS + # Enable gVNIC (optional). + # + # Set to true to use gVNIC on GCP instances. gVNIC offers higher performance + # for multi-node clusters, but costs more. + # Reference: https://cloud.google.com/compute/docs/networking/using-gvnic + # + # Default: false. + enable_gvnic: false + # Advanced Azure configurations (optional). # Apply to all new instances but not existing ones. azure: diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index bac297a92c3..0e02f9fd456 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -546,6 +546,11 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: resources_vars[ 'force_enable_external_ips'] = skypilot_config.get_nested( ('gcp', 'force_enable_external_ips'), False) + + # Add gVNIC from config + resources_vars['enable_gvnic'] = skypilot_config.get_nested( + ('gcp', 'enable_gvnic'), False) + return resources_vars def _get_feasible_launchable_resources( diff --git a/sky/provision/gcp/config.py b/sky/provision/gcp/config.py index 416f0c1a694..a8292669a7c 100644 --- a/sky/provision/gcp/config.py +++ b/sky/provision/gcp/config.py @@ -670,8 +670,12 @@ def _configure_subnet(region: str, cluster_name: str, 'accessConfigs': [{ 'name': 'External NAT', 'type': 'ONE_TO_ONE_NAT', - }], + }] }] + # Add gVNIC if specified in config + enable_gvnic = config.provider_config.get('enable_gvnic', False) + if enable_gvnic: + default_interfaces[0]['nicType'] = 'gVNIC' enable_external_ips = _enable_external_ips(config) if not enable_external_ips: # Removing this key means the VM will not be assigned an external IP. diff --git a/sky/templates/gcp-ray.yml.j2 b/sky/templates/gcp-ray.yml.j2 index 5f06eef05c7..f3e6232d5d8 100644 --- a/sky/templates/gcp-ray.yml.j2 +++ b/sky/templates/gcp-ray.yml.j2 @@ -64,6 +64,9 @@ provider: # leakage. 
disable_launch_config_check: true use_managed_instance_group: {{ gcp_use_managed_instance_group }} +{%- if enable_gvnic %} + enable_gvnic: {{ enable_gvnic }} +{%- endif %} auth: ssh_user: gcpuser diff --git a/sky/utils/schemas.py b/sky/utils/schemas.py index 6e752f73ebc..94a6ed690e1 100644 --- a/sky/utils/schemas.py +++ b/sky/utils/schemas.py @@ -755,6 +755,9 @@ def get_config_schema(): 'force_enable_external_ips': { 'type': 'boolean' }, + 'enable_gvnic': { + 'type': 'boolean' + }, **_LABELS_SCHEMA, **_NETWORK_CONFIG_SCHEMA, }, From c2e12af665c76c2ef95c521d114b40eb095fa0d9 Mon Sep 17 00:00:00 2001 From: Kote Mushegiani Date: Thu, 17 Oct 2024 09:31:15 -0700 Subject: [PATCH 54/93] [Lambda] Lambda Cloud SkyPilot provisioner (#3865) * feat: lambda cloud new provisioner * feat: address cblmemo reviews and other reviews + make multi-node work again * fix: quotes * fix: address some reviews * chore: rm unused option * chore: update typedef * feat: use lists directly * fix: formatting * chore: address reviews * fix: formatting * chore: rm query ports since default impl per review * feat: add back query ports * fix: formatting * chore: add newline at eof * feat: try removing query ports again --- sky/authentication.py | 2 +- sky/clouds/lambda_cloud.py | 5 +- sky/provision/__init__.py | 3 + sky/provision/lambda_cloud/__init__.py | 11 + sky/provision/lambda_cloud/config.py | 10 + sky/provision/lambda_cloud/instance.py | 261 ++++++++++++++ .../lambda_cloud}/lambda_utils.py | 37 +- sky/setup_files/MANIFEST.in | 1 - sky/skylet/providers/lambda_cloud/__init__.py | 2 - .../providers/lambda_cloud/node_provider.py | 320 ------------------ sky/templates/lambda-ray.yml.j2 | 45 +-- 11 files changed, 316 insertions(+), 381 deletions(-) create mode 100644 sky/provision/lambda_cloud/__init__.py create mode 100644 sky/provision/lambda_cloud/config.py create mode 100644 sky/provision/lambda_cloud/instance.py rename sky/{clouds/utils => provision/lambda_cloud}/lambda_utils.py (92%) delete mode 100644 sky/skylet/providers/lambda_cloud/__init__.py delete mode 100644 sky/skylet/providers/lambda_cloud/node_provider.py diff --git a/sky/authentication.py b/sky/authentication.py index eb51aad02ad..41a7d02dfb7 100644 --- a/sky/authentication.py +++ b/sky/authentication.py @@ -43,9 +43,9 @@ from sky.adaptors import ibm from sky.adaptors import kubernetes from sky.adaptors import runpod -from sky.clouds.utils import lambda_utils from sky.provision.fluidstack import fluidstack_utils from sky.provision.kubernetes import utils as kubernetes_utils +from sky.provision.lambda_cloud import lambda_utils from sky.utils import common_utils from sky.utils import kubernetes_enums from sky.utils import subprocess_utils diff --git a/sky/clouds/lambda_cloud.py b/sky/clouds/lambda_cloud.py index d3d20fbd41a..d2573ebbb29 100644 --- a/sky/clouds/lambda_cloud.py +++ b/sky/clouds/lambda_cloud.py @@ -8,7 +8,7 @@ from sky import clouds from sky import status_lib from sky.clouds import service_catalog -from sky.clouds.utils import lambda_utils +from sky.provision.lambda_cloud import lambda_utils from sky.utils import resources_utils if typing.TYPE_CHECKING: @@ -48,6 +48,9 @@ class Lambda(clouds.Cloud): clouds.CloudImplementationFeatures.HOST_CONTROLLERS: f'Host controllers are not supported in {_REPR}.', } + PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT + STATUS_VERSION = clouds.StatusVersion.SKYPILOT + @classmethod def _unsupported_features_for_resources( cls, resources: 'resources_lib.Resources' diff --git a/sky/provision/__init__.py 
b/sky/provision/__init__.py index 41d985ade41..bbe92b68c3a 100644 --- a/sky/provision/__init__.py +++ b/sky/provision/__init__.py @@ -19,6 +19,7 @@ from sky.provision import fluidstack from sky.provision import gcp from sky.provision import kubernetes +from sky.provision import lambda_cloud from sky.provision import runpod from sky.provision import vsphere from sky.utils import command_runner @@ -39,6 +40,8 @@ def _wrapper(*args, **kwargs): provider_name = kwargs.pop('provider_name') module_name = provider_name.lower() + if module_name == 'lambda': + module_name = 'lambda_cloud' module = globals().get(module_name) assert module is not None, f'Unknown provider: {module_name}' diff --git a/sky/provision/lambda_cloud/__init__.py b/sky/provision/lambda_cloud/__init__.py new file mode 100644 index 00000000000..4992df4531b --- /dev/null +++ b/sky/provision/lambda_cloud/__init__.py @@ -0,0 +1,11 @@ +"""Lambda provisioner for SkyPilot.""" + +from sky.provision.lambda_cloud.config import bootstrap_instances +from sky.provision.lambda_cloud.instance import cleanup_ports +from sky.provision.lambda_cloud.instance import get_cluster_info +from sky.provision.lambda_cloud.instance import open_ports +from sky.provision.lambda_cloud.instance import query_instances +from sky.provision.lambda_cloud.instance import run_instances +from sky.provision.lambda_cloud.instance import stop_instances +from sky.provision.lambda_cloud.instance import terminate_instances +from sky.provision.lambda_cloud.instance import wait_instances diff --git a/sky/provision/lambda_cloud/config.py b/sky/provision/lambda_cloud/config.py new file mode 100644 index 00000000000..3066e7747fd --- /dev/null +++ b/sky/provision/lambda_cloud/config.py @@ -0,0 +1,10 @@ +"""Lambda Cloud configuration bootstrapping""" + +from sky.provision import common + + +def bootstrap_instances( + region: str, cluster_name: str, + config: common.ProvisionConfig) -> common.ProvisionConfig: + del region, cluster_name # unused + return config diff --git a/sky/provision/lambda_cloud/instance.py b/sky/provision/lambda_cloud/instance.py new file mode 100644 index 00000000000..d10c36496ab --- /dev/null +++ b/sky/provision/lambda_cloud/instance.py @@ -0,0 +1,261 @@ +"""Lambda instance provisioning.""" + +import time +from typing import Any, Dict, List, Optional + +from sky import authentication as auth +from sky import sky_logging +from sky import status_lib +from sky.provision import common +import sky.provision.lambda_cloud.lambda_utils as lambda_utils +from sky.utils import common_utils +from sky.utils import ux_utils + +POLL_INTERVAL = 1 + +logger = sky_logging.init_logger(__name__) +_lambda_client = None + + +def _get_lambda_client(): + global _lambda_client + if _lambda_client is None: + _lambda_client = lambda_utils.LambdaCloudClient() + return _lambda_client + + +def _filter_instances( + cluster_name_on_cloud: str, + status_filters: Optional[List[str]]) -> Dict[str, Dict[str, Any]]: + lambda_client = _get_lambda_client() + instances = lambda_client.list_instances() + possible_names = [ + f'{cluster_name_on_cloud}-head', + f'{cluster_name_on_cloud}-worker', + ] + + filtered_instances = {} + for instance in instances: + if (status_filters is not None and + instance['status'] not in status_filters): + continue + if instance.get('name') in possible_names: + filtered_instances[instance['id']] = instance + return filtered_instances + + +def _get_head_instance_id(instances: Dict[str, Any]) -> Optional[str]: + head_instance_id = None + for instance_id, instance in 
instances.items(): + if instance['name'].endswith('-head'): + head_instance_id = instance_id + break + return head_instance_id + + +def _get_ssh_key_name(prefix: str = '') -> str: + lambda_client = _get_lambda_client() + _, public_key_path = auth.get_or_generate_keys() + with open(public_key_path, 'r', encoding='utf-8') as f: + public_key = f.read() + name, exists = lambda_client.get_unique_ssh_key_name(prefix, public_key) + if not exists: + raise lambda_utils.LambdaCloudError('SSH key not found') + return name + + +def run_instances(region: str, cluster_name_on_cloud: str, + config: common.ProvisionConfig) -> common.ProvisionRecord: + """Runs instances for the given cluster""" + lambda_client = _get_lambda_client() + pending_status = ['booting'] + while True: + instances = _filter_instances(cluster_name_on_cloud, pending_status) + if not instances: + break + logger.info(f'Waiting for {len(instances)} instances to be ready.') + time.sleep(POLL_INTERVAL) + exist_instances = _filter_instances(cluster_name_on_cloud, ['active']) + head_instance_id = _get_head_instance_id(exist_instances) + + to_start_count = config.count - len(exist_instances) + if to_start_count < 0: + raise RuntimeError( + f'Cluster {cluster_name_on_cloud} already has ' + f'{len(exist_instances)} nodes, but {config.count} are required.') + if to_start_count == 0: + if head_instance_id is None: + raise RuntimeError( + f'Cluster {cluster_name_on_cloud} has no head node.') + logger.info(f'Cluster {cluster_name_on_cloud} already has ' + f'{len(exist_instances)} nodes, no need to start more.') + return common.ProvisionRecord( + provider_name='lambda', + cluster_name=cluster_name_on_cloud, + region=region, + zone=None, + head_instance_id=head_instance_id, + resumed_instance_ids=[], + created_instance_ids=[], + ) + + created_instance_ids = [] + ssh_key_name = _get_ssh_key_name() + + def launch_nodes(node_type: str, quantity: int) -> List[str]: + try: + instance_ids = lambda_client.create_instances( + instance_type=config.node_config['InstanceType'], + region=region, + name=f'{cluster_name_on_cloud}-{node_type}', + quantity=quantity, + ssh_key_name=ssh_key_name, + ) + logger.info(f'Launched {len(instance_ids)} {node_type} node(s), ' + f'instance_ids: {instance_ids}') + return instance_ids + except Exception as e: + logger.warning(f'run_instances error: {e}') + raise + + if head_instance_id is None: + instance_ids = launch_nodes('head', 1) + assert len(instance_ids) == 1 + created_instance_ids.append(instance_ids[0]) + head_instance_id = instance_ids[0] + + assert head_instance_id is not None, 'head_instance_id should not be None' + + worker_node_count = to_start_count - 1 + if worker_node_count > 0: + instance_ids = launch_nodes('worker', worker_node_count) + created_instance_ids.extend(instance_ids) + + while True: + instances = _filter_instances(cluster_name_on_cloud, ['active']) + if len(instances) == config.count: + break + + time.sleep(POLL_INTERVAL) + + return common.ProvisionRecord( + provider_name='lambda', + cluster_name=cluster_name_on_cloud, + region=region, + zone=None, + head_instance_id=head_instance_id, + resumed_instance_ids=[], + created_instance_ids=created_instance_ids, + ) + + +def wait_instances(region: str, cluster_name_on_cloud: str, + state: Optional[status_lib.ClusterStatus]) -> None: + del region, cluster_name_on_cloud, state # Unused. 
+ + +def stop_instances( + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, + worker_only: bool = False, +) -> None: + raise NotImplementedError( + 'stop_instances is not supported for Lambda Cloud') + + +def terminate_instances( + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, + worker_only: bool = False, +) -> None: + """See sky/provision/__init__.py""" + del provider_config + lambda_client = _get_lambda_client() + instances = _filter_instances(cluster_name_on_cloud, None) + + instance_ids_to_terminate = [] + for instance_id, instance in instances.items(): + if worker_only and not instance['name'].endswith('-worker'): + continue + instance_ids_to_terminate.append(instance_id) + + try: + logger.debug( + f'Terminating instances {", ".join(instance_ids_to_terminate)}') + lambda_client.remove_instances(instance_ids_to_terminate) + except Exception as e: # pylint: disable=broad-except + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + f'Failed to terminate instances {instance_ids_to_terminate}: ' + f'{common_utils.format_exception(e, use_bracket=False)}') from e + + +def get_cluster_info( + region: str, + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, +) -> common.ClusterInfo: + del region # unused + running_instances = _filter_instances(cluster_name_on_cloud, ['active']) + instances: Dict[str, List[common.InstanceInfo]] = {} + head_instance_id = None + for instance_id, instance_info in running_instances.items(): + instances[instance_id] = [ + common.InstanceInfo( + instance_id=instance_id, + internal_ip=instance_info['private_ip'], + external_ip=instance_info['ip'], + ssh_port=22, + tags={}, + ) + ] + if instance_info['name'].endswith('-head'): + head_instance_id = instance_id + + return common.ClusterInfo( + instances=instances, + head_instance_id=head_instance_id, + provider_name='lambda', + provider_config=provider_config, + ) + + +def query_instances( + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, + non_terminated_only: bool = True, +) -> Dict[str, Optional[status_lib.ClusterStatus]]: + """See sky/provision/__init__.py""" + assert provider_config is not None, (cluster_name_on_cloud, provider_config) + instances = _filter_instances(cluster_name_on_cloud, None) + + status_map = { + 'booting': status_lib.ClusterStatus.INIT, + 'active': status_lib.ClusterStatus.UP, + 'unhealthy': status_lib.ClusterStatus.INIT, + 'terminating': status_lib.ClusterStatus.INIT, + } + statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {} + for instance_id, instance in instances.items(): + status = status_map.get(instance['status']) + if non_terminated_only and status is None: + continue + statuses[instance_id] = status + return statuses + + +def open_ports( + cluster_name_on_cloud: str, + ports: List[str], + provider_config: Optional[Dict[str, Any]] = None, +) -> None: + raise NotImplementedError('open_ports is not supported for Lambda Cloud') + + +def cleanup_ports( + cluster_name_on_cloud: str, + ports: List[str], + provider_config: Optional[Dict[str, Any]] = None, +) -> None: + """See sky/provision/__init__.py""" + del cluster_name_on_cloud, ports, provider_config # Unused. 
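The new `sky/provision/lambda_cloud` package above plugs into SkyPilot's generic provisioner interface through the name remapping added in `sky/provision/__init__.py` earlier in this patch: the provider is called `lambda`, but `lambda` is a reserved keyword in Python, so the implementation module must be looked up as `lambda_cloud`. A minimal sketch of that dispatch, assuming an importlib-based lookup (the real wrapper resolves modules from `globals()` instead):

```python
import importlib
from typing import Any


def dispatch(provider_name: str, func_name: str, *args: Any,
             **kwargs: Any) -> Any:
    """Route a provisioner call to its per-cloud implementation module."""
    module_name = provider_name.lower()
    if module_name == 'lambda':
        # `lambda` is a Python keyword, so the module is named lambda_cloud.
        module_name = 'lambda_cloud'
    module = importlib.import_module(f'sky.provision.{module_name}')
    impl = getattr(module, func_name, None)
    assert impl is not None, f'Unknown provider: {provider_name}'
    return impl(*args, **kwargs)


# e.g. dispatch('lambda', 'query_instances', 'my-cluster', provider_config={})
# resolves to sky.provision.lambda_cloud.query_instances defined above.
```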
diff --git a/sky/clouds/utils/lambda_utils.py b/sky/provision/lambda_cloud/lambda_utils.py similarity index 92% rename from sky/clouds/utils/lambda_utils.py rename to sky/provision/lambda_cloud/lambda_utils.py index 61c4b33ebe9..339919e80e7 100644 --- a/sky/clouds/utils/lambda_utils.py +++ b/sky/provision/lambda_cloud/lambda_utils.py @@ -1,4 +1,5 @@ """Lambda Cloud helper functions.""" + import json import os import time @@ -76,7 +77,7 @@ def refresh(self, instance_ids: List[str]) -> None: def raise_lambda_error(response: requests.Response) -> None: - """Raise LambdaCloudError if appropriate. """ + """Raise LambdaCloudError if appropriate.""" status_code = response.status_code if status_code == 200: return @@ -131,20 +132,22 @@ def __init__(self) -> None: self.api_key = self._credentials['api_key'] self.headers = {'Authorization': f'Bearer {self.api_key}'} - def create_instances(self, - instance_type: str = 'gpu_1x_a100_sxm4', - region: str = 'us-east-1', - quantity: int = 1, - name: str = '', - ssh_key_name: str = '') -> List[str]: + def create_instances( + self, + instance_type: str = 'gpu_1x_a100_sxm4', + region: str = 'us-east-1', + quantity: int = 1, + name: str = '', + ssh_key_name: str = '', + ) -> List[str]: """Launch new instances.""" # Optimization: # Most API requests are rate limited at ~1 request every second but # launch requests are rate limited at ~1 request every 10 seconds. # So don't use launch requests to check availability. # See https://docs.lambdalabs.com/cloud/rate-limiting/ for more. - available_regions = self.list_catalog()[instance_type]\ - ['regions_with_capacity_available'] + available_regions = (self.list_catalog()[instance_type] + ['regions_with_capacity_available']) available_regions = [reg['name'] for reg in available_regions] if region not in available_regions: if len(available_regions) > 0: @@ -163,27 +166,25 @@ def create_instances(self, 'instance_type_name': instance_type, 'ssh_key_names': [ssh_key_name], 'quantity': quantity, - 'name': name + 'name': name, }) response = _try_request_with_backoff( 'post', f'{API_ENDPOINT}/instance-operations/launch', data=data, - headers=self.headers) + headers=self.headers, + ) return response.json().get('data', []).get('instance_ids', []) - def remove_instances(self, *instance_ids: str) -> Dict[str, Any]: + def remove_instances(self, instance_ids: List[str]) -> Dict[str, Any]: """Terminate instances.""" - data = json.dumps({ - 'instance_ids': [ - instance_ids[0] # TODO(ewzeng) don't hardcode - ] - }) + data = json.dumps({'instance_ids': instance_ids}) response = _try_request_with_backoff( 'post', f'{API_ENDPOINT}/instance-operations/terminate', data=data, - headers=self.headers) + headers=self.headers, + ) return response.json().get('data', []).get('terminated_instances', []) def list_instances(self) -> List[Dict[str, Any]]: diff --git a/sky/setup_files/MANIFEST.in b/sky/setup_files/MANIFEST.in index 54ab3b55a32..0cd93f485e0 100644 --- a/sky/setup_files/MANIFEST.in +++ b/sky/setup_files/MANIFEST.in @@ -6,7 +6,6 @@ include sky/setup_files/* include sky/skylet/*.sh include sky/skylet/LICENSE include sky/skylet/providers/ibm/* -include sky/skylet/providers/lambda_cloud/* include sky/skylet/providers/oci/* include sky/skylet/providers/scp/* include sky/skylet/providers/*.py diff --git a/sky/skylet/providers/lambda_cloud/__init__.py b/sky/skylet/providers/lambda_cloud/__init__.py deleted file mode 100644 index 64dac295eb5..00000000000 --- a/sky/skylet/providers/lambda_cloud/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ 
-"""Lambda Cloud node provider""" -from sky.skylet.providers.lambda_cloud.node_provider import LambdaNodeProvider diff --git a/sky/skylet/providers/lambda_cloud/node_provider.py b/sky/skylet/providers/lambda_cloud/node_provider.py deleted file mode 100644 index 557afe75568..00000000000 --- a/sky/skylet/providers/lambda_cloud/node_provider.py +++ /dev/null @@ -1,320 +0,0 @@ -import logging -import os -from threading import RLock -import time -from typing import Any, Dict, List, Optional - -from ray.autoscaler.node_provider import NodeProvider -from ray.autoscaler.tags import NODE_KIND_HEAD -from ray.autoscaler.tags import NODE_KIND_WORKER -from ray.autoscaler.tags import STATUS_UP_TO_DATE -from ray.autoscaler.tags import TAG_RAY_CLUSTER_NAME -from ray.autoscaler.tags import TAG_RAY_NODE_KIND -from ray.autoscaler.tags import TAG_RAY_NODE_NAME -from ray.autoscaler.tags import TAG_RAY_NODE_STATUS -from ray.autoscaler.tags import TAG_RAY_USER_NODE_TYPE - -from sky import authentication as auth -from sky.clouds.utils import lambda_utils -from sky.utils import command_runner -from sky.utils import common_utils -from sky.utils import subprocess_utils -from sky.utils import ux_utils - -_TAG_PATH_PREFIX = '~/.sky/generated/lambda_cloud/metadata' -_REMOTE_SSH_KEY_NAME = '~/.lambda_cloud/ssh_key_name' -_REMOTE_RAY_SSH_KEY = '~/ray_bootstrap_key.pem' -_REMOTE_RAY_YAML = '~/ray_bootstrap_config.yaml' -_GET_INTERNAL_IP_CMD = 's=$(ip -4 -br addr show | grep UP); echo "$s"; echo "$s" | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1])|104\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"' - -logger = logging.getLogger(__name__) - - -def synchronized(f): - - def wrapper(self, *args, **kwargs): - self.lock.acquire() - try: - return f(self, *args, **kwargs) - finally: - self.lock.release() - - return wrapper - - -class LambdaNodeProvider(NodeProvider): - """Node Provider for Lambda Cloud. - - This provider assumes Lambda Cloud credentials are set. - """ - - def __init__(self, provider_config: Dict[str, Any], - cluster_name: str) -> None: - NodeProvider.__init__(self, provider_config, cluster_name) - self.lock = RLock() - self.lambda_client = lambda_utils.LambdaCloudClient() - self.cached_nodes: Dict[str, Dict[str, Any]] = {} - self.metadata = lambda_utils.Metadata(_TAG_PATH_PREFIX, cluster_name) - self.ssh_key_path = os.path.expanduser(auth.PRIVATE_SSH_KEY_PATH) - - def _get_ssh_key_name(prefix: str) -> str: - public_key_path = os.path.expanduser(auth.PUBLIC_SSH_KEY_PATH) - with open(public_key_path, 'r') as f: - public_key = f.read() - name, exists = self.lambda_client.get_unique_ssh_key_name( - prefix, public_key) - if not exists: - raise lambda_utils.LambdaCloudError('SSH key not found') - return name - - ray_yaml_path = os.path.expanduser(_REMOTE_RAY_YAML) - self.on_head = (os.path.exists(ray_yaml_path) and - common_utils.read_yaml(ray_yaml_path)['cluster_name'] - == cluster_name) - - if self.on_head: - self.ssh_key_path = os.path.expanduser(_REMOTE_RAY_SSH_KEY) - ssh_key_name_path = os.path.expanduser(_REMOTE_SSH_KEY_NAME) - if os.path.exists(ssh_key_name_path): - with open(ssh_key_name_path, 'r') as f: - self.ssh_key_name = f.read() - else: - # At this point, `~/.ssh/sky-key.pub` contains the public - # key used to launch this cluster. Use it to determine - # ssh key name and store the name in _REMOTE_SSH_KEY_NAME. 
- # Note: this case only runs during cluster launch, so it is - # not possible for ~/.ssh/sky-key.pub to already be regenerated - # by the user. - self.ssh_key_name = _get_ssh_key_name('') - with open(ssh_key_name_path, 'w', encoding='utf-8') as f: - f.write(self.ssh_key_name) - else: - # On local - self.ssh_key_name = _get_ssh_key_name( - f'sky-key-{common_utils.get_user_hash()}') - - def _guess_and_add_missing_tags(self, vms: List[Dict[str, Any]]) -> None: - """Adds missing vms to local tag file and guesses their tags.""" - for node in vms: - if self.metadata.get(node['id']) is not None: - pass - elif node['name'] == f'{self.cluster_name}-head': - self.metadata.set( - node['id'], { - 'tags': { - TAG_RAY_CLUSTER_NAME: self.cluster_name, - TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE, - TAG_RAY_NODE_KIND: NODE_KIND_HEAD, - TAG_RAY_USER_NODE_TYPE: 'ray_head_default', - TAG_RAY_NODE_NAME: f'ray-{self.cluster_name}-head', - } - }) - elif node['name'] == f'{self.cluster_name}-worker': - self.metadata.set( - node['id'], { - 'tags': { - TAG_RAY_CLUSTER_NAME: self.cluster_name, - TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE, - TAG_RAY_NODE_KIND: NODE_KIND_WORKER, - TAG_RAY_USER_NODE_TYPE: 'ray_worker_default', - TAG_RAY_NODE_NAME: f'ray-{self.cluster_name}-worker', - } - }) - - def _list_instances_in_cluster(self) -> List[Dict[str, Any]]: - """List running instances in cluster.""" - vms = self.lambda_client.list_instances() - possible_names = [ - f'{self.cluster_name}-head', f'{self.cluster_name}-worker' - ] - return [node for node in vms if node.get('name') in possible_names] - - @synchronized - def _get_filtered_nodes(self, tag_filters: Dict[str, - str]) -> Dict[str, Any]: - - def _extract_metadata(vm: Dict[str, Any]) -> Dict[str, Any]: - metadata = {'id': vm['id'], 'status': vm['status'], 'tags': {}} - instance_info = self.metadata.get(vm['id']) - if instance_info is not None: - metadata['tags'] = instance_info['tags'] - metadata['external_ip'] = vm.get('ip') - return metadata - - def _match_tags(vm: Dict[str, Any]): - vm_info = self.metadata.get(vm['id']) - tags = {} if vm_info is None else vm_info['tags'] - for k, v in tag_filters.items(): - if tags.get(k) != v: - return False - return True - - def _get_internal_ip(node: Dict[str, Any]): - # TODO(ewzeng): cache internal ips in metadata file to reduce - # ssh overhead. - if node['external_ip'] is None or node['status'] != 'active': - node['internal_ip'] = None - return - runner = command_runner.SSHCommandRunner( - node=(node['external_ip'], 22), - ssh_user='ubuntu', - ssh_private_key=self.ssh_key_path) - rc, stdout, stderr = runner.run(_GET_INTERNAL_IP_CMD, - require_outputs=True, - stream_logs=False) - subprocess_utils.handle_returncode( - rc, - _GET_INTERNAL_IP_CMD, - 'Failed get obtain private IP from node', - stderr=stdout + stderr) - node['internal_ip'] = stdout.strip() - - vms = self._list_instances_in_cluster() - self.metadata.refresh([node['id'] for node in vms]) - self._guess_and_add_missing_tags(vms) - nodes = [_extract_metadata(vm) for vm in filter(_match_tags, vms)] - nodes = [ - node for node in nodes - if node['status'] not in ['terminating', 'terminated'] - ] - subprocess_utils.run_in_parallel(_get_internal_ip, nodes) - self.cached_nodes = {node['id']: node for node in nodes} - return self.cached_nodes - - def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]: - """Return a list of node ids filtered by the specified tags dict. - - This list must not include terminated nodes. 
For performance reasons, - providers are allowed to cache the result of a call to - non_terminated_nodes() to serve single-node queries - (e.g. is_running(node_id)). This means that non_terminated_nodes() must - be called again to refresh results. - - Examples: - >>> provider.non_terminated_nodes({TAG_RAY_NODE_KIND: "worker"}) - ["node-1", "node-2"] - """ - nodes = self._get_filtered_nodes(tag_filters=tag_filters) - return [k for k, _ in nodes.items()] - - def is_running(self, node_id: str) -> bool: - """Return whether the specified node is running.""" - return self._get_cached_node(node_id=node_id) is not None - - def is_terminated(self, node_id: str) -> bool: - """Return whether the specified node is terminated.""" - return self._get_cached_node(node_id=node_id) is None - - def node_tags(self, node_id: str) -> Dict[str, str]: - """Returns the tags of the given node (string dict).""" - node = self._get_cached_node(node_id=node_id) - if node is None: - return {} - return node['tags'] - - def external_ip(self, node_id: str) -> Optional[str]: - """Returns the external ip of the given node.""" - node = self._get_cached_node(node_id=node_id) - if node is None: - return None - ip = node.get('external_ip') - with ux_utils.print_exception_no_traceback(): - if ip is None: - raise lambda_utils.LambdaCloudError( - 'A node ip address was not found. Either ' - '(1) Lambda Cloud has internally errored, or ' - '(2) the cluster is still booting. ' - 'You can manually terminate the cluster on the ' - 'Lambda Cloud console or (in case 2) wait for ' - 'booting to finish (~2 minutes).') - return ip - - def internal_ip(self, node_id: str) -> Optional[str]: - """Returns the internal ip (Ray ip) of the given node.""" - node = self._get_cached_node(node_id=node_id) - if node is None: - return None - ip = node.get('internal_ip') - with ux_utils.print_exception_no_traceback(): - if ip is None: - raise lambda_utils.LambdaCloudError( - 'A node ip address was not found. Either ' - '(1) Lambda Cloud has internally errored, or ' - '(2) the cluster is still booting. ' - 'You can manually terminate the cluster on the ' - 'Lambda Cloud console or (in case 2) wait for ' - 'booting to finish (~2 minutes).') - return ip - - def create_node(self, node_config: Dict[str, Any], tags: Dict[str, str], - count: int) -> None: - """Creates a number of nodes within the namespace.""" - # Get tags - config_tags = node_config.get('tags', {}).copy() - config_tags.update(tags) - config_tags[TAG_RAY_CLUSTER_NAME] = self.cluster_name - - # Create nodes - instance_type = node_config['InstanceType'] - region = self.provider_config['region'] - - if config_tags[TAG_RAY_NODE_KIND] == NODE_KIND_HEAD: - name = f'{self.cluster_name}-head' - # Occasionally, the head node will continue running for a short - # period after termination. This can lead to the following bug: - # 1. Head node autodowns but continues running. - # 2. The next autodown event is triggered, which executes ray up. - # 3. Head node stops running. - # In this case, a new head node is created after the cluster has - # terminated. We avoid this with the following check: - if self.on_head: - raise lambda_utils.LambdaCloudError('Head already exists.') - else: - name = f'{self.cluster_name}-worker' - - # Lambda launch api only supports launching one node at a time, - # so we do a loop. 
Remove loop when launch api allows quantity > 1 - booting_list = [] - for _ in range(count): - vm_id = self.lambda_client.create_instances( - instance_type=instance_type, - region=region, - quantity=1, - name=name, - ssh_key_name=self.ssh_key_name)[0] - self.metadata.set(vm_id, {'tags': config_tags}) - booting_list.append(vm_id) - time.sleep(10) # Avoid api rate limits - - # Wait for nodes to finish booting - while True: - vms = self._list_instances_in_cluster() - for vm_id in booting_list.copy(): - for vm in vms: - if vm['id'] == vm_id and vm['status'] == 'active': - booting_list.remove(vm_id) - if len(booting_list) == 0: - return - time.sleep(10) - - @synchronized - def set_node_tags(self, node_id: str, tags: Dict[str, str]) -> None: - """Sets the tag values (string dict) for the specified node.""" - node = self._get_node(node_id) - assert node is not None, node_id - node['tags'].update(tags) - self.metadata.set(node_id, {'tags': node['tags']}) - - def terminate_node(self, node_id: str) -> None: - """Terminates the specified node.""" - self.lambda_client.remove_instances(node_id) - self.metadata.set(node_id, None) - - def _get_node(self, node_id: str) -> Optional[Dict[str, Any]]: - self._get_filtered_nodes({}) # Side effect: updates cache - return self.cached_nodes.get(node_id, None) - - def _get_cached_node(self, node_id: str) -> Optional[Dict[str, Any]]: - if node_id in self.cached_nodes: - return self.cached_nodes[node_id] - return self._get_node(node_id=node_id) diff --git a/sky/templates/lambda-ray.yml.j2 b/sky/templates/lambda-ray.yml.j2 index 6b6d94cfb3c..c4b8dba1a9f 100644 --- a/sky/templates/lambda-ray.yml.j2 +++ b/sky/templates/lambda-ray.yml.j2 @@ -7,7 +7,7 @@ idle_timeout_minutes: 60 provider: type: external - module: sky.skylet.providers.lambda_cloud.LambdaNodeProvider + module: sky.provision.lambda region: {{region}} # Disable launch config check for worker nodes as it can cause resource # leakage. @@ -25,14 +25,6 @@ available_node_types: resources: {} node_config: InstanceType: {{instance_type}} -{% if num_nodes > 1 %} - ray_worker_default: - min_workers: {{num_nodes - 1}} - max_workers: {{num_nodes - 1}} - resources: {} - node_config: - InstanceType: {{instance_type}} -{%- endif %} head_node_type: ray_head_default @@ -64,7 +56,10 @@ setup_commands: # Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase. # Line 'mkdir -p ..': disable host key check # Line 'python3 -c ..': patch the buggy ray files and enable `-o allow_other` option for `goofys` - - sudo systemctl stop unattended-upgrades || true; + - {%- for initial_setup_command in initial_setup_commands %} + {{ initial_setup_command }} + {%- endfor %} + sudo systemctl stop unattended-upgrades || true; sudo systemctl disable unattended-upgrades || true; sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true; sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1` || true; @@ -81,31 +76,5 @@ setup_commands: mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; [ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf'); -# Command to start ray on the head node. You don't need to change this. -# NOTE: these are very performance-sensitive. 
Each new item opens/closes an SSH
-# connection, which is expensive. Try your best to co-locate commands into fewer
-# items! The same comment applies for worker_start_ray_commands.
-#
-# Increment the following for catching performance bugs easier:
-# current num items (num SSH connections): 2
-head_start_ray_commands:
-  - {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --head --port={{ray_port}} --min-worker-port 11002 --dashboard-port={{ray_dashboard_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
-    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
-    {{dump_port_command}}; {{ray_head_wait_initialized_command}}
-
-{%- if num_nodes > 1 %}
-worker_start_ray_commands:
-  - {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --address=$RAY_HEAD_IP:{{ray_port}} --min-worker-port 11002 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
-    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
-{%- else %}
-worker_start_ray_commands: []
-{%- endif %}
-
-head_node: {}
-worker_nodes: {}
-
-# These fields are required for external cloud providers.
-head_setup_commands: []
-worker_setup_commands: []
-cluster_synced_files: []
-file_mounts_sync_continuously: False
+# Commands to start Ray clusters are now placed in `sky.provision.instance_setup`.
+# We do not need to list them here anymore.

From 93e2b567b39df41e03ab7075e35e9f2d693f62cf Mon Sep 17 00:00:00 2001
From: Romil Bhardwaj
Date: Thu, 17 Oct 2024 11:33:35 -0700
Subject: [PATCH 55/93] [Docs] GKE Nvidia Driver installation instructions update (#4106)

* docs

* docs

* docs
---
 .../reference/kubernetes/kubernetes-deployment.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/source/reference/kubernetes/kubernetes-deployment.rst b/docs/source/reference/kubernetes/kubernetes-deployment.rst
index d7e7127f6e7..e9489e9149e 100644
--- a/docs/source/reference/kubernetes/kubernetes-deployment.rst
+++ b/docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -114,9 +114,9 @@ Deploying on Google Cloud GKE
       # Example:
       # gcloud container clusters get-credentials testcluster --region us-central1-c
 
-3. [If using GPUs] If your GKE nodes have GPUs, you may need to
-   `manually install <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_
-   nvidia drivers. You can do so by deploying the daemonset
+3. [If using GPUs] For GKE versions newer than 1.30.1-gke.115600, NVIDIA drivers are pre-installed and no additional setup is required. If you are using an older GKE version, you may need to
+   `manually install <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_
+   NVIDIA drivers for GPU support. You can do so by deploying the daemonset
    depending on the GPU and OS on your nodes:
 
    .. code-block:: console
@@ -133,7 +133,8 @@ Deploying on Google Cloud GKE
      # For Ubuntu based nodes with L4 GPUs:
      $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml
 
-   To verify if GPU drivers are set up, run ``kubectl describe nodes`` and verify that ``nvidia.com/gpu`` is listed under the ``Capacity`` section.
+   .. 
tip:: + To verify if GPU drivers are set up, run ``kubectl describe nodes`` and verify that ``nvidia.com/gpu`` resource is listed under the ``Capacity`` section. 4. Verify your kubernetes cluster is correctly set up for SkyPilot by running :code:`sky check`: From 5dc70e81eaf24aeb6ba2020c3bfdb8eefcbb0604 Mon Sep 17 00:00:00 2001 From: yika-luo Date: Thu, 17 Oct 2024 13:01:22 -0700 Subject: [PATCH 56/93] [Performance] Use new AWS custom images (#4091) --- sky/clouds/aws.py | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py index 2207a977f25..a0962b17cac 100644 --- a/sky/clouds/aws.py +++ b/sky/clouds/aws.py @@ -32,6 +32,14 @@ logger = sky_logging.init_logger(__name__) +# Image ID tags +_DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu' +# For GPU-related package version, +# see sky/clouds/service_catalog/images/provisioners/cuda.sh +_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu' +_DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-ubuntu-2004' +_DEFAULT_NEURON_IMAGE_ID = 'skypilot:neuron-ubuntu-2204' + # This local file (under ~/.aws/) will be uploaded to remote nodes (any # cloud), if all of the following conditions hold: # - the current user identity is not using AWS SSO @@ -217,17 +225,20 @@ def zones_provision_loop( @classmethod def _get_default_ami(cls, region_name: str, instance_type: str) -> str: acc = cls.get_accelerators_from_instance_type(instance_type) - image_id = service_catalog.get_image_id_from_tag( - 'skypilot:gpu-ubuntu-2004', region_name, clouds='aws') + image_id = service_catalog.get_image_id_from_tag(_DEFAULT_CPU_IMAGE_ID, + region_name, + clouds='aws') if acc is not None: + image_id = service_catalog.get_image_id_from_tag( + _DEFAULT_GPU_IMAGE_ID, region_name, clouds='aws') assert len(acc) == 1, acc acc_name = list(acc.keys())[0] if acc_name == 'K80': image_id = service_catalog.get_image_id_from_tag( - 'skypilot:k80-ubuntu-2004', region_name, clouds='aws') + _DEFAULT_GPU_K80_IMAGE_ID, region_name, clouds='aws') if acc_name in ['Trainium', 'Inferentia']: image_id = service_catalog.get_image_id_from_tag( - 'skypilot:neuron-ubuntu-2204', region_name, clouds='aws') + _DEFAULT_NEURON_IMAGE_ID, region_name, clouds='aws') if image_id is not None: return image_id # Raise ResourcesUnavailableError to make sure the failover in From 92fd1095a696063e2e8f81468f2a9a50a69bc4f3 Mon Sep 17 00:00:00 2001 From: Yika Date: Thu, 17 Oct 2024 18:18:44 -0700 Subject: [PATCH 57/93] [Performance] Add Packer image generation scripts for GCP and AWS (#4068) * [Performance] Add Packer image generation scripts for GCP and AWS * Add docker install and tests * solve nvidia container issue * Install cuDNN * [Performance] Scripts to copy/delete AWS images for all regions and add cloud deps (#4073) * [Performance] Add AWS script to copy images for all regions * script to delete all AWS images across regions * Add cloud dependencies to image --------- Co-authored-by: Yika Luo --- sky/clouds/service_catalog/images/README.md | 72 +++++++++ .../images/aws_utils/image_delete.py | 63 ++++++++ .../images/aws_utils/image_gen.py | 151 ++++++++++++++++++ .../service_catalog/images/plugins.pkr.hcl | 17 ++ .../images/provisioners/cloud.sh | 50 ++++++ .../images/provisioners/cuda.sh | 24 +++ .../images/provisioners/docker.sh | 22 +++ .../provisioners/nvidia-container-toolkit.sh | 26 +++ .../images/provisioners/skypilot.sh | 69 ++++++++ .../images/skypilot-aws-cpu-ubuntu.pkr.hcl | 47 ++++++ .../images/skypilot-aws-gpu-ubuntu.pkr.hcl | 55 +++++++ 
 .../images/skypilot-gcp-cpu-ubuntu.pkr.hcl    |  33 ++++
 .../images/skypilot-gcp-gpu-ubuntu.pkr.hcl    |  46 ++++++
 tests/test_smoke.py                           |  10 +-
 14 files changed, 680 insertions(+), 5 deletions(-)
 create mode 100644 sky/clouds/service_catalog/images/README.md
 create mode 100644 sky/clouds/service_catalog/images/aws_utils/image_delete.py
 create mode 100644 sky/clouds/service_catalog/images/aws_utils/image_gen.py
 create mode 100644 sky/clouds/service_catalog/images/plugins.pkr.hcl
 create mode 100644 sky/clouds/service_catalog/images/provisioners/cloud.sh
 create mode 100644 sky/clouds/service_catalog/images/provisioners/cuda.sh
 create mode 100644 sky/clouds/service_catalog/images/provisioners/docker.sh
 create mode 100644 sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh
 create mode 100644 sky/clouds/service_catalog/images/provisioners/skypilot.sh
 create mode 100644 sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl
 create mode 100644 sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl
 create mode 100644 sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl
 create mode 100644 sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl

diff --git a/sky/clouds/service_catalog/images/README.md b/sky/clouds/service_catalog/images/README.md
new file mode 100644
index 00000000000..31ce7c6d9ce
--- /dev/null
+++ b/sky/clouds/service_catalog/images/README.md
@@ -0,0 +1,72 @@
+# SkyPilot OS Image Generation Guide
+
+## Prerequisites
+You only need to do this once.
+1. Install [Packer](https://developer.hashicorp.com/packer/tutorials/aws-get-started/get-started-install-cli)
+2. Download plugins used by Packer
+```bash
+packer init plugins.pkr.hcl
+```
+3. Set up cloud credentials
+
+## Generate Images
+```bash
+export CLOUD=gcp # Update this
+export TYPE=gpu # Update this
+export IMAGE=skypilot-${CLOUD}-${TYPE}-ubuntu
+packer build ${IMAGE}.pkr.hcl
+```
+You will see the image ID after the build is complete.
+
+Approximate time to `packer build` an image:
+
+| Cloud | Type | Approx. Time |
+|-------|------|--------------|
+| AWS   | GPU  | 15 min       |
+| AWS   | CPU  | 10 min       |
+| GCP   | GPU  | 16 min       |
+| GCP   | CPU  | 5 min        |
+
+### GCP
+```bash
+export IMAGE_NAME=skypilot-gcp-cpu-ubuntu-20241011003407 # Update this
+
+# Make image public
+export IMAGE_ID=projects/sky-dev-465/global/images/${IMAGE_NAME}
+gcloud compute images add-iam-policy-binding ${IMAGE_NAME} --member='allAuthenticatedUsers' --role='roles/compute.imageUser'
+```
+
+### AWS
+1. Generate images for all regions
+```bash
+export IMAGE_ID=ami-0b31b24524afa8e47 # Update this
+
+python aws_utils/image_gen.py --image-id ${IMAGE_ID} --processor ${TYPE}
+```
+2. Add fallback images if any region failed \
+Look for "NEED_FALLBACK" in the output `images.csv` and edit. (You can use public [ubuntu images](https://cloud-images.ubuntu.com/locator/ec2/) as fallback.)
+
+## Test Images
+1. Minimal GPU test: `sky launch --image ${IMAGE_ID} --gpus=L4:1 --cloud ${CLOUD}` then run `nvidia-smi` in the launched instance.
+2. Update the image ID in `sky/clouds/gcp.py` and run the test:
+```bash
+pytest tests/test_smoke.py::test_minimal --gcp
+pytest tests/test_smoke.py::test_huggingface --gcp
+pytest tests/test_smoke.py::test_job_queue_with_docker --gcp
+pytest tests/test_smoke.py::test_cancel_gcp
+```
+
+## Ship Images & Cleanup
+Submit a PR to update [`SkyPilot Catalog`](https://github.com/skypilot-org/skypilot-catalog/tree/master/catalogs), then clean up the old images to avoid extra image storage fees.
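For the GCP cleanup in step 2 below, deleting from the console works, but the same can be scripted with `gcloud`; a sketch, assuming the `sky-dev-465` project used earlier in this README (the image name is a placeholder):

```bash
# List SkyPilot-built images, oldest first, to pick deletion candidates.
gcloud compute images list --project sky-dev-465 \
    --filter="name~'^skypilot-'" --sort-by=creationTimestamp

# Delete a superseded image by name (placeholder shown).
gcloud compute images delete skypilot-gcp-cpu-ubuntu-20241011003407 \
    --project sky-dev-465 --quiet
```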
+
+### GCP
+1. Example PR: [#86](https://github.com/skypilot-org/skypilot-catalog/pull/86)
+2. Go to console and delete old images.
+
+### AWS
+1. Copy the old custom image rows from Catalog's existing `images.csv` to a local `images.csv` in this folder.
+2. Update Catalog with new images. Example PR: [#89](https://github.com/skypilot-org/skypilot-catalog/pull/89)
+3. Delete AMIs across regions by running
+```bash
+python aws_utils/image_delete.py --tag ${TAG}
+```
diff --git a/sky/clouds/service_catalog/images/aws_utils/image_delete.py b/sky/clouds/service_catalog/images/aws_utils/image_delete.py
new file mode 100644
index 00000000000..52cbb5b2382
--- /dev/null
+++ b/sky/clouds/service_catalog/images/aws_utils/image_delete.py
@@ -0,0 +1,63 @@
+"""Delete all images with a given tag and their associated snapshots from images.csv
+
+Example Usage: put images.csv in the same folder as this script and run
+    python image_delete.py --tag skypilot:custom-gpu-ubuntu-2204
+"""
+
+import argparse
+import csv
+import json
+import subprocess
+
+parser = argparse.ArgumentParser(
+    description='Delete AWS images and their snapshots across regions.')
+parser.add_argument('--tag',
+                    required=True,
+                    help='Tag of the image to delete, see tags in images.csv')
+args = parser.parse_args()
+
+
+def get_snapshots(image_id, region):
+    cmd = f'aws ec2 describe-images --image-ids {image_id} --region {region} --query "Images[*].BlockDeviceMappings[*].Ebs.SnapshotId" --output json'
+    result = subprocess.run(cmd,
+                            shell=True,
+                            check=True,
+                            capture_output=True,
+                            text=True)
+    snapshots = json.loads(result.stdout)
+    return [
+        snapshot for sublist in snapshots for snapshot in sublist if snapshot
+    ]
+
+
+def delete_image_and_snapshots(image_id, region):
+    # Must get snapshots before deleting the image
+    snapshots = get_snapshots(image_id, region)
+
+    # Deregister the image
+    cmd = f'aws ec2 deregister-image --image-id {image_id} --region {region}'
+    subprocess.run(cmd, shell=True, check=True)
+    print(f"Deregistered image {image_id} in region {region}")
+
+    # Delete snapshots
+    for snapshot in snapshots:
+        cmd = f'aws ec2 delete-snapshot --snapshot-id {snapshot} --region {region}'
+        subprocess.run(cmd, shell=True, check=True)
+        print(f'Deleted snapshot {snapshot} in region {region}')
+
+
+def main():
+    with open('images.csv', 'r') as csvfile:
+        reader = csv.DictReader(csvfile)
+        for row in reader:
+            if row['Tag'] == args.tag:
+                try:
+                    delete_image_and_snapshots(row['ImageId'], row['Region'])
+                except subprocess.CalledProcessError as e:
+                    print(
+                        f'Failed to delete image {row["ImageId"]} or its snapshots in region {row["Region"]}: {e}'
+                    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/sky/clouds/service_catalog/images/aws_utils/image_gen.py b/sky/clouds/service_catalog/images/aws_utils/image_gen.py
new file mode 100644
index 00000000000..cb39355ad2c
--- /dev/null
+++ b/sky/clouds/service_catalog/images/aws_utils/image_gen.py
@@ -0,0 +1,151 @@
+"""Copy SkyPilot AMI to multiple regions, make them public, and generate images.csv
+
+Example Usage:
+    python image_gen.py --image-id ami-00000 --processor gpu
+"""
+
+import argparse
+import concurrent.futures
+import csv
+import json
+import os
+import subprocess
+import threading
+import time
+
+parser = argparse.ArgumentParser(
+    description='Generate AWS images across regions.')
+parser.add_argument('--image-id',
+                    required=True,
+                    help='The source AMI ID to copy from')
+parser.add_argument('--processor', required=True, help='e.g. 
gpu, cpu, etc.') +parser.add_argument('--region', + default='us-east-1', + help='Region of the source AMI') +parser.add_argument('--base-image-id', + default='ami-005fc0f236362e99f', + help='The base AMI of the source AMI.') +parser.add_argument('--os-type', default='ubuntu', help='The OS type') +parser.add_argument('--os-version', default='22.04', help='The OS version') +parser.add_argument('--output-csv', + default='images.csv', + help='The output CSV file name') +args = parser.parse_args() + +# 25 regions +ALL_REGIONS = [ + # 'us-east-1', # Source AMI is already in this region + 'us-east-2', + 'us-west-1', + 'us-west-2', + 'ca-central-1', + 'eu-central-1', # need for smoke test + 'eu-central-2', + 'eu-west-1', + 'eu-west-2', + 'eu-south-1', + 'eu-south-2', + 'eu-west-3', + 'eu-north-1', + 'me-south-1', + 'me-central-1', + 'af-south-1', + 'ap-east-1', + 'ap-south-1', + 'ap-south-2', + 'ap-northeast-3', + 'ap-northeast-2', + 'ap-southeast-1', + 'ap-southeast-2', + 'ap-southeast-3', + 'ap-northeast-1', +] + + +def make_image_public(image_id, region): + unblock_command = f"aws ec2 disable-image-block-public-access --region {region}" + subprocess.run(unblock_command, shell=True, check=True) + public_command = ( + f'aws ec2 modify-image-attribute --image-id {image_id} ' + f'--launch-permission "{{\\\"Add\\\": [{{\\\"Group\\\":\\\"all\\\"}}]}}" --region {region}' + ) + subprocess.run(public_command, shell=True, check=True) + print(f"Made {image_id} public") + + +def copy_image_and_make_public(target_region): + # Copy the AMI to the target region + copy_command = ( + f"aws ec2 copy-image --source-region {args.region} " + f"--source-image-id {args.image_id} --region {target_region} " + f"--name 'skypilot-aws-{args.processor}-{args.os_type}-{time.time()}' --output json" + ) + print(copy_command) + result = subprocess.run(copy_command, + shell=True, + check=True, + capture_output=True, + text=True) + print(result.stdout) + new_image_id = json.loads(result.stdout)['ImageId'] + print(f"Copied image to {target_region} with new image ID: {new_image_id}") + + # Wait for the image to be available + print(f"Waiting for {new_image_id} to be available...") + wait_command = f"aws ec2 wait image-available --image-ids {new_image_id} --region {target_region}" + subprocess.run(wait_command, shell=True, check=True) + + make_image_public(new_image_id, target_region) + + return new_image_id + + +def write_image_to_csv(image_id, region): + with open(args.output_csv, 'a', newline='', encoding='utf-8') as csvfile: + writer = csv.writer(csvfile) + row = [ + f'skypilot:custom-{args.processor}-{args.os_type}', region, + args.os_type, args.os_version, image_id, + time.strftime('%Y%m%d'), args.base_image_id + ] + writer.writerow(row) + print(f"Wrote to CSV: {row}") + + +def main(): + make_image_public(args.image_id, args.region) + if not os.path.exists(args.output_csv): + with open(args.output_csv, 'w', newline='') as csvfile: + writer = csv.writer(csvfile) + writer.writerow([ + 'Tag', 'Region', 'OS', 'OSVersion', 'ImageId', 'CreationDate', + 'BaseImageId' + ]) # Header + print(f"No existing {args.output_csv} so created it.") + + # Process other regions + image_cache = [(args.image_id, args.region)] + + def process_region(copy_to_region): + print(f"Start copying image to {copy_to_region}...") + try: + new_image_id = copy_image_and_make_public(copy_to_region) + except Exception as e: + print(f"Error generating image to {copy_to_region}: {str(e)}") + new_image_id = 'NEED_FALLBACK' + image_cache.append((new_image_id, 
copy_to_region))
+
+    with concurrent.futures.ThreadPoolExecutor() as executor:
+        executor.map(process_region, ALL_REGIONS)
+        executor.shutdown(wait=True)
+
+    # Sort the images by region and write to CSV
+    sorted_image_cache = sorted(image_cache, key=lambda x: x[1])
+    for new_image_id, copy_to_region in sorted_image_cache:
+        write_image_to_csv(new_image_id, copy_to_region)
+
+    print("All done!")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/sky/clouds/service_catalog/images/plugins.pkr.hcl b/sky/clouds/service_catalog/images/plugins.pkr.hcl
new file mode 100644
index 00000000000..e007c1723bf
--- /dev/null
+++ b/sky/clouds/service_catalog/images/plugins.pkr.hcl
@@ -0,0 +1,17 @@
+packer {
+  required_plugins {
+    amazon = {
+      version = ">= 1.2.8"
+      source  = "github.com/hashicorp/amazon"
+    }
+  }
+}
+
+packer {
+  required_plugins {
+    googlecompute = {
+      version = ">= 1.1.1"
+      source  = "github.com/hashicorp/googlecompute"
+    }
+  }
+}
diff --git a/sky/clouds/service_catalog/images/provisioners/cloud.sh b/sky/clouds/service_catalog/images/provisioners/cloud.sh
new file mode 100644
index 00000000000..b326c9fde51
--- /dev/null
+++ b/sky/clouds/service_catalog/images/provisioners/cloud.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+
+PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python
+
+# TODO: keep this dependency installation aligned with utils/controller_utils.py and setup.py
+install_azure() {
+  echo "Install cloud dependencies on controller: Azure"
+  $PYTHON_EXEC -m pip install "azure-cli>=2.31.0" azure-core "azure-identity>=1.13.0" azure-mgmt-network
+  $PYTHON_EXEC -m pip install azure-storage-blob msgraph-sdk
+}
+
+install_gcp() {
+  echo "Install cloud dependencies on controller: GCP"
+  $PYTHON_EXEC -m pip install "google-api-python-client>=2.69.0"
+  $PYTHON_EXEC -m pip install google-cloud-storage
+  if ! gcloud --help > /dev/null 2>&1; then
+    pushd /tmp &>/dev/null
+    mkdir -p ~/.sky/logs
+    wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log
+    tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log
+    rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log
+    mv google-cloud-sdk ~/
+    ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1
+    echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc
+    source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1
+    popd &>/dev/null
+  fi
+}
+
+install_aws() {
+  echo "Install cloud dependencies on controller: AWS"
+  $PYTHON_EXEC -m pip install "botocore>=1.29.10" "boto3>=1.26.1"
+  $PYTHON_EXEC -m pip install "urllib3<2" "awscli>=1.27.10" "colorama<0.4.5"
+}
+
+if [ "$CLOUD" = "azure" ]; then
+  install_azure
+elif [ "$CLOUD" = "gcp" ]; then
+  install_gcp
+elif [ "$CLOUD" = "aws" ]; then
+  install_aws
+else
+  echo "Error: Unknown cloud $CLOUD so not installing any cloud dependencies."
+fi
+
+if [ $? -eq 0 ]; then
+  echo "Successfully installed cloud dependencies on controller: $CLOUD"
+else
+  echo "Error: Failed to install cloud dependencies on controller: $CLOUD"
+fi
diff --git a/sky/clouds/service_catalog/images/provisioners/cuda.sh b/sky/clouds/service_catalog/images/provisioners/cuda.sh
new file mode 100644
index 00000000000..1b2b4ec977e
--- /dev/null
+++ b/sky/clouds/service_catalog/images/provisioners/cuda.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+# This script installs the latest CUDA driver and toolkit version that is compatible with all GPU types.
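+#
+# Sanity check on a booted instance (illustrative commands, not part of the
+# build): `nvidia-smi` prints the driver version and the highest CUDA version
+# it supports, while `nvcc --version` prints the installed toolkit version.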
+# For CUDA driver version, choose the latest version that works for ALL GPU types.
+# GCP: https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#minimum-driver
+# AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
+export DEBIAN_FRONTEND=noninteractive
+
+wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+sudo dpkg -i cuda-keyring_1.1-1_all.deb
+sudo apt-get update
+
+# Make sure CUDA toolkit and driver versions are compatible: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
+# Current state: driver version 535.183.06 (CUDA 12.2), with the CUDA 12.4 toolkit installed below.
+sudo apt-get install -y cuda-drivers-535
+sudo apt-get install -y cuda-toolkit-12-4
+
+# Install cuDNN
+# https://docs.nvidia.com/deeplearning/cudnn/latest/installation/linux.html#installing-on-linux
+sudo apt-get install -y libcudnn8
+sudo apt-get install -y libcudnn8-dev
+
+# Cleanup
+rm cuda-keyring_1.1-1_all.deb
diff --git a/sky/clouds/service_catalog/images/provisioners/docker.sh b/sky/clouds/service_catalog/images/provisioners/docker.sh
new file mode 100644
index 00000000000..da2366408ab
--- /dev/null
+++ b/sky/clouds/service_catalog/images/provisioners/docker.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# Add Docker's official GPG key:
+sudo apt-get update
+sudo apt-get install -y ca-certificates curl
+sudo install -m 0755 -d /etc/apt/keyrings
+sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
+sudo chmod a+r /etc/apt/keyrings/docker.asc
+
+# Add the repository to Apt sources:
+echo \
+  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
+  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
+  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+sudo apt-get update
+
+# Install Docker
+sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
+
+# Add the user to the docker group so that they do not need sudo to run Docker commands
+sudo usermod -aG docker $USER
+newgrp docker
diff --git a/sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh b/sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh
new file mode 100644
index 00000000000..b6b3625176b
--- /dev/null
+++ b/sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+set -e
+
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg &&
+  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |
+  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |
+  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+
+sudo apt-get update
+sudo apt-get install -y nvidia-container-toolkit
+
+# If there is an empty /etc/docker/daemon.json, `nvidia-ctk runtime configure --runtime=docker` will fail
+if [ -f /etc/docker/daemon.json ] && [ !
-s /etc/docker/daemon.json ]; then
+  sudo rm /etc/docker/daemon.json
+fi
+
+sudo nvidia-ctk runtime configure --runtime=docker
+sudo systemctl restart docker
+
+# Validate
+if sudo docker info -f "{{.Runtimes}}" | grep "nvidia-container-runtime"; then
+  echo "Successfully installed NVIDIA container runtime"
+else
+  echo "Failed to install NVIDIA container runtime"
+fi
diff --git a/sky/clouds/service_catalog/images/provisioners/skypilot.sh b/sky/clouds/service_catalog/images/provisioners/skypilot.sh
new file mode 100644
index 00000000000..ff2aa06b2b6
--- /dev/null
+++ b/sky/clouds/service_catalog/images/provisioners/skypilot.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+
+# Stop and disable unattended-upgrades
+sudo systemctl stop unattended-upgrades || true
+sudo systemctl disable unattended-upgrades || true
+sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true
+
+# Configure dpkg
+sudo dpkg --configure --force-overwrite -a
+
+# Apt-get installs
+sudo apt-get install jq -y
+
+# Create necessary directories
+mkdir -p ~/sky_workdir
+mkdir -p ~/.sky/
+mkdir -p ~/.sky/sky_app
+mkdir -p ~/.ssh
+touch ~/.ssh/config
+
+# Install Miniconda
+curl -o Miniconda3-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh
+bash Miniconda3-Linux-x86_64.sh -b
+eval "$(~/miniconda3/bin/conda shell.bash hook)"
+rm Miniconda3-Linux-x86_64.sh
+conda init
+conda config --set auto_activate_base true
+conda activate base
+
+# Conda, Python
+echo "Creating conda env with Python 3.10"
+conda create -y -n skypilot-runtime python=3.10
+conda activate skypilot-runtime
+export PIP_DISABLE_PIP_VERSION_CHECK=1
+echo PATH=$PATH
+python3 -m venv ~/skypilot-runtime
+PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python
+
+# Pip installs
+$PYTHON_EXEC -m pip install "setuptools<70"
+$PYTHON_EXEC -m pip install "grpcio!=1.48.0,<=1.51.3,>=1.42.0"
+$PYTHON_EXEC -m pip install "skypilot-nightly"
+
+# Install ray
+RAY_ADDRESS=127.0.0.1:6380
+$PYTHON_EXEC -m pip install --exists-action w -U ray[default]==2.9.3
+export PATH=$PATH:$HOME/.local/bin
+source ~/skypilot-runtime/bin/activate
+which ray > ~/.sky/ray_path || exit 1
+$PYTHON_EXEC -m pip list | grep "ray " | grep 2.9.3 > /dev/null 2>&1 && {
+  $PYTHON_EXEC -c "from sky.skylet.ray_patches import patch; patch()" || exit 1
+}
+
+# System configurations
+sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf'
+sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'
+sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity
+sudo systemctl daemon-reload
+
+# Stop and disable Jupyter service
+sudo systemctl stop jupyter > /dev/null 2>&1 || true
+sudo systemctl disable jupyter > /dev/null 2>&1 || true
+
+# Configure fuse
+[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf'
+
+# Cleanup
+# Uninstall SkyPilot from the OS image: at `sky launch` time, we install whatever SkyPilot version the user has on their local machine.
+$PYTHON_EXEC -m pip uninstall "skypilot-nightly" -y diff --git a/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..c21fbf51b20 --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl @@ -0,0 +1,47 @@ +variable "region" { + type = string + default = "us-east-1" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "amazon-ebs" "cpu-ubuntu" { + ami_name = "skypilot-aws-cpu-ubuntu-${local.timestamp}" + instance_type = "t2.micro" + region = var.region + ssh_username = "ubuntu" + source_ami_filter { + filters = { + name = "ubuntu/images/*ubuntu-jammy-22.04-amd64-server-*" + root-device-type = "ebs" + virtualization-type = "hvm" + } + most_recent = true + owners = ["099720109477"] + } + launch_block_device_mappings { + device_name = "/dev/sda1" + volume_size = 8 + volume_type = "gp2" + delete_on_termination = true + } +} + +build { + name = "aws-cpu-ubuntu-build" + sources = ["sources.amazon-ebs.cpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=aws", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..c4a8efac4dc --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl @@ -0,0 +1,55 @@ +variable "region" { + type = string + default = "us-east-1" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "amazon-ebs" "gpu-ubuntu" { + ami_name = "skypilot-aws-gpu-ubuntu-${local.timestamp}" + instance_type = "g6.xlarge" + region = var.region + ssh_username = "ubuntu" + source_ami_filter { + filters = { + name = "ubuntu/images/*ubuntu-jammy-22.04-amd64-server-*" + root-device-type = "ebs" + virtualization-type = "hvm" + } + most_recent = true + owners = ["099720109477"] + } + launch_block_device_mappings { + device_name = "/dev/sda1" + volume_size = 30 + volume_type = "gp2" + delete_on_termination = true + } +} + +build { + name = "aws-gpu-ubuntu-build" + sources = [ + "source.amazon-ebs.gpu-ubuntu" + ] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/cuda.sh" + } + provisioner "shell" { + script = "./provisioners/nvidia-container-toolkit.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=aws", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..bf3af0519e4 --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl @@ -0,0 +1,33 @@ + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "googlecompute" "cpu-ubuntu" { + project_id = "sky-dev-465" + image_name = "skypilot-gcp-cpu-ubuntu-${local.timestamp}" + source_image_family = "ubuntu-2204-lts" + zone = "us-west1-a" + image_description = "SkyPilot custom image for launching GCP CPU instances." 
+ tags = ["packer"] + disk_size = 10 + machine_type = "e2-medium" + ssh_username = "gcpuser" +} + +build { + name = "gcp-cpu-ubuntu-build" + sources = ["sources.googlecompute.cpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=gcp", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..f46d414493b --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl @@ -0,0 +1,46 @@ +variable "zone" { + type = string + default = "us-west1-a" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "googlecompute" "gpu-ubuntu" { + image_name = "skypilot-gcp-gpu-ubuntu-${local.timestamp}" + project_id = "sky-dev-465" + source_image_family = "ubuntu-2204-lts" + zone = var.zone + image_description = "SkyPilot custom image for launching GCP GPU instances." + tags = ["packer", "gpu", "ubuntu"] + disk_size = 50 + machine_type = "g2-standard-4" + accelerator_type = "projects/sky-dev-465/zones/${var.zone}/acceleratorTypes/nvidia-l4" + accelerator_count = 1 + on_host_maintenance = "TERMINATE" + ssh_username = "gcpuser" +} + +build { + name = "gcp-gpu-ubuntu-build" + sources = ["sources.googlecompute.gpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/cuda.sh" + } + provisioner "shell" { + script = "./provisioners/nvidia-container-toolkit.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=gcp", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/tests/test_smoke.py b/tests/test_smoke.py index 22084e9c368..ed86f93ca27 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -383,7 +383,7 @@ def test_aws_region(): f'sky exec {name} \'echo $SKYPILOT_CLUSTER_INFO | jq .region | grep us-east-2\'', f'sky logs {name} 2 --status', # Ensure the job succeeded. # A user program should not access SkyPilot runtime env python by default. - f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} || exit 1\'', + f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} && exit 1 || true\'', f'sky logs {name} 3 --status', # Ensure the job succeeded. ], f'sky down -y {name}', @@ -406,7 +406,7 @@ def test_gcp_region_and_service_account(): f'sky exec {name} \'echo $SKYPILOT_CLUSTER_INFO | jq .region | grep us-central1\'', f'sky logs {name} 3 --status', # Ensure the job succeeded. # A user program should not access SkyPilot runtime env python by default. - f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} || exit 1\'', + f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} && exit 1 || true\'', f'sky logs {name} 4 --status', # Ensure the job succeeded. ], f'sky down -y {name}', @@ -446,7 +446,7 @@ def test_azure_region(): f'sky exec {name} \'echo $SKYPILOT_CLUSTER_INFO | jq .zone | grep null\'', f'sky logs {name} 3 --status', # Ensure the job succeeded. # A user program should not access SkyPilot runtime env python by default. 
- f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} || exit 1\'', + f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} && exit 1 || true\'', f'sky logs {name} 4 --status', # Ensure the job succeeded. ], f'sky down -y {name}', @@ -864,14 +864,14 @@ def test_custom_default_conda_env(generic_cloud: str): f'sky launch -c {name} -y --cloud {generic_cloud} tests/test_yamls/test_custom_default_conda_env.yaml', f'sky status -r {name} | grep "UP"', f'sky logs {name} 1 --status', - f'sky logs {name} 1 --no-follow | grep -P "myenv\\s+\\*"', + f'sky logs {name} 1 --no-follow | grep -E "myenv\\s+\\*"', f'sky exec {name} tests/test_yamls/test_custom_default_conda_env.yaml', f'sky logs {name} 2 --status', f'sky autostop -y -i 0 {name}', 'sleep 60', f'sky status -r {name} | grep "STOPPED"', f'sky start -y {name}', - f'sky logs {name} 2 --no-follow | grep -P "myenv\\s+\\*"', + f'sky logs {name} 2 --no-follow | grep -E "myenv\\s+\\*"', f'sky exec {name} tests/test_yamls/test_custom_default_conda_env.yaml', f'sky logs {name} 3 --status', ], f'sky down -y {name}') From 3042a27a496dbd36a5234cfe646909a9e7bc9ad6 Mon Sep 17 00:00:00 2001 From: Yika Date: Thu, 17 Oct 2024 19:02:41 -0700 Subject: [PATCH 58/93] Disable AWS images.csv refreshing (#4116) --- sky/clouds/service_catalog/data_fetchers/fetch_aws.py | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_aws.py b/sky/clouds/service_catalog/data_fetchers/fetch_aws.py index e0e5ffa21a1..b630123648e 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_aws.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_aws.py @@ -538,11 +538,13 @@ def _check_regions_integrity(df: 'pd.DataFrame', name: str): instance_df.to_csv('aws/vms.csv', index=False) print('AWS Service Catalog saved to aws/vms.csv') - image_df = get_all_regions_images_df(user_regions) - _check_regions_integrity(image_df, 'images') + # Disable refreshing images.csv as we are using skypilot custom AMIs + # See sky/clouds/service_catalog/images/README.md for more details. + # image_df = get_all_regions_images_df(user_regions) + # _check_regions_integrity(image_df, 'images') - image_df.to_csv('aws/images.csv', index=False) - print('AWS Images saved to aws/images.csv') + # image_df.to_csv('aws/images.csv', index=False) + # print('AWS Images saved to aws/images.csv') if args.az_mappings: az_mappings_df = fetch_availability_zone_mappings() From a34ccb7d46003fdc66452b2d0ae9dd234972bcfb Mon Sep 17 00:00:00 2001 From: Yika Date: Thu, 17 Oct 2024 21:07:39 -0700 Subject: [PATCH 59/93] [Docs] .skyignore doc (#4114) * [Docs] .skyignore doc * Correct typos Co-authored-by: Zongheng Yang --------- Co-authored-by: Zongheng Yang --- .../examples/syncing-code-artifacts.rst | 53 ++++++++++--------- docs/source/reference/yaml-spec.rst | 4 +- 2 files changed, 30 insertions(+), 27 deletions(-) diff --git a/docs/source/examples/syncing-code-artifacts.rst b/docs/source/examples/syncing-code-artifacts.rst index ded8d03f739..1b05c68b84f 100644 --- a/docs/source/examples/syncing-code-artifacts.rst +++ b/docs/source/examples/syncing-code-artifacts.rst @@ -46,31 +46,7 @@ VMs. The task is invoked under that working directory (so that it can call scripts, access checkpoints, etc.). .. note:: - - **Exclude files from syncing** - - For large, multi-gigabyte workdirs, uploading may be slow because they - are synced to the remote VM(s). 
To exclude large files in
-  your workdir from being uploaded, add them to a :code:`.skyignore` file
-  under your workdir.  :code:`.skyignore` follows RSYNC filter rules.
-
-  Example :code:`.skyignore` file:
-
-  .. code-block::
-
-    # Files that match pattern under ONLY CURRENT directory
-    /hello.py
-    /*.txt
-    /dir
-
-    # Files that match pattern under ALL directories
-    *.txt
-    hello.py
-
-    # Files that match pattern under a directory ./dir/
-    /dir/*.txt
-
-  Do NOT use ``.`` to indicate local directory (e.g. ``./hello.py``).
+   To exclude large files from being uploaded, see :ref:`exclude-uploading-files`.
 
 .. note::
 
@@ -140,6 +116,33 @@
 file_mount may be slow because they are processed by ``rsync``. Use
 :ref:`SkyPilot bucket mounting <sky-storage>` to efficiently handle
 large files.
 
+.. _exclude-uploading-files:
+
+Exclude uploading files
+--------------------------------------
+By default, SkyPilot uses your existing :code:`.gitignore` and :code:`.git/info/exclude` to exclude files from syncing.
+
+Alternatively, you can use :code:`.skyignore` if you want to separate SkyPilot's syncing behavior from Git's.
+If you use a :code:`.skyignore` file, SkyPilot will only exclude files based on that file without using the default Git files.
+
+Any :code:`.skyignore` file under either your workdir or source paths of file_mounts is respected.
+
+:code:`.skyignore` follows RSYNC filter rules, e.g.
+
+.. code-block::
+
+  # Files that match pattern under CURRENT directory
+  /file.txt
+  /dir
+  /*.jar
+  /dir/*.jar
+
+  # Files that match pattern under ALL directories
+  *.jar
+  file.txt
+
+Do *not* use ``.`` to indicate local directory (e.g., instead of ``./file``, write ``/file``).
+
 .. _downloading-files-and-artifacts:
 
 Downloading files and artifacts
diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst
index c5339bcc184..f874b4d37b4 100644
--- a/docs/source/reference/yaml-spec.rst
+++ b/docs/source/reference/yaml-spec.rst
@@ -22,8 +22,8 @@ Available fields:
   # If a relative path is used, it's evaluated relative to the location from
   # which `sky` is called.
   #
-  # To exclude files from syncing, add them to a .skyignore file under your working directory.
-  # Details: https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#uploading-code-and-project-files
+  # To exclude files from syncing, see
+  # https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files
   workdir: ~/my-task-code
 
 # Number of nodes (optional; defaults to 1) to launch including the head node.
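To make the `.skyignore` precedence rule above concrete, here is a minimal sketch (the file contents, cluster name, and task file are hypothetical): when a `.skyignore` exists in the workdir, SkyPilot consults only it and skips the Git-based exclusions.

```bash
# Hypothetical workdir:
#   .gitignore  contains: *.log
#   .skyignore  contains: /data
# Because .skyignore is present, *.log files are uploaded (the .gitignore
# rule is not applied), while the top-level data/ directory is excluded.
sky launch -c mycluster --workdir . task.yaml
```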
From 71a95f4bf7f1446e80bb5c24d23c1695bc4fc031 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Thu, 17 Oct 2024 23:11:46 -0700
Subject: [PATCH 60/93] [Core] Raise error for non-existent cluster when endpoint is called (#4117)

Raise an error for a non-existent cluster.
---
 sky/backends/backend_utils.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py
index 2521fcbcfe5..caa6c9292d5 100644
--- a/sky/backends/backend_utils.py
+++ b/sky/backends/backend_utils.py
@@ -2772,6 +2772,10 @@ def get_endpoints(cluster: str,
     cluster_records = get_clusters(include_controller=True,
                                    refresh=False,
                                    cluster_names=[cluster])
+    if not cluster_records:
+        with ux_utils.print_exception_no_traceback():
+            raise exceptions.ClusterNotUpError(
+                f'Cluster {cluster!r} not found.', cluster_status=None)
     assert len(cluster_records) == 1, cluster_records
     cluster_record = cluster_records[0]
     if (not skip_status_check and

From 7971aa25fb6a5ffc45464be62d1af64fc3f46527 Mon Sep 17 00:00:00 2001
From: Yika
Date: Fri, 18 Oct 2024 16:41:19 -0700
Subject: [PATCH 61/93] Refresh local aws images.csv when image not found (#4127)

Refresh the local AWS images.csv by pulling from the GitHub catalog when an image tag is not found.
---
 sky/clouds/service_catalog/aws_catalog.py | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/sky/clouds/service_catalog/aws_catalog.py b/sky/clouds/service_catalog/aws_catalog.py
index a44750c4ec4..d156135047b 100644
--- a/sky/clouds/service_catalog/aws_catalog.py
+++ b/sky/clouds/service_catalog/aws_catalog.py
@@ -308,7 +308,17 @@ def list_accelerators(
 
 def get_image_id_from_tag(tag: str, region: Optional[str]) -> Optional[str]:
     """Returns the image id from the tag."""
-    return common.get_image_id_from_tag_impl(_image_df, tag, region)
+    global _image_df
+
+    image_id = common.get_image_id_from_tag_impl(_image_df, tag, region)
+    if image_id is None:
+        # Refresh the image catalog and try again if the image tag is not
+        # found.
+        logger.debug('Refreshing the image catalog and trying again.')
+        _image_df = common.read_catalog('aws/images.csv',
+                                        pull_frequency_hours=0)
+        image_id = common.get_image_id_from_tag_impl(_image_df, tag, region)
+    return image_id
 
 
 def is_image_tag_valid(tag: str, region: Optional[str]) -> bool:

From 9201def0ff1ac73681a82a26d46f56d0b027b03b Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 18 Oct 2024 19:29:38 -0700
Subject: [PATCH 62/93] [Docs] News revamps. (#4126)

* News revamps.

updates
updates
updates
updates
updates
updates
updates
updates

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu

---------

Co-authored-by: Zhanghao Wu
---
 README.md | 42 ++++++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 20 deletions(-)

diff --git a/README.md b/README.md
index dc7de3ea574..01b3ab08c8a 100644
--- a/README.md
+++ b/README.md
@@ -26,30 +26,32 @@
 ----
 :fire: *News* :fire:
-- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/)
-- [Sep, 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI.
-- [Jul, 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra -- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) -- [Apr, 2024] Serve **Qwen-110B** on your infra: [**example**](./llm/qwen/) -- [Apr, 2024] Using **Ollama** to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) -- [Feb, 2024] Deploying and scaling **Gemma** with SkyServe: [**example**](./llm/gemma/) -- [Feb, 2024] Serving **Code Llama 70B** with vLLM and SkyServe: [**example**](./llm/codellama/) -- [Dec, 2023] **Mixtral 8x7B**, a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) -- [Nov, 2023] Using **Axolotl** to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) +- [Oct 2024] :tada: **SkyPilot crossed 1M+ downloads** :tada:: Thank you to our community! [**Twitter/X**](https://x.com/skypilot_org/status/1844770841718067638) +- [Sep 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/) +- [Sep 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI. +- [Jun 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) +- [Apr 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/) +- [Apr 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) +- [Feb 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/) +- [Feb 2024] Serving [**Code Llama 70B**](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/) +- [Dec 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) +- [Nov 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) + +**LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
Archived -- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/) -- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) -- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) -- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) -- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) -- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) -- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) -- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/) -- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) -- [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command! +- [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra +- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/) +- [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) +- [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) +- [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) +- [Sep 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) +- [Sep 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) +- [Jul 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/) +- [Jun 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) +- [Apr 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!
From c6ae536d8dfedc3bbcf427a81480382b9d5f4c29 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sat, 19 Oct 2024 15:41:44 -0700 Subject: [PATCH 63/93] [Serve] Support manually terminating a replica and with purge option (#4032) * define replica id param in cli * create endpoint on controller * call controller endpoint to scale down replica * add classmethod decorator * add handler methods for readability in cli * update docstr and error msg, and inline in cli * update log and return err msg * add docstr, catch and reraise err, add stopped and nonexistent message * inline constant to avoid circular import * fix error statement and return encoded str * add purge feature * add purge replica usage in docstr * use .get to handle unexpected packages * fix: diff terminate replica when failed/purging or not * fix: stay up to date for `is_controller_accessible` * revert * up to date with current APIs * error handling * when purged remove record in the main loop * refactor due to reviewer's suggestions * combine functions * fix: terminate the healthy replica even with purge option * remove abbr * Update sky/serve/core.py Co-authored-by: Tian Xia * Update sky/serve/core.py Co-authored-by: Tian Xia * Update sky/serve/controller.py Co-authored-by: Tian Xia * Update sky/serve/controller.py Co-authored-by: Tian Xia * Update sky/cli.py Co-authored-by: Tian Xia * got services hint * check if not yes in the outside if branch * fix some output messages * Update sky/serve/core.py Co-authored-by: Tian Xia * set conflict status code for already scheduled termination * combine purge and normal terminating down branch together * bump version * global exception handler to render a json response with error messages * fix: use responses.JSONResponse for dict serialize * error messages for old controller * fix: check version mismatch in generated code * revert mistakenly change update_service * refine already in terminating message * fix: branch code workaround in cls.build * wording Co-authored-by: Tian Xia * refactor due to reviewer's comments * fix use ux_utils Co-authored-by: Tian Xia * add changelog as comments * fix messages * edit the message for mismatch error Co-authored-by: Tian Xia * no traceback when raising in `terminate_replica` * messages decode * Apply suggestions from code review Co-authored-by: Tian Xia * format * forma * Empty commit --------- Co-authored-by: David Tran Co-authored-by: David Tran Co-authored-by: Tian Xia --- sky/cli.py | 58 +++++++++++++++++++++++------ sky/serve/__init__.py | 2 + sky/serve/constants.py | 9 ++++- sky/serve/controller.py | 70 +++++++++++++++++++++++++++++++++++ sky/serve/core.py | 47 +++++++++++++++++++++++ sky/serve/replica_managers.py | 17 +++++++-- sky/serve/serve_utils.py | 44 +++++++++++++++++++++- 7 files changed, 229 insertions(+), 18 deletions(-) diff --git a/sky/cli.py b/sky/cli.py index 114c18c9256..fb5a38bba7b 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -4380,9 +4380,14 @@ def serve_status(all: bool, endpoint: bool, service_names: List[str]): default=False, required=False, help='Skip confirmation prompt.') +@click.option('--replica-id', + default=None, + type=int, + help='Tear down a given replica') # pylint: disable=redefined-builtin -def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): - """Teardown service(s). +def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool, + replica_id: Optional[int]): + """Teardown service(s) or a replica. SERVICE_NAMES is the name of the service (or glob pattern) to tear down. 
If both SERVICE_NAMES and ``--all`` are supplied, the latter takes precedence. @@ -4408,6 +4413,12 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): \b # Forcefully tear down a service in failed status. sky serve down failed-service --purge + \b + # Tear down a specific replica + sky serve down my-service --replica-id 1 + \b + # Forcefully tear down a specific replica, even in failed status. + sky serve down my-service --replica-id 1 --purge """ if sum([len(service_names) > 0, all]) != 1: argument_str = f'SERVICE_NAMES={",".join(service_names)}' if len( @@ -4417,22 +4428,45 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): 'Can only specify one of SERVICE_NAMES or --all. ' f'Provided {argument_str!r}.') + replica_id_is_defined = replica_id is not None + if replica_id_is_defined: + if len(service_names) != 1: + service_names_str = ', '.join(service_names) + raise click.UsageError(f'The --replica-id option can only be used ' + f'with a single service name. Got: ' + f'{service_names_str}.') + if all: + raise click.UsageError('The --replica-id option cannot be used ' + 'with the --all option.') + backend_utils.is_controller_accessible( controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, stopped_message='All services should have been terminated.', exit_if_not_accessible=True) if not yes: - quoted_service_names = [f'{name!r}' for name in service_names] - service_identity_str = f'service(s) {", ".join(quoted_service_names)}' - if all: - service_identity_str = 'all services' - click.confirm(f'Terminating {service_identity_str}. Proceed?', - default=True, - abort=True, - show_default=True) - - serve_lib.down(service_names=service_names, all=all, purge=purge) + if replica_id_is_defined: + click.confirm( + f'Terminating replica ID {replica_id} in ' + f'{service_names[0]!r}. Proceed?', + default=True, + abort=True, + show_default=True) + else: + quoted_service_names = [f'{name!r}' for name in service_names] + service_identity_str = (f'service(s) ' + f'{", ".join(quoted_service_names)}') + if all: + service_identity_str = 'all services' + click.confirm(f'Terminating {service_identity_str}. Proceed?', + default=True, + abort=True, + show_default=True) + + if replica_id_is_defined: + serve_lib.terminate_replica(service_names[0], replica_id, purge) + else: + serve_lib.down(service_names=service_names, all=all, purge=purge) @serve.command('logs', cls=_DocumentedCodeCommand) diff --git a/sky/serve/__init__.py b/sky/serve/__init__.py index d85b6e9311e..f93495809c3 100644 --- a/sky/serve/__init__.py +++ b/sky/serve/__init__.py @@ -8,6 +8,7 @@ from sky.serve.core import down from sky.serve.core import status from sky.serve.core import tail_logs +from sky.serve.core import terminate_replica from sky.serve.core import up from sky.serve.core import update from sky.serve.serve_state import ReplicaStatus @@ -42,6 +43,7 @@ 'SKY_SERVE_CONTROLLER_NAME', 'SKYSERVE_METADATA_DIR', 'status', + 'terminate_replica', 'tail_logs', 'up', 'update', diff --git a/sky/serve/constants.py b/sky/serve/constants.py index 7775c3f8a6e..3974293190e 100644 --- a/sky/serve/constants.py +++ b/sky/serve/constants.py @@ -92,4 +92,11 @@ # change for the serve_utils.ServeCodeGen, we need to bump this version, so that # the user can be notified to update their SkyPilot serve version on the remote # cluster. -SERVE_VERSION = 1 +# Changelog: +# v1.0 - Introduce rolling update. +# v2.0 - Added template-replica feature. 
+SERVE_VERSION = 2 + +TERMINATE_REPLICA_VERSION_MISMATCH_ERROR = ( + 'The version of service is outdated and does not support manually ' + 'terminating replicas. Please terminate the service and spin up again.') diff --git a/sky/serve/controller.py b/sky/serve/controller.py index 580964273ef..75d14b76079 100644 --- a/sky/serve/controller.py +++ b/sky/serve/controller.py @@ -9,6 +9,7 @@ import traceback from typing import Any, Dict, List +import colorama import fastapi from fastapi import responses import uvicorn @@ -157,6 +158,75 @@ async def update_service(request: fastapi.Request) -> fastapi.Response: return responses.JSONResponse(content={'message': 'Error'}, status_code=500) + @self._app.post('/controller/terminate_replica') + async def terminate_replica( + request: fastapi.Request) -> fastapi.Response: + request_data = await request.json() + replica_id = request_data['replica_id'] + assert isinstance(replica_id, + int), 'Error: replica ID must be an integer.' + purge = request_data['purge'] + assert isinstance(purge, bool), 'Error: purge must be a boolean.' + replica_info = serve_state.get_replica_info_from_id( + self._service_name, replica_id) + assert replica_info is not None, (f'Error: replica ' + f'{replica_id} does not exist.') + replica_status = replica_info.status + + if replica_status == serve_state.ReplicaStatus.SHUTTING_DOWN: + return responses.JSONResponse( + status_code=409, + content={ + 'message': + f'Replica {replica_id} of service ' + f'{self._service_name!r} is already in the process ' + f'of terminating. Skip terminating now.' + }) + + if (replica_status in serve_state.ReplicaStatus.failed_statuses() + and not purge): + return responses.JSONResponse( + status_code=409, + content={ + 'message': f'{colorama.Fore.YELLOW}Replica ' + f'{replica_id} of service ' + f'{self._service_name!r} is in failed ' + f'status ({replica_info.status}). ' + f'Skipping its termination as it could ' + f'lead to a resource leak. ' + f'(Use `sky serve down ' + f'{self._service_name!r} --replica-id ' + f'{replica_id} --purge` to ' + 'forcefully terminate the replica.)' + f'{colorama.Style.RESET_ALL}' + }) + + self._replica_manager.scale_down(replica_id, purge=purge) + + action = 'terminated' if not purge else 'purged' + message = (f'{colorama.Fore.GREEN}Replica {replica_id} of service ' + f'{self._service_name!r} is scheduled to be ' + f'{action}.{colorama.Style.RESET_ALL}\n' + f'Please use {ux_utils.BOLD}sky serve status ' + f'{self._service_name}{ux_utils.RESET_BOLD} ' + f'to check the latest status.') + return responses.JSONResponse(status_code=200, + content={'message': message}) + + @self._app.exception_handler(Exception) + async def validation_exception_handler( + request: fastapi.Request, exc: Exception) -> fastapi.Response: + with ux_utils.enable_traceback(): + logger.error(f'Error in controller: {exc!r}') + return responses.JSONResponse( + status_code=500, + content={ + 'message': + (f'Failed method {request.method} at URL {request.url}.' + f' Exception message is {exc!r}.') + }, + ) + threading.Thread(target=self._run_autoscaler).start() logger.info('SkyServe Controller started on ' diff --git a/sky/serve/core.py b/sky/serve/core.py index 3ad260213f1..691a3edea0b 100644 --- a/sky/serve/core.py +++ b/sky/serve/core.py @@ -503,6 +503,53 @@ def down( sky_logging.print(stdout) +@usage_lib.entrypoint +def terminate_replica(service_name: str, replica_id: int, purge: bool) -> None: + """Tear down a specific replica for the given service. + + Args: + service_name: Name of the service. 
+        replica_id: ID of the replica to terminate.
+        purge: Whether to terminate replicas in a failed status. These replicas
+            may lead to resource leaks, so we require the user to explicitly
+            specify this flag to make sure they are aware of this potential
+            resource leak.
+
+    Raises:
+        sky.exceptions.ClusterNotUpError: if the sky serve controller is not up.
+        RuntimeError: if failed to terminate the replica.
+    """
+    handle = backend_utils.is_controller_accessible(
+        controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER,
+        stopped_message=
+        'No service is running now. Please spin up a service first.',
+        non_existent_message='No service is running now. '
+        'Please spin up a service first.',
+    )
+
+    backend = backend_utils.get_backend_from_handle(handle)
+    assert isinstance(backend, backends.CloudVmRayBackend)
+
+    code = serve_utils.ServeCodeGen.terminate_replica(service_name, replica_id,
+                                                      purge)
+    returncode, stdout, stderr = backend.run_on_head(handle,
+                                                     code,
+                                                     require_outputs=True,
+                                                     stream_logs=False,
+                                                     separate_stderr=True)
+
+    try:
+        subprocess_utils.handle_returncode(returncode,
+                                           code,
+                                           'Failed to terminate the replica',
+                                           stderr,
+                                           stream_logs=True)
+    except exceptions.CommandError as e:
+        raise RuntimeError(e.error_msg) from e
+
+    sky_logging.print(stdout)
+
+
 @usage_lib.entrypoint
 def status(
     service_names: Optional[Union[str,
diff --git a/sky/serve/replica_managers.py b/sky/serve/replica_managers.py
index 337b28ba61b..c0e5220e779 100644
--- a/sky/serve/replica_managers.py
+++ b/sky/serve/replica_managers.py
@@ -247,6 +247,8 @@ class ReplicaStatusProperty:
     is_scale_down: bool = False
     # The replica's spot instance was preempted.
     preempted: bool = False
+    # Whether the replica is purged.
+    purged: bool = False
 
     def remove_terminated_replica(self) -> bool:
         """Whether to remove the replica record from the replica table.
@@ -307,6 +309,8 @@ def should_track_service_status(self) -> bool:
             return False
         if self.preempted:
             return False
+        if self.purged:
+            return False
         return True
 
     def to_replica_status(self) -> serve_state.ReplicaStatus:
@@ -590,7 +594,7 @@ def scale_up(self,
         """
         raise NotImplementedError
 
-    def scale_down(self, replica_id: int) -> None:
+    def scale_down(self, replica_id: int, purge: bool = False) -> None:
         """Scale down replica with replica_id."""
         raise NotImplementedError
 
@@ -679,7 +683,8 @@ def _terminate_replica(self,
                            replica_id: int,
                            sync_down_logs: bool,
                            replica_drain_delay_seconds: int,
-                           is_scale_down: bool = False) -> None:
+                           is_scale_down: bool = False,
+                           purge: bool = False) -> None:
 
         if replica_id in self._launch_process_pool:
             info = serve_state.get_replica_info_from_id(self._service_name,
@@ -763,16 +768,18 @@ def _download_and_stream_logs(info: ReplicaInfo):
         )
         info.status_property.sky_down_status = ProcessStatus.RUNNING
         info.status_property.is_scale_down = is_scale_down
+        info.status_property.purged = purge
         serve_state.add_or_update_replica(self._service_name, replica_id, info)
         p.start()
         self._down_process_pool[replica_id] = p
 
-    def scale_down(self, replica_id: int) -> None:
+    def scale_down(self, replica_id: int, purge: bool = False) -> None:
         self._terminate_replica(
             replica_id,
             sync_down_logs=False,
            replica_drain_delay_seconds=_DEFAULT_DRAIN_SECONDS,
-            is_scale_down=True)
+            is_scale_down=True,
+            purge=purge)
 
     def _handle_preemption(self, info: ReplicaInfo) -> bool:
         """Handle preemption of the replica if any error happened.
@@ -911,6 +918,8 @@ def _refresh_process_pool(self) -> None:
                     # since the user should fix the error before updating.
elif info.version != self.latest_version: removal_reason = 'for version outdated' + elif info.status_property.purged: + removal_reason = 'for purge' else: logger.info(f'Termination of replica {replica_id} ' 'finished. Replica info is kept since some ' diff --git a/sky/serve/serve_utils.py b/sky/serve/serve_utils.py index 0ecf34135a7..cb8b53f9814 100644 --- a/sky/serve/serve_utils.py +++ b/sky/serve/serve_utils.py @@ -313,6 +313,36 @@ def update_service_encoded(service_name: str, version: int, mode: str) -> str: return common_utils.encode_payload(service_msg) +def terminate_replica(service_name: str, replica_id: int, purge: bool) -> str: + service_status = _get_service_status(service_name) + if service_status is None: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Service {service_name!r} does not exist.') + replica_info = serve_state.get_replica_info_from_id(service_name, + replica_id) + if replica_info is None: + with ux_utils.print_exception_no_traceback(): + raise ValueError( + f'Replica {replica_id} for service {service_name} does not ' + 'exist.') + + controller_port = service_status['controller_port'] + resp = requests.post( + _CONTROLLER_URL.format(CONTROLLER_PORT=controller_port) + + '/controller/terminate_replica', + json={ + 'replica_id': replica_id, + 'purge': purge, + }) + + message: str = resp.json()['message'] + if resp.status_code != 200: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Failed to terminate replica {replica_id} ' + f'in {service_name}. Reason:\n{message}') + return message + + def _get_service_status( service_name: str, with_replica_info: bool = True) -> Optional[Dict[str, Any]]: @@ -735,7 +765,7 @@ def _get_replicas(service_record: Dict[str, Any]) -> str: def get_endpoint(service_record: Dict[str, Any]) -> str: - # Don't use backend_utils.is_controller_up since it is too slow. + # Don't use backend_utils.is_controller_accessible since it is too slow. 
handle = global_user_state.get_handle_from_cluster_name( SKY_SERVE_CONTROLLER_NAME) assert isinstance(handle, backends.CloudVmRayResourceHandle) @@ -915,6 +945,18 @@ def terminate_services(cls, service_names: Optional[List[str]], ] return cls._build(code) + @classmethod + def terminate_replica(cls, service_name: str, replica_id: int, + purge: bool) -> str: + code = [ + f'(lambda: print(serve_utils.terminate_replica({service_name!r}, ' + f'{replica_id}, {purge}), end="", flush=True) ' + 'if getattr(constants, "SERVE_VERSION", 0) >= 2 else ' + f'exec("raise RuntimeError(' + f'{constants.TERMINATE_REPLICA_VERSION_MISMATCH_ERROR!r})"))()' + ] + return cls._build(code) + @classmethod def wait_service_registration(cls, service_name: str, job_id: int) -> str: code = [ From 63e96f49a94a3bdd0263baf52de5eb746b4adc77 Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Sun, 20 Oct 2024 13:37:12 -0700 Subject: [PATCH 64/93] [Provisioner] Support docker in Lambda Cloud and TPU (#4115) * [Provisioner] Support docker in Lambda Cloud * fix permission issue * merge with check docker installed * add tpu support & test * patch lambda cloud * add comment --- sky/clouds/azure.py | 1 - sky/clouds/gcp.py | 3 +++ sky/clouds/lambda_cloud.py | 14 +++++++++----- sky/provision/docker_utils.py | 19 ++++++++++++------- sky/provision/paperspace/utils.py | 2 ++ sky/resources.py | 23 +++++++++++------------ sky/templates/lambda-ray.yml.j2 | 20 ++++++++++++++++++++ sky/utils/command_runner.py | 4 +++- 8 files changed, 60 insertions(+), 26 deletions(-) diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py index afa85f48fa5..adffd32ad88 100644 --- a/sky/clouds/azure.py +++ b/sky/clouds/azure.py @@ -329,7 +329,6 @@ def make_deploy_resources_variables( runcmd: - sed -i 's/#Banner none/Banner none/' /etc/ssh/sshd_config - echo '\\nif [ ! -f "/tmp/__restarted" ]; then\\n sudo systemctl restart ssh\\n sleep 2\\n touch /tmp/__restarted\\nfi' >> /home/skypilot:ssh_user/.bashrc - - usermod -aG docker skypilot:ssh_user write_files: - path: /etc/apt/apt.conf.d/20auto-upgrades content: | diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index 0e02f9fd456..1b70abf914d 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -477,6 +477,9 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: 'runtime_version'] resources_vars['tpu_node_name'] = r.accelerator_args.get( 'tpu_name') + # TPU VMs require privileged mode for docker containers to + # access TPU devices. + resources_vars['docker_run_options'] = ['--privileged'] else: # Convert to GCP names: # https://cloud.google.com/compute/docs/gpus diff --git a/sky/clouds/lambda_cloud.py b/sky/clouds/lambda_cloud.py index d2573ebbb29..0201f4f76ad 100644 --- a/sky/clouds/lambda_cloud.py +++ b/sky/clouds/lambda_cloud.py @@ -37,10 +37,6 @@ class Lambda(clouds.Cloud): _CLOUD_UNSUPPORTED_FEATURES = { clouds.CloudImplementationFeatures.STOP: 'Lambda cloud does not support stopping VMs.', clouds.CloudImplementationFeatures.CLONE_DISK_FROM_CLUSTER: f'Migrating disk is currently not supported on {_REPR}.', - clouds.CloudImplementationFeatures.DOCKER_IMAGE: ( - f'Docker image is currently not supported on {_REPR}. ' - 'You can try running docker command inside the `run` section in task.yaml.' 
- ), clouds.CloudImplementationFeatures.SPOT_INSTANCE: f'Spot instances are not supported in {_REPR}.', clouds.CloudImplementationFeatures.IMAGE_ID: f'Specifying image ID is not supported in {_REPR}.', clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER: f'Custom disk tiers are not supported in {_REPR}.', @@ -173,12 +169,20 @@ def make_deploy_resources_variables( else: custom_resources = None - return { + resources_vars = { 'instance_type': resources.instance_type, 'custom_resources': custom_resources, 'region': region.name, } + if acc_dict is not None: + # Lambda cloud's docker runtime information does not contain + # 'nvidia-container-runtime', causing no GPU option is added to + # the docker run command. We patch this by adding it here. + resources_vars['docker_run_options'] = ['--gpus all'] + + return resources_vars + def _get_feasible_launchable_resources( self, resources: 'resources_lib.Resources' ) -> 'resources_utils.FeasibleResources': diff --git a/sky/provision/docker_utils.py b/sky/provision/docker_utils.py index 7bfa1724b83..3ee5d4dfc0c 100644 --- a/sky/provision/docker_utils.py +++ b/sky/provision/docker_utils.py @@ -253,12 +253,13 @@ def initialize(self) -> str: # issue with nvidia container toolkit: # https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 self._run( - '[ -f /etc/docker/daemon.json ] || ' + '{ which jq || sudo apt update && sudo apt install -y jq; } && ' + '{ [ -f /etc/docker/daemon.json ] || ' 'echo "{}" | sudo tee /etc/docker/daemon.json;' 'sudo jq \'.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]\' ' '/etc/docker/daemon.json > /tmp/daemon.json;' 'sudo mv /tmp/daemon.json /etc/docker/daemon.json;' - 'sudo systemctl restart docker') + 'sudo systemctl restart docker; } || true') user_docker_run_options = self.docker_config.get('run_options', []) start_command = docker_start_cmds( specific_image, @@ -335,7 +336,11 @@ def initialize(self) -> str: def _check_docker_installed(self): no_exist = 'NoExist' + # SkyPilot: Add the current user to the docker group first (if needed), + # before checking if docker is installed to avoid permission issues. cleaned_output = self._run( + 'id -nG $USER | grep -qw docker || ' + 'sudo usermod -aG docker $USER > /dev/null 2>&1;' f'command -v {self.docker_cmd} || echo {no_exist!r}') if no_exist in cleaned_output or 'docker' not in cleaned_output: logger.error( @@ -424,8 +429,8 @@ def _auto_configure_shm(self, run_options: List[str]) -> List[str]: def _check_container_exited(self) -> bool: if self.initialized: return True - output = (self._run(check_docker_running_cmd(self.container_name, - self.docker_cmd), - wait_for_docker_daemon=True)) - return 'false' in output.lower( - ) and 'no such object' not in output.lower() + output = self._run(check_docker_running_cmd(self.container_name, + self.docker_cmd), + wait_for_docker_daemon=True) + return ('false' in output.lower() and + 'no such object' not in output.lower()) diff --git a/sky/provision/paperspace/utils.py b/sky/provision/paperspace/utils.py index db2da7b4610..d9eceefba19 100644 --- a/sky/provision/paperspace/utils.py +++ b/sky/provision/paperspace/utils.py @@ -132,6 +132,8 @@ def set_sky_key_script(self, public_key: str) -> None: 'apt-get update \n' 'apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin \n' # pylint: disable=line-too-long 'fi \n' + # TODO(tian): Maybe remove this as well since we are now adding + # users to docker group in the DockerInitializer. Need to test. 
'usermod -aG docker paperspace \n' f'echo "{public_key}" >> /home/paperspace/.ssh/authorized_keys \n') try: diff --git a/sky/resources.py b/sky/resources.py index e9a522cef48..384f2b6a548 100644 --- a/sky/resources.py +++ b/sky/resources.py @@ -842,12 +842,6 @@ def _try_validate_image_id(self) -> None: if self.extract_docker_image() is not None: # TODO(tian): validate the docker image exists / of reasonable size - if self.accelerators is not None: - for acc in self.accelerators.keys(): - if acc.lower().startswith('tpu'): - with ux_utils.print_exception_no_traceback(): - raise ValueError( - 'Docker image is not supported for TPU VM.') if self.cloud is not None: self.cloud.check_features_are_supported( self, {clouds.CloudImplementationFeatures.DOCKER_IMAGE}) @@ -1032,6 +1026,12 @@ def make_deploy_variables(self, cluster_name: resources_utils.ClusterName, self.accelerators is not None): initial_setup_commands = [constants.DISABLE_GPU_ECC_COMMAND] + docker_image = self.extract_docker_image() + + # Cloud specific variables + cloud_specific_variables = self.cloud.make_deploy_resources_variables( + self, cluster_name, region, zones, dryrun) + # Docker run options docker_run_options = skypilot_config.get_nested( ('docker', 'run_options'), @@ -1039,18 +1039,17 @@ def make_deploy_variables(self, cluster_name: resources_utils.ClusterName, override_configs=self.cluster_config_overrides) if isinstance(docker_run_options, str): docker_run_options = [docker_run_options] + # Special accelerator runtime might require additional docker run + # options. e.g., for TPU, we need --privileged. + if 'docker_run_options' in cloud_specific_variables: + docker_run_options.extend( + cloud_specific_variables['docker_run_options']) if docker_run_options and isinstance(self.cloud, clouds.Kubernetes): logger.warning( f'{colorama.Style.DIM}Docker run options are specified, ' 'but ignored for Kubernetes: ' f'{" ".join(docker_run_options)}' f'{colorama.Style.RESET_ALL}') - - docker_image = self.extract_docker_image() - - # Cloud specific variables - cloud_specific_variables = self.cloud.make_deploy_resources_variables( - self, cluster_name, region, zones, dryrun) return dict( cloud_specific_variables, **{ diff --git a/sky/templates/lambda-ray.yml.j2 b/sky/templates/lambda-ray.yml.j2 index c4b8dba1a9f..5df3655c566 100644 --- a/sky/templates/lambda-ray.yml.j2 +++ b/sky/templates/lambda-ray.yml.j2 @@ -5,6 +5,26 @@ max_workers: {{num_nodes - 1}} upscaling_speed: {{num_nodes - 1}} idle_timeout_minutes: 60 +{%- if docker_image is not none %} +docker: + image: {{docker_image}} + container_name: {{docker_container_name}} + run_options: + - --ulimit nofile=1048576:1048576 + {%- for run_option in docker_run_options %} + - {{run_option}} + {%- endfor %} + {%- if docker_login_config is not none %} + docker_login_config: + username: |- + {{docker_login_config.username}} + password: |- + {{docker_login_config.password}} + server: |- + {{docker_login_config.server}} + {%- endif %} +{%- endif %} + provider: type: external module: sky.provision.lambda diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index be6e8346e3d..bbe287d9f79 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -502,8 +502,10 @@ def close_cached_connection(self) -> None: if self.ssh_control_name is not None: control_path = _ssh_control_path(self.ssh_control_name) if control_path is not None: + # Suppress the `Exit request sent.` output for this comamnd + # which would interrupt the CLI spinner. 
cmd = (f'ssh -O exit -S {control_path}/%C ' - f'{self.ssh_user}@{self.ip}') + f'{self.ssh_user}@{self.ip} > /dev/null 2>&1') logger.debug(f'Closing cached connection {control_path!r} with ' f'cmd: {cmd}') log_lib.run_with_log(cmd, From 03c2adbc599ec70e5d260f0fc03702b2638fbb67 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sun, 20 Oct 2024 15:25:45 -0700 Subject: [PATCH 65/93] [Serve] Add `ux_utils.print_exception_no_traceback()` for cleaner error output (#4111) * add `ux_utils.print_exception_no_traceback()` for cleaner error output * Empty commit * remove unnecessary with block --- sky/serve/serve_utils.py | 47 +++++++++++++++++++++++++--------------- 1 file changed, 30 insertions(+), 17 deletions(-) diff --git a/sky/serve/serve_utils.py b/sky/serve/serve_utils.py index cb8b53f9814..85bdd56b9ac 100644 --- a/sky/serve/serve_utils.py +++ b/sky/serve/serve_utils.py @@ -246,9 +246,11 @@ def set_service_status_and_active_versions_from_replica( update_mode: UpdateMode) -> None: record = serve_state.get_service_from_name(service_name) if record is None: - raise ValueError('The service is up-ed in an old version and does not ' - 'support update. Please `sky serve down` ' - 'it first and relaunch the service.') + with ux_utils.print_exception_no_traceback(): + raise ValueError( + 'The service is up-ed in an old version and does not ' + 'support update. Please `sky serve down` ' + 'it first and relaunch the service.') if record['status'] == serve_state.ServiceStatus.SHUTTING_DOWN: # When the service is shutting down, there is a period of time which the # controller still responds to the request, and the replica is not @@ -289,7 +291,8 @@ def update_service_status() -> None: def update_service_encoded(service_name: str, version: int, mode: str) -> str: service_status = _get_service_status(service_name) if service_status is None: - raise ValueError(f'Service {service_name!r} does not exist.') + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Service {service_name!r} does not exist.') controller_port = service_status['controller_port'] resp = requests.post( _CONTROLLER_URL.format(CONTROLLER_PORT=controller_port) + @@ -299,15 +302,21 @@ def update_service_encoded(service_name: str, version: int, mode: str) -> str: 'mode': mode, }) if resp.status_code == 404: - raise ValueError('The service is up-ed in an old version and does not ' - 'support update. Please `sky serve down` ' - 'it first and relaunch the service. ') + with ux_utils.print_exception_no_traceback(): + raise ValueError( + 'The service is up-ed in an old version and does not ' + 'support update. Please `sky serve down` ' + 'it first and relaunch the service. 
') elif resp.status_code == 400: - raise ValueError(f'Client error during service update: {resp.text}') + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Client error during service update: {resp.text}') elif resp.status_code == 500: - raise RuntimeError(f'Server error during service update: {resp.text}') + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + f'Server error during service update: {resp.text}') elif resp.status_code != 200: - raise ValueError(f'Failed to update service: {resp.text}') + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Failed to update service: {resp.text}') service_msg = resp.json()['message'] return common_utils.encode_payload(service_msg) @@ -557,10 +566,12 @@ def load_service_initialization_result(payload: str) -> int: def check_service_status_healthy(service_name: str) -> Optional[str]: service_record = serve_state.get_service_from_name(service_name) if service_record is None: - return f'Service {service_name!r} does not exist.' + with ux_utils.print_exception_no_traceback(): + return f'Service {service_name!r} does not exist.' if service_record['status'] == serve_state.ServiceStatus.CONTROLLER_INIT: - return (f'Service {service_name!r} is still initializing its ' - 'controller. Please try again later.') + with ux_utils.print_exception_no_traceback(): + return (f'Service {service_name!r} is still initializing its ' + 'controller. Please try again later.') return None @@ -663,8 +674,9 @@ def stream_replica_logs(service_name: str, replica_id: int, launch_log_file_name = generate_replica_launch_log_file_name( service_name, replica_id) if not os.path.exists(launch_log_file_name): - return (f'{colorama.Fore.RED}Replica {replica_id} doesn\'t exist.' - f'{colorama.Style.RESET_ALL}') + with ux_utils.print_exception_no_traceback(): + return (f'{colorama.Fore.RED}Replica {replica_id} doesn\'t exist.' + f'{colorama.Style.RESET_ALL}') replica_cluster_name = generate_replica_cluster_name( service_name, replica_id) @@ -674,8 +686,9 @@ def _get_replica_status() -> serve_state.ReplicaStatus: for info in replica_info: if info.replica_id == replica_id: return info.status - raise ValueError( - _FAILED_TO_FIND_REPLICA_MSG.format(replica_id=replica_id)) + with ux_utils.print_exception_no_traceback(): + raise ValueError( + _FAILED_TO_FIND_REPLICA_MSG.format(replica_id=replica_id)) finish_stream = ( lambda: _get_replica_status() != serve_state.ReplicaStatus.PROVISIONING) From 985df832e0fad24f31f09c80aeeb7c5e219e085f Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sun, 20 Oct 2024 16:13:11 -0700 Subject: [PATCH 66/93] Partially revert: Remove unnecessary `ux_utils.print_exception_no_traceback()` wrappers (#4130) fix unnecessary with block for returning --- sky/serve/serve_utils.py | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/sky/serve/serve_utils.py b/sky/serve/serve_utils.py index 85bdd56b9ac..3a416dd2932 100644 --- a/sky/serve/serve_utils.py +++ b/sky/serve/serve_utils.py @@ -566,12 +566,10 @@ def load_service_initialization_result(payload: str) -> int: def check_service_status_healthy(service_name: str) -> Optional[str]: service_record = serve_state.get_service_from_name(service_name) if service_record is None: - with ux_utils.print_exception_no_traceback(): - return f'Service {service_name!r} does not exist.' + return f'Service {service_name!r} does not exist.' 
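For context on this revert: a helper like `ux_utils.print_exception_no_traceback()` only changes how an exception is rendered while it propagates out of the `with` block, so wrapping a plain `return` in it has no effect. Below is a minimal sketch of the pattern, assuming an implementation that temporarily sets `sys.tracebacklimit` to zero; the real helper may differ.

```python
# Minimal sketch of a traceback-suppressing context manager. Illustrative
# only; not necessarily SkyPilot's actual implementation.
import contextlib
import sys


@contextlib.contextmanager
def print_exception_no_traceback():
    original = getattr(sys, 'tracebacklimit', 1000)  # CPython default depth.
    sys.tracebacklimit = 0  # Hide stack frames for exceptions leaving here.
    try:
        yield
    finally:
        sys.tracebacklimit = original


def demo(service_name: str) -> None:
    record = None  # Stand-in for serve_state.get_service_from_name(...).
    if record is None:
        # Effective: the raised ValueError prints without a stack trace.
        with print_exception_no_traceback():
            raise ValueError(f'Service {service_name!r} does not exist.')
    # Pointless: no exception is raised, so the helper would do nothing:
    #     with print_exception_no_traceback():
    #         return some_message
```

Since a bare `return` never raises, the blocks removed in this patch were dead weight.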
if service_record['status'] == serve_state.ServiceStatus.CONTROLLER_INIT: - with ux_utils.print_exception_no_traceback(): - return (f'Service {service_name!r} is still initializing its ' - 'controller. Please try again later.') + return (f'Service {service_name!r} is still initializing its ' + 'controller. Please try again later.') return None @@ -674,9 +672,8 @@ def stream_replica_logs(service_name: str, replica_id: int, launch_log_file_name = generate_replica_launch_log_file_name( service_name, replica_id) if not os.path.exists(launch_log_file_name): - with ux_utils.print_exception_no_traceback(): - return (f'{colorama.Fore.RED}Replica {replica_id} doesn\'t exist.' - f'{colorama.Style.RESET_ALL}') + return (f'{colorama.Fore.RED}Replica {replica_id} doesn\'t exist.' + f'{colorama.Style.RESET_ALL}') replica_cluster_name = generate_replica_cluster_name( service_name, replica_id) From 067a0a35dfcb4789b1f72a083a7309814a807e80 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Sun, 20 Oct 2024 18:26:46 -0700 Subject: [PATCH 67/93] [examples] Deepspeed fixes + k8s support (#4124) deepspeed kubernetes fixes --- examples/deepspeed-multinode/sky.yaml | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/examples/deepspeed-multinode/sky.yaml b/examples/deepspeed-multinode/sky.yaml index 37d7445a2a1..07bd3746894 100644 --- a/examples/deepspeed-multinode/sky.yaml +++ b/examples/deepspeed-multinode/sky.yaml @@ -2,10 +2,16 @@ # # This takes care constructing a "hostfile" to pass to DeepSpeed. # +# If running on Kubernetes, use the nvidia/cuda:12.1.1-devel-ubuntu20.04 image +# because DeepSpeed requires nvcc. +# # Usage: # # $ sky launch sky.yaml -r --down -c ds # +# If running on Kubernetes: +# $ sky launch sky.yaml -r --down -c ds --cloud kubernetes --image nvidia/cuda:12.1.1-devel-ubuntu20.04 +# # # Optional: After the job starts running, you can log into the two nodes and # # check gpustat: # $ ssh ds @@ -18,6 +24,7 @@ resources: # accelerators: A100-80GB:1 # Azure, GCP, SCP # accelerators: A10G:1 # AWS. Will OOM for (1) single_node/run_1.3b_lora.sh (2) multi_node/run_66b.sh. # accelerators: T4:1 # AWS, Azure, GCP. Will OOM for (1) single_node/run_1.3b_lora.sh (2) multi_node/run_66b.sh. + # image_id: docker:nvidia/cuda:12.1.1-devel-ubuntu20.04 # Use this image if running on Kubernetes num_nodes: 2 @@ -28,6 +35,13 @@ envs: DEEPSPEED_ENVS: "MY_VAR_1,MY_VAR_2,SKYPILOT_NODE_RANK" setup: | + if ! command -v git &> /dev/null + then + echo "git is not installed. Installing git..." + sudo apt-get update + sudo apt-get install -y git + fi + git clone https://github.com/microsoft/DeepSpeedExamples.git || true cd DeepSpeedExamples git checkout d7c42b4f34df91035e7ed3e0c51500bb53d0bc71 @@ -39,16 +53,19 @@ setup: | conda create -n deepspeed python=3.8 -y conda activate deepspeed - pip install deepspeed + pip install deepspeed==0.14.4 cd applications/DeepSpeed-Chat pip install -r requirements.txt + + pip install transformers==4.44.0 # Required by DeepSpeed in multi-node settings. # # NOTE(skypilot): DeepSpeed uses `pdsh` to log into each node and calls # `ninja --version`; so it has to be installed system-wide rather than in # the above 'deepspeed' conda env. + sudo apt-get update sudo apt-get -y install pdsh ninja-build fi From 3c3bcee5cfe720a96ab67f4049a557a79e7f077f Mon Sep 17 00:00:00 2001 From: Hysun He Date: Mon, 21 Oct 2024 12:13:51 +0800 Subject: [PATCH 68/93] [OCI] Support more OS types in addition to ubuntu (#4080) * Bug fix for sky config file path resolution. 
* format * [OCI] Bug fix for image_id in Task YAML * [OCI]: Support more OS types (esp. oraclelinux) in addition to ubuntu. * format * Disable system firewall * Bug fix for validation of the Marketplace images * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu * variable/function naming * address review comments: not to change the service_catalog api. call oci_catalog directly for get os type for a image. * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu * address review comments --------- Co-authored-by: Zhanghao Wu --- sky/clouds/oci.py | 73 +++++++++++------------ sky/clouds/service_catalog/oci_catalog.py | 22 +++++++ sky/clouds/utils/oci_utils.py | 12 +++- sky/templates/oci-ray.yml.j2 | 14 ++++- 4 files changed, 81 insertions(+), 40 deletions(-) diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index f4ac4d577e3..810e43fe3b5 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -17,6 +17,8 @@ make_deploy_resources_variables(): Bug fix for specify the image_id as the ocid of the image in the task.yaml file, in this case the image_id for the node config should be set to the ocid instead of a dict. + - Hysun He (hysun.he@oracle.com) @ Oct 13, 2024: + Support more OS types additional to ubuntu for OCI resources. """ import json import logging @@ -295,10 +297,21 @@ def make_deploy_resources_variables( cpus=None if cpus is None else float(cpus), disk_tier=resources.disk_tier) + image_str = self._get_image_str(image_id=resources.image_id, + instance_type=resources.instance_type, + region=region.name) + + # pylint: disable=import-outside-toplevel + from sky.clouds.service_catalog import oci_catalog + os_type = oci_catalog.get_image_os_from_tag(tag=image_str, + region=region.name) + logger.debug(f'OS type for the image {image_str} is {os_type}') + return { 'instance_type': instance_type, 'custom_resources': custom_resources, 'region': region.name, + 'os_type': os_type, 'cpus': str(cpus), 'memory': resources.memory, 'disk_size': resources.disk_size, @@ -501,59 +514,45 @@ def _get_image_id( region_name: str, instance_type: str, ) -> str: - if image_id is None: - return self._get_default_image(region_name=region_name, - instance_type=instance_type) - if None in image_id: - image_id_str = image_id[None] - else: - assert region_name in image_id, image_id - image_id_str = image_id[region_name] + image_id_str = self._get_image_str(image_id=image_id, + instance_type=instance_type, + region=region_name) + if image_id_str.startswith('skypilot:'): image_id_str = service_catalog.get_image_id_from_tag(image_id_str, region_name, clouds='oci') - if image_id_str is None: - logger.critical( - '! Real image_id not found! - {region_name}:{image_id}') - # Raise ResourcesUnavailableError to make sure the failover - # in CloudVMRayBackend will be correctly triggered. - # TODO(zhwu): This is a information leakage to the cloud - # implementor, we need to find a better way to handle this. - raise exceptions.ResourcesUnavailableError( - '! ERR: No image found in catalog for region ' - f'{region_name}. Try setting a valid image_id.') + + # Image_id should be impossible be None, except for the case when + # user specify an image tag which does not exist in the image.csv + # catalog file which only possible in "test" / "evaluation" phase. + # Therefore, we use assert here. 
+ assert image_id_str is not None logger.debug(f'Got real image_id {image_id_str}') return image_id_str - def _get_default_image(self, region_name: str, instance_type: str) -> str: + def _get_image_str(self, image_id: Optional[Dict[Optional[str], str]], + instance_type: str, region: str): + if image_id is None: + image_str = self._get_default_image_tag(instance_type) + elif None in image_id: + image_str = image_id[None] + else: + assert region in image_id, image_id + image_str = image_id[region] + return image_str + + def _get_default_image_tag(self, instance_type: str) -> str: acc = self.get_accelerators_from_instance_type(instance_type) if acc is None: image_tag = oci_utils.oci_config.get_default_image_tag() - image_id_str = service_catalog.get_image_id_from_tag(image_tag, - region_name, - clouds='oci') else: assert len(acc) == 1, acc image_tag = oci_utils.oci_config.get_default_gpu_image_tag() - image_id_str = service_catalog.get_image_id_from_tag(image_tag, - region_name, - clouds='oci') - if image_id_str is not None: - logger.debug( - f'Got default image_id {image_id_str} from tag {image_tag}') - return image_id_str - - # Raise ResourcesUnavailableError to make sure the failover in - # CloudVMRayBackend will be correctly triggered. - # TODO(zhwu): This is a information leakage to the cloud implementor, - # we need to find a better way to handle this. - raise exceptions.ResourcesUnavailableError( - 'ERR: No image found in catalog for region ' - f'{region_name}. Try update your default image_id settings.') + return image_tag def get_vpu_from_disktier( self, cpus: Optional[float], diff --git a/sky/clouds/service_catalog/oci_catalog.py b/sky/clouds/service_catalog/oci_catalog.py index a18dee79be5..47d0489f6ab 100644 --- a/sky/clouds/service_catalog/oci_catalog.py +++ b/sky/clouds/service_catalog/oci_catalog.py @@ -7,6 +7,8 @@ - Hysun He (hysun.he@oracle.com) @ Apr, 2023: Initial implementation - Hysun He (hysun.he@oracle.com) @ Jun, 2023: Reduce retry times by excluding those unsubscribed regions. + - Hysun He (hysun.he@oracle.com) @ Oct 14, 2024: Bug fix for validation + of the Marketplace images """ import logging @@ -206,4 +208,24 @@ def get_image_id_from_tag(tag: str, region: Optional[str]) -> Optional[str]: def is_image_tag_valid(tag: str, region: Optional[str]) -> bool: """Returns whether the image tag is valid.""" + # Oct.14, 2024 by Hysun He: Marketplace images are region neutral, so don't + # check with region for the Marketplace images. + df = _image_df[_image_df['Tag'].str.fullmatch(tag)] + if df.empty: + return False + app_catalog_listing_id = df['AppCatalogListingId'].iloc[0] + if app_catalog_listing_id: + return True return common.is_image_tag_valid_impl(_image_df, tag, region) + + +def get_image_os_from_tag(tag: str, region: Optional[str]) -> Optional[str]: + del region + df = _image_df[_image_df['Tag'].str.fullmatch(tag)] + if df.empty: + os_type = oci_utils.oci_config.get_default_image_os() + else: + os_type = df['OS'].iloc[0] + + logger.debug(f'Operation system for the image {tag} is {os_type}') + return os_type diff --git a/sky/clouds/utils/oci_utils.py b/sky/clouds/utils/oci_utils.py index 3d11bab24da..86647071f3e 100644 --- a/sky/clouds/utils/oci_utils.py +++ b/sky/clouds/utils/oci_utils.py @@ -1,7 +1,9 @@ """OCI Configuration. 
History: - - Zhanghao Wu @ Oct 2023: Formatting and refactoring - Hysun He (hysun.he@oracle.com) @ Apr, 2023: Initial implementation + - Zhanghao Wu @ Oct 2023: Formatting and refactoring + - Hysun He (hysun.he@oracle.com) @ Oct, 2024: Add default image OS + configuration. """ import logging import os @@ -121,5 +123,13 @@ def get_profile(cls) -> str: return skypilot_config.get_nested( ('oci', 'default', 'oci_config_profile'), 'DEFAULT') + @classmethod + def get_default_image_os(cls) -> str: + # Get the default image OS. Instead of hardcoding, we give a choice to + # set the default image OS type in the sky's user-config file. (if not + # specified, use the hardcode one at last) + return skypilot_config.get_nested(('oci', 'default', 'image_os_type'), + 'ubuntu') + oci_config = OCIConfig() diff --git a/sky/templates/oci-ray.yml.j2 b/sky/templates/oci-ray.yml.j2 index 32bd6326ee2..64fa4e745c7 100644 --- a/sky/templates/oci-ray.yml.j2 +++ b/sky/templates/oci-ray.yml.j2 @@ -16,7 +16,11 @@ provider: disable_launch_config_check: true auth: +{% if os_type == "ubuntu" %} ssh_user: ubuntu +{% else %} + ssh_user: opc +{% endif %} ssh_private_key: {{ssh_private_key}} available_node_types: @@ -85,14 +89,20 @@ setup_commands: # Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase. # Line 'mkdir -p ..': disable host key check # Line 'python3 -c ..': patch the buggy ray files and enable `-o allow_other` option for `goofys` - - sudo systemctl stop unattended-upgrades || true; + - echo "setup commands runs at $(date)" > /tmp/provision.tmp.out || true; + {%- if os_type == "ubuntu" %} + sudo systemctl stop unattended-upgrades || true; sudo systemctl disable unattended-upgrades || true; sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true; sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1` || true; sudo pkill -9 apt-get; sudo pkill -9 dpkg; sudo dpkg --configure -a; - ([ `sudo lshw -class display | grep "NVIDIA Corporation" | wc -l` -gt 0 ]) && (sudo which nvidia-smi > /dev/null || ( sudo apt-get install nvidia-driver-530-open -y && sudo apt-get install nvidia-driver-525-server -y ) || true); + {%- else %} + sudo /usr/libexec/oci-growfs -y || true; + sudo systemctl stop firewalld || true; + sudo systemctl disable firewalld || true; + {%- endif %} mkdir -p ~/.ssh; touch ~/.ssh/config; {{ conda_installation_commands }} {{ ray_skypilot_installation_commands }} From 900819da6707d5ba5e3b9e97dca3cf9e3fda8c48 Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Mon, 21 Oct 2024 18:01:59 -0700 Subject: [PATCH 69/93] [Catalog] Silently ignore TPU price not found. (#4134) * [Catalog] Silently ignore TPU price not found. 
* assert for non tpu v6e * format --- sky/clouds/service_catalog/data_fetchers/fetch_gcp.py | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py index 097efe74deb..6550c6bbe64 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py @@ -681,7 +681,13 @@ def get_tpu_price(row: pd.Series, spot: bool) -> Optional[float]: spot_str = 'spot ' if spot else '' print(f'The {spot_str}price of {tpu_name} in {tpu_region} is ' 'not found in SKUs or hidden TPU price DF.') - assert spot or tpu_price is not None, (row, hidden_tpu, HIDDEN_TPU_DF) + # TODO(tian): Hack. Should investigate how to retrieve the price + # for TPU-v6e. + if not tpu_name.startswith('tpu-v6e'): + assert spot or tpu_price is not None, (row, hidden_tpu, + HIDDEN_TPU_DF) + else: + tpu_price = 0.0 return tpu_price df['Price'] = df.apply(lambda row: get_tpu_price(row, spot=False), axis=1) From f5d4f64dd42e831546df0982fa2e46d280a74cbd Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Tue, 22 Oct 2024 09:05:48 -0700 Subject: [PATCH 70/93] [docs] Update GPUs used in docs (#4138) * Change V100 to H100 * updates * update --- README.md | 2 +- docs/source/cloud-setup/quota.rst | 13 +++--- docs/source/getting-started/quickstart.rst | 12 +++--- docs/source/reference/job-queue.rst | 40 +++++++++---------- .../reference/kubernetes/kubernetes-setup.rst | 2 +- docs/source/reference/yaml-spec.rst | 12 +++--- docs/source/running-jobs/distributed-jobs.rst | 4 +- llm/vllm/README.md | 4 +- 8 files changed, 45 insertions(+), 44 deletions(-) diff --git a/README.md b/README.md index 01b3ab08c8a..2629cc4e4c8 100644 --- a/README.md +++ b/README.md @@ -110,7 +110,7 @@ Paste the following into a file `my_task.yaml`: ```yaml resources: - accelerators: V100:1 # 1x NVIDIA V100 GPU + accelerators: A100:8 # 8x NVIDIA A100 GPU num_nodes: 1 # Number of VMs to launch diff --git a/docs/source/cloud-setup/quota.rst b/docs/source/cloud-setup/quota.rst index 35042c7bab1..f30862b75fd 100644 --- a/docs/source/cloud-setup/quota.rst +++ b/docs/source/cloud-setup/quota.rst @@ -5,7 +5,7 @@ Requesting Quota Increase Most cloud providers enforce a quota policy to limit the number of VM instances that can exist in a given region. -Users may encounter `QuotaExceeded` or `VcpuLimitExceeded` errors during resources provisioning, especially for high end GPUs such as V100/A100. +Users may encounter `QuotaExceeded` or `VcpuLimitExceeded` errors during resources provisioning, especially for high end GPUs such as H100/A100. To check or increase your quota limits, please follow the below instructions. After submitting the request, it will usually take a few days for the support team to review. To increase chances of being approved, you may respond their inquiry emails on how the requested resources will be used your projects. @@ -34,7 +34,7 @@ Azure - For Deployment model, ensure **Resource Manager** is selected. - For Locations, select all regions in which you want to increase quotas. - For each region you selected, select one or more VM series from the Quotas drop-down list. - - For each VM Series you selected (e.g., ``NCSv3``, ``NDv2`` for V100 instances), enter the new vCPU limit that you want for this subscription. You may check `for more VM Series `_. 
+ - For each VM Series you selected (e.g., ``ND_H100_v5`` for H100 instances), enter the new vCPU limit that you want for this subscription. You may check `for more VM Series `_. - When you're finished, select **Save and continue**. 5. Enter or confirm your contact details, then select **Next**. @@ -45,10 +45,11 @@ GCP 1. In the Google Cloud Console, go to the `Quota page `_. 2. Click **Filter** and select ``Service: Compute Engine API``. -3. Choose ``Limit Name: instance_name``. (e.g., ``NVIDIA-V100-GPUS-per-project-region``). You may check the `the compute GPU list `_. -4. Select the checkbox of the region whose quota you want to change. -5. Click **Edit Quotas** and fill out the new limit. -6. Click **Submit Request**. +3. For H100 GPUs: choose ``metric: GPUS_PER_GPU_FAMILY`` and select dimension ``gpu_family: NVIDIA_H100``. +4. For all other GPUs: choose ``Limit Name: instance_name``. (e.g., ``NVIDIA-V100-GPUS-per-project-region``). You may check the `the compute GPU list `_. +5. Select the checkbox of the region whose quota you want to change. +6. Click **Edit Quotas** and fill out the new limit. +7. Click **Submit Request**. OCI ------------------------------- diff --git a/docs/source/getting-started/quickstart.rst b/docs/source/getting-started/quickstart.rst index cdef2335dd7..f7574194317 100644 --- a/docs/source/getting-started/quickstart.rst +++ b/docs/source/getting-started/quickstart.rst @@ -31,8 +31,8 @@ Copy the following YAML into a ``hello_sky.yaml`` file: resources: # Optional; if left out, automatically pick the cheapest cloud. cloud: aws - # 1x NVIDIA V100 GPU - accelerators: V100:1 + # 8x NVIDIA A100 GPU + accelerators: A100:8 # Working directory (optional) containing the project codebase. # Its contents are synced to ~/sky_workdir/ on the cluster. @@ -106,7 +106,7 @@ Bash commands are also supported, such as: .. code-block:: console $ sky exec mycluster python train_cpu.py - $ sky exec mycluster --gpus=V100:1 python train_gpu.py + $ sky exec mycluster --gpus=A100:8 python train_gpu.py For interactive/monitoring commands, such as ``htop`` or ``gpustat -i``, use ``ssh`` instead (see below) to avoid job submission overheads. @@ -124,9 +124,9 @@ This may show multiple clusters, if you have created several: .. code-block:: - NAME LAUNCHED RESOURCES COMMAND STATUS - mygcp 1 day ago 1x GCP(n1-highmem-8) sky launch -c mygcp --cloud gcp STOPPED - mycluster 4 mins ago 1x AWS(p3.2xlarge) sky exec mycluster hello_sky.yaml UP + NAME LAUNCHED RESOURCES COMMAND STATUS + mygcp 1 day ago 1x GCP(n1-highmem-8) sky launch -c mygcp --cloud gcp STOPPED + mycluster 4 mins ago 1x AWS(p4d.24xlarge, {'A100': 8}) sky exec mycluster hello_sky.yaml UP See here for a list of all possible :ref:`cluster states `. diff --git a/docs/source/reference/job-queue.rst b/docs/source/reference/job-queue.rst index c0016c4d6da..4cb8d3b915c 100644 --- a/docs/source/reference/job-queue.rst +++ b/docs/source/reference/job-queue.rst @@ -57,14 +57,14 @@ First, create a :code:`cluster.yaml` to specify the desired cluster: num_nodes: 4 resources: - accelerators: V100:8 + accelerators: H100:8 workdir: ... setup: | # Install dependencies. ... -Use :code:`sky launch -c mycluster cluster.yaml` to provision a 4-node (each having 8 V100 GPUs) cluster. +Use :code:`sky launch -c mycluster cluster.yaml` to provision a 4-node (each having 8 H100 GPUs) cluster. The :code:`num_nodes` field is used to specify how many nodes are required. 
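The admission rule implied here is simple: a task starts only when enough nodes each have the requested number of free GPUs, and otherwise waits in the queue. The following toy model (not SkyPilot's actual scheduler) sketches that accounting; it uses floats so fractional requests such as `H100:0.5` fit the same logic:

```python
# Toy sketch of GPU-count-based admission; not SkyPilot's real scheduler.
from typing import Dict, List


def pick_nodes(free_gpus: Dict[str, float], gpus_per_node: float,
               num_nodes: int) -> List[str]:
    """Return nodes that can host the task, or [] if it must queue."""
    fits = [node for node, free in free_gpus.items() if free >= gpus_per_node]
    return fits[:num_nodes] if len(fits) >= num_nodes else []


free = {'node-0': 8.0, 'node-1': 8.0, 'node-2': 2.0, 'node-3': 0.0}
# A 2-node task asking for 4 GPUs per node fits on node-0 and node-1.
assert pick_nodes(free, gpus_per_node=4, num_nodes=2) == ['node-0', 'node-1']
# A 3-node task asking for 4 GPUs per node must wait in the queue.
assert pick_nodes(free, gpus_per_node=4, num_nodes=3) == []
```

When a running job releases GPUs, queued tasks are re-checked against the updated free counts, which produces the immediate-start behavior described later in this page.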
Next, create a :code:`task.yaml` to specify each task: @@ -73,13 +73,13 @@ Next, create a :code:`task.yaml` to specify each task: num_nodes: 2 resources: - accelerators: V100:4 + accelerators: H100:4 run: | # Run training script. ... -This specifies a task that needs to be run on 2 nodes, each of which must have 4 free V100s. +This specifies a task that needs to be run on 2 nodes, each of which must have 4 free H100s. Use :code:`sky exec mycluster task.yaml` to submit this task, which will be scheduled correctly by the job queue. @@ -107,18 +107,18 @@ To submit multiple trials with different hyperparameters to a cluster: .. code-block:: bash - $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3 - $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3 - $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4 - $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2 - $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6 + $ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-3 + $ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 3e-3 + $ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-4 + $ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-2 + $ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-6 Options used: - :code:`--gpus`: specify the resource requirement for each job. - :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently. -If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the +If there are only 4 H100 GPUs on the cluster, SkyPilot will queue 1 job while the other 4 run in parallel. Once a job finishes, the next job will begin executing immediately. See :ref:`below ` for more details on SkyPilot's scheduling behavior. @@ -131,12 +131,12 @@ Example: Fractional GPUs ------------------------- To run multiple trials per GPU, use *fractional GPUs* in the resource requirement. -For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU: +For example, use :code:`--gpus H100:0.5` to make 2 trials share 1 GPU: .. code-block:: bash - $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3 - $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3 + $ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 1e-3 + $ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 3e-3 ... When sharing a GPU, ensure that the GPU's memory is not oversubscribed @@ -168,12 +168,12 @@ In that tutorial, we have a task YAML that specifies these resource requirements # dnn.yaml ... resources: - accelerators: V100:4 + accelerators: H100:4 ... Since a new cluster was created when we ran :code:`sky launch -c lm-cluster dnn.yaml`, SkyPilot provisioned the cluster with exactly the same resources as those -required for the task. Thus, :code:`lm-cluster` has 4 V100 GPUs. +required for the task. Thus, :code:`lm-cluster` has 4 H100 GPUs. While this initial job is running, let us submit more tasks: @@ -182,12 +182,12 @@ While this initial job is running, let us submit more tasks: $ # Launch 4 jobs, perhaps with different hyperparameters. $ # You can override the task name with `-n` (optional) and $ # the resource requirement with `--gpus` (optional). 
- $ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=V100:1 - $ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=V100:1 - $ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=V100:4 - $ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=V100:2 + $ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=H100:1 + $ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=H100:1 + $ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=H100:4 + $ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=H100:2 -Because the cluster has only 4 V100 GPUs, we will see the following sequence of events: +Because the cluster has only 4 H100 GPUs, we will see the following sequence of events: - The initial :code:`sky launch` job is running and occupies 4 GPUs; all other jobs are pending (no free GPUs). - The first two :code:`sky exec` jobs (job2, job3) then start running and occupy 1 GPU each. diff --git a/docs/source/reference/kubernetes/kubernetes-setup.rst b/docs/source/reference/kubernetes/kubernetes-setup.rst index 6ae8d7e61f6..a827d49ea19 100644 --- a/docs/source/reference/kubernetes/kubernetes-setup.rst +++ b/docs/source/reference/kubernetes/kubernetes-setup.rst @@ -182,7 +182,7 @@ Manually Labelling Nodes You can also manually label nodes, if required. Labels must be of the format ``skypilot.co/accelerator: `` where ```` is the lowercase name of the GPU. -For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerator: v100`. +For example, a node with H100 GPUs must have a label :code:`skypilot.co/accelerator: h100`. Use the following command to label a node: diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst index f874b4d37b4..7c298dd4079 100644 --- a/docs/source/reference/yaml-spec.rst +++ b/docs/source/reference/yaml-spec.rst @@ -51,18 +51,18 @@ Available fields: # # To specify a single type of accelerator: # Format: : (or simply , short for a count of 1). - # accelerators: V100:4 + # accelerators: H100:4 # # To specify an ordered list of accelerators (try the accelerators in # the specified order): # Format: [:, ...] - # accelerators: ['K80:1', 'V100:1', 'T4:1'] + # accelerators: ['L4:1', 'H100:1', 'A100:1'] # # To specify an unordered set of accelerators (optimize all specified # accelerators together, and try accelerator with lowest cost first): # Format: {:, ...} - # accelerators: {'K80:1', 'V100:1', 'T4:1'} - accelerators: V100:4 + # accelerators: {'L4:1', 'H100:1', 'A100:1'} + accelerators: H100:8 # Number of vCPUs per node (optional). # @@ -249,9 +249,9 @@ Available fields: any_of: - cloud: aws region: us-west-2 - acceraltors: V100 + accelerators: H100 - cloud: gcp - acceraltors: A100 + accelerators: H100 # Environment variables (optional). 
These values can be accessed in the diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index da3ddd8e94f..22bea04593e 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -14,7 +14,7 @@ For example, here is a simple PyTorch Distributed training example: name: resnet-distributed-app resources: - accelerators: V100:4 + accelerators: A100:8 num_nodes: 2 @@ -42,7 +42,7 @@ For example, here is a simple PyTorch Distributed training example: In the above, -- :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s; +- :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 8 A100s; - The highlighted lines in the ``run`` section show common environment variables that are useful for launching distributed training, explained below. .. note:: diff --git a/llm/vllm/README.md b/llm/vllm/README.md index 9fb3c0c1364..78617f3746d 100644 --- a/llm/vllm/README.md +++ b/llm/vllm/README.md @@ -29,9 +29,9 @@ Before you get started, you need to have access to the Llama-2 model weights on ```bash sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN ``` -**Optional**: Only GCP offers the specified L4 GPUs currently. To use other clouds, use the `--gpus` flag to request other GPUs. For example, to use V100 GPUs: +**Optional**: Only GCP offers the specified L4 GPUs currently. To use other clouds, use the `--gpus` flag to request other GPUs. For example, to use H100 GPUs: ```bash -sky launch -c vllm-llama2 serve-openai-api.yaml --gpus V100:1 --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN +sky launch -c vllm-llama2 serve-openai-api.yaml --gpus H100:1 --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN ``` **Tip**: You can also use the vLLM docker container for faster setup. Refer to [serve-openai-api-docker.yaml](https://github.com/skypilot-org/skypilot/tree/master/llm/vllm/serve-openai-api-docker.yaml) for more. 
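A note on the `SKYPILOT_*` variables used in the distributed-jobs example above: they are plain environment variables, so a task's `run` section can consume them from Python as easily as from shell. The sketch below is illustrative only; it assumes the documented layout in which `SKYPILOT_NODE_IPS` holds one IP per line, and builds a DeepSpeed-style hostfile similar to the one patch 67's example constructs in shell.

```python
# Illustrative sketch: derive a DeepSpeed/pdsh 'hostfile' from the
# environment variables SkyPilot injects into each task. Assumes
# SKYPILOT_NODE_IPS is a newline-separated list of node IPs.
import os


def write_hostfile(path: str = 'hostfile') -> str:
    ips = os.environ['SKYPILOT_NODE_IPS'].strip().splitlines()
    slots = int(os.environ.get('SKYPILOT_NUM_GPUS_PER_NODE', '1'))
    with open(path, 'w', encoding='utf-8') as f:
        for ip in ips:
            # Hostfile format: '<ip> slots=<num_gpus_on_that_node>'.
            f.write(f'{ip} slots={slots}\n')
    return path


if __name__ == '__main__':
    if int(os.environ.get('SKYPILOT_NODE_RANK', '0')) == 0:
        # Only rank 0 (the head node) needs to drive the launch.
        print(f'wrote {write_hostfile()}')
```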
From 36044f450c06addf70a15039376dec9df8e03371 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Tue, 22 Oct 2024 14:57:48 -0700 Subject: [PATCH 71/93] [k8s] Fix GPU labeling for EKS (#4146) Fix GPU labelling --- sky/utils/kubernetes/k8s_gpu_labeler_job.yaml | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/sky/utils/kubernetes/k8s_gpu_labeler_job.yaml b/sky/utils/kubernetes/k8s_gpu_labeler_job.yaml index f2283bab4c3..78ad13facdc 100644 --- a/sky/utils/kubernetes/k8s_gpu_labeler_job.yaml +++ b/sky/utils/kubernetes/k8s_gpu_labeler_job.yaml @@ -14,9 +14,10 @@ spec: containers: - name: gpu-labeler image: us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu:latest # Using this image also serves as a way to "pre-pull" the image onto nodes - command: - - "python" - - "/label_gpus.py" + command: ["/bin/bash", "-i", "-c"] + args: + - | + python /label_gpus.py env: - name: MY_NODE_NAME valueFrom: From baf0bfb91e13dc6d2e76279cf7d339986001f01d Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Tue, 22 Oct 2024 15:11:18 -0700 Subject: [PATCH 72/93] [k8s] Handle @ in context name (#4147) Handle @ in context name --- sky/utils/command_runner.py | 5 +++-- sky/utils/kubernetes/rsync_helper.sh | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index bbe287d9f79..7eae76040d8 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -838,8 +838,9 @@ def get_remote_home_dir() -> str: # default delimiter for options and arguments. # rsync_helper.sh will parse the namespace_context by reverting the # encoding and pass it to kubectl exec. - encoded_namespace_context = namespace_context.replace( - ':', '%3A').replace('/', '%2F').replace('+', '%2B') + encoded_namespace_context = (namespace_context.replace( + '@', '%40').replace(':', '%3A').replace('/', + '%2F').replace('+', '%2B')) self._rsync( source, target, diff --git a/sky/utils/kubernetes/rsync_helper.sh b/sky/utils/kubernetes/rsync_helper.sh index 0ee93d8521a..79bd5fa79f8 100755 --- a/sky/utils/kubernetes/rsync_helper.sh +++ b/sky/utils/kubernetes/rsync_helper.sh @@ -7,7 +7,7 @@ shift echo "pod: $pod" >&2 encoded_namespace_context=$1 # Revert the encoded namespace+context to the original string. 
-namespace_context=$(echo "$encoded_namespace_context" | sed 's|%3A|:|g' | sed 's|%2B|+|g' | sed 's|%2F|/|g') +namespace_context=$(echo "$encoded_namespace_context" | sed 's|%40|@|g' | sed 's|%3A|:|g' | sed 's|%2B|+|g' | sed 's|%2F|/|g') echo "namespace_context: $namespace_context" >&2 namespace=$(echo $namespace_context | cut -d+ -f1) echo "namespace: $namespace" >&2 From f2991b144d4b15eac55dd7f759f361b6146033b3 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Tue, 22 Oct 2024 15:29:29 -0700 Subject: [PATCH 73/93] [Docs] Typo in distributed jobs docs (#4149) minor typo --- docs/source/running-jobs/distributed-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 22bea04593e..f6c8cba9c9d 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -33,7 +33,7 @@ For example, here is a simple PyTorch Distributed training example: MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1` torchrun \ - --nnodes=$SKPILOT_NUM_NODES \ + --nnodes=$SKYPILOT_NUM_NODES \ --master_addr=$MASTER_ADDR \ --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \ --node_rank=$SKYPILOT_NODE_RANK \ From 8283e4cecb1677ae48df937bea746d7bc538ee3a Mon Sep 17 00:00:00 2001 From: Yika Date: Wed, 23 Oct 2024 16:43:29 -0700 Subject: [PATCH 74/93] [Performance] Refactor Azure SDK usage (#4139) * [Performance] Refactor Azure SDK usage * lazy import and address comments * address comments --- sky/adaptors/azure.py | 11 + sky/provision/azure/azure-vm-template.json | 301 --------------------- sky/provision/azure/config.py | 1 + sky/provision/azure/instance.py | 247 +++++++++++------ sky/templates/azure-ray.yml.j2 | 2 + 5 files changed, 174 insertions(+), 388 deletions(-) delete mode 100644 sky/provision/azure/azure-vm-template.json diff --git a/sky/adaptors/azure.py b/sky/adaptors/azure.py index 61d8d14352e..2752129e305 100644 --- a/sky/adaptors/azure.py +++ b/sky/adaptors/azure.py @@ -69,6 +69,17 @@ def exceptions(): return azure_exceptions +@functools.lru_cache() +@common.load_lazy_modules(modules=_LAZY_MODULES) +def azure_mgmt_models(name: str): + if name == 'compute': + from azure.mgmt.compute import models + return models + elif name == 'network': + from azure.mgmt.network import models + return models + + # We should keep the order of the decorators having 'lru_cache' followed # by 'load_lazy_modules' as we need to make sure a caller can call # 'get_client.cache_clear', which is a function provided by 'lru_cache' diff --git a/sky/provision/azure/azure-vm-template.json b/sky/provision/azure/azure-vm-template.json deleted file mode 100644 index 52e82dc532c..00000000000 --- a/sky/provision/azure/azure-vm-template.json +++ /dev/null @@ -1,301 +0,0 @@ -{ - "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#", - "contentVersion": "1.0.0.0", - "parameters": { - "vmName": { - "type": "string", - "metadata": { - "description": "The name of you Virtual Machine." - } - }, - "adminUsername": { - "type": "string", - "metadata": { - "description": "Username for the Virtual Machine." 
- } - }, - "publicKey": { - "type": "securestring", - "metadata": { - "description": "SSH Key for the Virtual Machine" - } - }, - "imagePublisher": { - "type": "string", - "metadata": { - "description": "The publisher of the VM image" - } - }, - "imageOffer": { - "type": "string", - "metadata": { - "description": "The offer of the VM image" - } - }, - "imageSku": { - "type": "string", - "metadata": { - "description": "The sku of the VM image" - } - }, - "imageVersion": { - "type": "string", - "metadata": { - "description": "The version of the VM image" - } - }, - "vmSize": { - "type": "string", - "metadata": { - "description": "The size of the VM" - } - }, - "vmTags": { - "type": "object", - "metadata": { - "description": "Tags for the VM" - } - }, - "vmCount": { - "type": "int", - "metadata": { - "description": "Number of VMs to deploy" - } - }, - "provisionPublicIp": { - "type": "bool", - "defaultValue": true, - "metadata": { - "description": "If true creates a public ip" - } - }, - "priority": { - "type": "string", - "defaultValue": "Regular", - "metadata": { - "description": "Specifies the priority for the virtual machine." - } - }, - "billingProfile": { - "type": "object", - "defaultValue": {}, - "metadata": { - "description": "Specifies the maximum price to pay for Azure Spot VM." - } - }, - "osDiskSizeGB": { - "type": "int", - "metadata": { - "description": "OS disk size in GBs." - } - }, - "msi": { - "type": "string", - "metadata": { - "description": "Managed service identity resource id." - } - }, - "nsg": { - "type": "string", - "metadata": { - "description": "Network security group resource id." - } - }, - "subnet": { - "type": "string", - "metadata": { - "descriptions": "Subnet resource id." - } - }, - "osDiskTier": { - "type": "string", - "allowedValues": [ - "Premium_LRS", - "StandardSSD_LRS", - "Standard_LRS" - ], - "metadata": { - "description": "OS disk tier." - } - }, - "cloudInitSetupCommands": { - "type": "string", - "metadata": { - "description": "Base64 encoded cloud-init setup commands." 
- } - } - }, - "variables": { - "location": "[resourceGroup().location]", - "networkInterfaceNamePrivate": "[concat(parameters('vmName'), '-nic')]", - "networkInterfaceNamePublic": "[concat(parameters('vmName'), '-nic-public')]", - "networkInterfaceName": "[if(parameters('provisionPublicIp'), variables('networkInterfaceNamePublic'), variables('networkInterfaceNamePrivate'))]", - "networkIpConfig": "[guid(resourceGroup().id, parameters('vmName'))]", - "publicIpAddressName": "[concat(parameters('vmName'), '-ip')]" - }, - "resources": [ - { - "type": "Microsoft.Network/networkInterfaces", - "apiVersion": "2020-06-01", - "name": "[concat(variables('networkInterfaceNamePublic'), copyIndex())]", - "location": "[variables('location')]", - "dependsOn": [ - "[resourceId('Microsoft.Network/publicIpAddresses/', concat(variables('publicIpAddressName'), copyIndex()))]" - ], - "copy": { - "name": "NICPublicCopy", - "count": "[parameters('vmCount')]" - }, - "properties": { - "ipConfigurations": [ - { - "name": "[variables('networkIpConfig')]", - "properties": { - "subnet": { - "id": "[parameters('subnet')]" - }, - "privateIPAllocationMethod": "Dynamic", - "publicIpAddress": { - "id": "[resourceId('Microsoft.Network/publicIPAddresses', concat(variables('publicIPAddressName'), copyIndex()))]" - } - } - } - ], - "networkSecurityGroup": { - "id": "[parameters('nsg')]" - } - }, - "condition": "[parameters('provisionPublicIp')]" - }, - { - "type": "Microsoft.Network/networkInterfaces", - "apiVersion": "2020-06-01", - "name": "[concat(variables('networkInterfaceNamePrivate'), copyIndex())]", - "location": "[variables('location')]", - "copy": { - "name": "NICPrivateCopy", - "count": "[parameters('vmCount')]" - }, - "properties": { - "ipConfigurations": [ - { - "name": "[variables('networkIpConfig')]", - "properties": { - "subnet": { - "id": "[parameters('subnet')]" - }, - "privateIPAllocationMethod": "Dynamic" - } - } - ], - "networkSecurityGroup": { - "id": "[parameters('nsg')]" - } - }, - "condition": "[not(parameters('provisionPublicIp'))]" - }, - { - "type": "Microsoft.Network/publicIpAddresses", - "apiVersion": "2019-02-01", - "name": "[concat(variables('publicIpAddressName'), copyIndex())]", - "location": "[variables('location')]", - "properties": { - "publicIpAllocationMethod": "Static", - "publicIPAddressVersion": "IPv4" - }, - "copy": { - "name": "PublicIpCopy", - "count": "[parameters('vmCount')]" - }, - "sku": { - "name": "Basic", - "tier": "Regional" - }, - "condition": "[parameters('provisionPublicIp')]" - }, - { - "type": "Microsoft.Compute/virtualMachines", - "apiVersion": "2019-03-01", - "name": "[concat(parameters('vmName'), copyIndex())]", - "location": "[variables('location')]", - "dependsOn": [ - "[resourceId('Microsoft.Network/networkInterfaces/', concat(variables('networkInterfaceName'), copyIndex()))]" - ], - "copy": { - "name": "VmCopy", - "count": "[parameters('vmCount')]" - }, - "tags": "[parameters('vmTags')]", - "properties": { - "hardwareProfile": { - "vmSize": "[parameters('vmSize')]" - }, - "storageProfile": { - "osDisk": { - "createOption": "fromImage", - "managedDisk": { - "storageAccountType": "[parameters('osDiskTier')]" - }, - "diskSizeGB": "[parameters('osDiskSizeGB')]" - }, - "imageReference": { - "publisher": "[parameters('imagePublisher')]", - "offer": "[parameters('imageOffer')]", - "sku": "[parameters('imageSku')]", - "version": "[parameters('imageVersion')]" - } - }, - "networkProfile": { - "networkInterfaces": [ - { - "id": 
"[resourceId('Microsoft.Network/networkInterfaces', concat(variables('networkInterfaceName'), copyIndex()))]" - } - ] - }, - "osProfile": { - "computerName": "[concat(parameters('vmName'), copyIndex())]", - "adminUsername": "[parameters('adminUsername')]", - "adminPassword": "[parameters('publicKey')]", - "linuxConfiguration": { - "disablePasswordAuthentication": true, - "ssh": { - "publicKeys": [ - { - "path": "[concat('/home/', parameters('adminUsername'), '/.ssh/authorized_keys')]", - "keyData": "[parameters('publicKey')]" - } - ] - } - }, - "customData": "[parameters('cloudInitSetupCommands')]" - }, - "priority": "[parameters('priority')]", - "billingProfile": "[parameters('billingProfile')]" - }, - "identity": { - "type": "UserAssigned", - "userAssignedIdentities": { - "[parameters('msi')]": { - } - } - } - } - ], - "outputs": { - "publicIp": { - "type": "array", - "copy": { - "count": "[parameters('vmCount')]", - "input": "[reference(concat(variables('publicIpAddressName'), copyIndex())).ipAddress]" - }, - "condition": "[parameters('provisionPublicIp')]" - }, - "privateIp": { - "type": "array", - "copy": { - "count": "[parameters('vmCount')]", - "input": "[reference(concat(variables('networkInterfaceName'), copyIndex())).ipConfigurations[0].properties.privateIPAddress]" - } - } - } -} diff --git a/sky/provision/azure/config.py b/sky/provision/azure/config.py index b3cb357512a..22982a99075 100644 --- a/sky/provision/azure/config.py +++ b/sky/provision/azure/config.py @@ -46,6 +46,7 @@ def bootstrap_instances( region: str, cluster_name_on_cloud: str, config: common.ProvisionConfig) -> common.ProvisionConfig: """See sky/provision/__init__.py""" + # TODO: use new azure sdk instead of ARM deployment. del region # unused provider_config = config.provider_config subscription_id = provider_config.get('subscription_id') diff --git a/sky/provision/azure/instance.py b/sky/provision/azure/instance.py index 3c5ed8801a4..f6c865e29c8 100644 --- a/sky/provision/azure/instance.py +++ b/sky/provision/azure/instance.py @@ -2,10 +2,8 @@ import base64 import copy import enum -import json import logging from multiprocessing import pool -import pathlib import time import typing from typing import Any, Callable, Dict, List, Optional, Tuple @@ -23,7 +21,9 @@ if typing.TYPE_CHECKING: from azure.mgmt import compute as azure_compute - from azure.mgmt import resource as azure_resource + from azure.mgmt import network as azure_network + from azure.mgmt.compute import models as azure_compute_models + from azure.mgmt.network import models as azure_network_models logger = sky_logging.init_logger(__name__) @@ -184,14 +184,150 @@ def _get_head_instance_id(instances: List) -> Optional[str]: return head_instance_id -def _create_instances( - compute_client: 'azure_compute.ComputeManagementClient', - resource_client: 'azure_resource.ResourceManagementClient', - cluster_name_on_cloud: str, resource_group: str, - provider_config: Dict[str, Any], node_config: Dict[str, Any], - tags: Dict[str, str], count: int) -> List: +def _create_network_interface( + network_client: 'azure_network.NetworkManagementClient', vm_name: str, + provider_config: Dict[str, + Any]) -> 'azure_network_models.NetworkInterface': + network = azure.azure_mgmt_models('network') + compute = azure.azure_mgmt_models('compute') + logger.info(f'Start creating network interface for {vm_name}...') + if provider_config.get('use_internal_ips', False): + name = f'{vm_name}-nic-private' + ip_config = network.IPConfiguration( + name=f'ip-config-private-{vm_name}', + 
subnet=compute.SubResource(id=provider_config['subnet']), + private_ip_allocation_method=network.IPAllocationMethod.DYNAMIC) + else: + name = f'{vm_name}-nic-public' + public_ip_address = network.PublicIPAddress( + location=provider_config['location'], + public_ip_allocation_method='Static', + public_ip_address_version='IPv4', + sku=network.PublicIPAddressSku(name='Basic', tier='Regional')) + ip_poller = network_client.public_ip_addresses.begin_create_or_update( + resource_group_name=provider_config['resource_group'], + public_ip_address_name=f'{vm_name}-ip', + parameters=public_ip_address) + logger.info(f'Created public IP address {ip_poller.result().name} ' + f'with address {ip_poller.result().ip_address}.') + ip_config = network.IPConfiguration( + name=f'ip-config-public-{vm_name}', + subnet=compute.SubResource(id=provider_config['subnet']), + private_ip_allocation_method=network.IPAllocationMethod.DYNAMIC, + public_ip_address=network.PublicIPAddress(id=ip_poller.result().id)) + + ni_poller = network_client.network_interfaces.begin_create_or_update( + resource_group_name=provider_config['resource_group'], + network_interface_name=name, + parameters=network.NetworkInterface( + location=provider_config['location'], + ip_configurations=[ip_config], + network_security_group=network.NetworkSecurityGroup( + id=provider_config['nsg']))) + logger.info(f'Created network interface {ni_poller.result().name}.') + return ni_poller.result() + + +def _create_vm( + compute_client: 'azure_compute.ComputeManagementClient', vm_name: str, + node_tags: Dict[str, str], provider_config: Dict[str, Any], + node_config: Dict[str, Any], + network_interface_id: str) -> 'azure_compute_models.VirtualMachine': + compute = azure.azure_mgmt_models('compute') + logger.info(f'Start creating VM {vm_name}...') + hardware_profile = compute.HardwareProfile( + vm_size=node_config['azure_arm_parameters']['vmSize']) + network_profile = compute.NetworkProfile(network_interfaces=[ + compute.NetworkInterfaceReference(id=network_interface_id, primary=True) + ]) + public_key = node_config['azure_arm_parameters']['publicKey'] + username = node_config['azure_arm_parameters']['adminUsername'] + os_linux_custom_data = base64.b64encode( + node_config['azure_arm_parameters']['cloudInitSetupCommands'].encode( + 'utf-8')).decode('utf-8') + os_profile = compute.OSProfile( + admin_username=username, + computer_name=vm_name, + admin_password=public_key, + linux_configuration=compute.LinuxConfiguration( + disable_password_authentication=True, + ssh=compute.SshConfiguration(public_keys=[ + compute.SshPublicKey( + path=f'/home/{username}/.ssh/authorized_keys', + key_data=public_key) + ])), + custom_data=os_linux_custom_data) + community_image_id = node_config['azure_arm_parameters'].get( + 'communityGalleryImageId', None) + if community_image_id is not None: + # Prioritize using community gallery image if specified. 
+ image_reference = compute.ImageReference( + community_gallery_image_id=community_image_id) + logger.info( + f'Used community_image_id: {community_image_id} for VM {vm_name}.') + else: + image_reference = compute.ImageReference( + publisher=node_config['azure_arm_parameters']['imagePublisher'], + offer=node_config['azure_arm_parameters']['imageOffer'], + sku=node_config['azure_arm_parameters']['imageSku'], + version=node_config['azure_arm_parameters']['imageVersion']) + storage_profile = compute.StorageProfile( + image_reference=image_reference, + os_disk=compute.OSDisk( + create_option=compute.DiskCreateOptionTypes.FROM_IMAGE, + managed_disk=compute.ManagedDiskParameters( + storage_account_type=node_config['azure_arm_parameters'] + ['osDiskTier']), + disk_size_gb=node_config['azure_arm_parameters']['osDiskSizeGB'])) + vm_instance = compute.VirtualMachine( + location=provider_config['location'], + tags=node_tags, + hardware_profile=hardware_profile, + os_profile=os_profile, + storage_profile=storage_profile, + network_profile=network_profile, + identity=compute.VirtualMachineIdentity( + type='UserAssigned', + user_assigned_identities={provider_config['msi']: {}})) + vm_poller = compute_client.virtual_machines.begin_create_or_update( + resource_group_name=provider_config['resource_group'], + vm_name=vm_name, + parameters=vm_instance, + ) + # poller.result() will block on async operation until it's done. + logger.info(f'Created VM {vm_poller.result().name}.') + # Configure driver extension for A10 GPUs. A10 GPUs requires a + # special type of drivers which is available at Microsoft HPC + # extension. Reference: + # https://forums.developer.nvidia.com/t/ubuntu-22-04-installation-driver-error-nvidia-a10/285195/2 + # This can take more than 20mins for setting up the A10 GPUs + if node_config.get('need_nvidia_driver_extension', False): + ext_poller = compute_client.virtual_machine_extensions.\ + begin_create_or_update( + resource_group_name=provider_config['resource_group'], + vm_name=vm_name, + vm_extension_name='NvidiaGpuDriverLinux', + extension_parameters=compute.VirtualMachineExtension( + location=provider_config['location'], + publisher='Microsoft.HpcCompute', + type_properties_type='NvidiaGpuDriverLinux', + type_handler_version='1.9', + auto_upgrade_minor_version=True, + settings='{}')) + logger.info( + f'Created VM extension {ext_poller.result().name} for VM {vm_name}.' 
+ ) + return vm_poller.result() + + +def _create_instances(compute_client: 'azure_compute.ComputeManagementClient', + network_client: 'azure_network.NetworkManagementClient', + cluster_name_on_cloud: str, resource_group: str, + provider_config: Dict[str, Any], node_config: Dict[str, + Any], + tags: Dict[str, str], count: int) -> List: vm_id = uuid4().hex[:UNIQUE_ID_LEN] - tags = { + all_tags = { constants.TAG_RAY_CLUSTER_NAME: cluster_name_on_cloud, constants.TAG_SKYPILOT_CLUSTER_NAME: cluster_name_on_cloud, **constants.WORKER_NODE_TAGS, @@ -199,83 +335,19 @@ def _create_instances( **tags, } node_tags = node_config['tags'].copy() - node_tags.update(tags) - - # load the template file - current_path = pathlib.Path(__file__).parent - template_path = current_path.joinpath('azure-vm-template.json') - with open(template_path, 'r', encoding='utf-8') as template_fp: - template = json.load(template_fp) - - vm_name = f'{cluster_name_on_cloud}-{vm_id}' - use_internal_ips = provider_config.get('use_internal_ips', False) + node_tags.update(all_tags) - template_params = node_config['azure_arm_parameters'].copy() - # We don't include 'head' or 'worker' in the VM name as on Azure the VM - # name is immutable and we may change the node type for existing VM in the - # multi-node cluster, due to manual termination of the head node. - template_params['vmName'] = vm_name - template_params['provisionPublicIp'] = not use_internal_ips - template_params['vmTags'] = node_tags - template_params['vmCount'] = count - template_params['msi'] = provider_config['msi'] - template_params['nsg'] = provider_config['nsg'] - template_params['subnet'] = provider_config['subnet'] - # In Azure, cloud-init script must be encoded in base64. For more - # information, see: - # https://learn.microsoft.com/en-us/azure/virtual-machines/custom-data - template_params['cloudInitSetupCommands'] = (base64.b64encode( - template_params['cloudInitSetupCommands'].encode('utf-8')).decode( - 'utf-8')) + # Create VM instances in parallel. + def create_single_instance(vm_i): + vm_name = f'{cluster_name_on_cloud}-{vm_id}-{vm_i}' + network_interface = _create_network_interface(network_client, vm_name, + provider_config) + _create_vm(compute_client, vm_name, node_tags, provider_config, + node_config, network_interface.id) - if node_config.get('need_nvidia_driver_extension', False): - # pylint: disable=line-too-long - # Configure driver extension for A10 GPUs. A10 GPUs requires a - # special type of drivers which is available at Microsoft HPC - # extension. 
Reference: https://forums.developer.nvidia.com/t/ubuntu-22-04-installation-driver-error-nvidia-a10/285195/2 - for r in template['resources']: - if r['type'] == 'Microsoft.Compute/virtualMachines': - # Add a nested extension resource for A10 GPUs - r['resources'] = [ - { - 'type': 'extensions', - 'apiVersion': '2015-06-15', - 'location': '[variables(\'location\')]', - 'dependsOn': [ - '[concat(\'Microsoft.Compute/virtualMachines/\', parameters(\'vmName\'), copyIndex())]' - ], - 'name': 'NvidiaGpuDriverLinux', - 'properties': { - 'publisher': 'Microsoft.HpcCompute', - 'type': 'NvidiaGpuDriverLinux', - 'typeHandlerVersion': '1.9', - 'autoUpgradeMinorVersion': True, - 'settings': {}, - }, - }, - ] - break - - parameters = { - 'properties': { - 'mode': azure.deployment_mode().incremental, - 'template': template, - 'parameters': { - key: { - 'value': value - } for key, value in template_params.items() - }, - } - } - - create_or_update = _get_azure_sdk_function( - client=resource_client.deployments, function_name='create_or_update') - create_or_update( - resource_group_name=resource_group, - deployment_name=vm_name, - parameters=parameters, - ).wait() + subprocess_utils.run_in_parallel(create_single_instance, range(count)) + # Update disk performance tier performance_tier = node_config.get('disk_performance_tier', None) if performance_tier is not None: disks = compute_client.disks.list_by_resource_group(resource_group) @@ -286,12 +358,14 @@ def _create_instances( f'az disk update -n {name} -g {resource_group} ' f'--set tier={performance_tier}') + # Validation filters = { constants.TAG_RAY_CLUSTER_NAME: cluster_name_on_cloud, _TAG_SKYPILOT_VM_ID: vm_id } instances = _filter_instances(compute_client, resource_group, filters) assert len(instances) == count, (len(instances), count) + return instances @@ -303,7 +377,7 @@ def run_instances(region: str, cluster_name_on_cloud: str, resource_group = provider_config['resource_group'] subscription_id = provider_config['subscription_id'] compute_client = azure.get_client('compute', subscription_id) - + network_client = azure.get_client('network', subscription_id) instances_to_resume = [] resumed_instance_ids: List[str] = [] created_instance_ids: List[str] = [] @@ -439,12 +513,11 @@ def _create_instance_tag(target_instance, is_head: bool = True) -> str: to_start_count -= len(resumed_instance_ids) if to_start_count > 0: - resource_client = azure.get_client('resource', subscription_id) logger.debug(f'run_instances: Creating {to_start_count} instances.') try: created_instances = _create_instances( compute_client=compute_client, - resource_client=resource_client, + network_client=network_client, cluster_name_on_cloud=cluster_name_on_cloud, resource_group=resource_group, provider_config=provider_config, diff --git a/sky/templates/azure-ray.yml.j2 b/sky/templates/azure-ray.yml.j2 index 77ddda6652f..b956530fccc 100644 --- a/sky/templates/azure-ray.yml.j2 +++ b/sky/templates/azure-ray.yml.j2 @@ -67,6 +67,8 @@ available_node_types: imageOffer: {{image_offer}} imageSku: "{{image_sku}}" imageVersion: {{image_version}} + # Community Gallery Image ID + communityGalleryImageId: {{community_gallery_image_id}} osDiskSizeGB: {{disk_size}} osDiskTier: {{disk_tier}} {%- if use_spot %} From 99db03fa1f835b75cc53bfd27fa1d84e01127076 Mon Sep 17 00:00:00 2001 From: Christopher Cooper Date: Wed, 23 Oct 2024 19:05:26 -0700 Subject: [PATCH 75/93] [cli] remove shell reload message for fish (#4150) * remove shell reload message for fish Fish will automatically pick up the completions and 
does not need to be reloaded. * add comment about fish reload command --- sky/cli.py | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/sky/cli.py b/sky/cli.py index fb5a38bba7b..087a2be5e5f 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -339,7 +339,6 @@ def _get_shell_complete_args(complete_fn): _RELOAD_ZSH_CMD = 'source ~/.zshrc' -_RELOAD_FISH_CMD = 'source ~/.config/fish/config.fish' _RELOAD_BASH_CMD = 'source ~/.bashrc' @@ -378,7 +377,9 @@ def _install_shell_completion(ctx: click.Context, param: click.Parameter, cmd = '_SKY_COMPLETE=fish_source sky > \ ~/.config/fish/completions/sky.fish' - reload_cmd = _RELOAD_FISH_CMD + # Fish does not need to be reloaded and will automatically pick up + # completions. + reload_cmd = None elif value == 'zsh': install_cmd = f'_SKY_COMPLETE=zsh_source sky > \ @@ -398,9 +399,10 @@ def _install_shell_completion(ctx: click.Context, param: click.Parameter, check=True, executable=shutil.which('bash')) click.secho(f'Shell completion installed for {value}', fg='green') - click.echo( - 'Completion will take effect once you restart the terminal: ' + - click.style(f'{reload_cmd}', bold=True)) + if reload_cmd is not None: + click.echo( + 'Completion will take effect once you restart the terminal: ' + + click.style(f'{reload_cmd}', bold=True)) except subprocess.CalledProcessError as e: click.secho(f'> Installation failed with code {e.returncode}', fg='red') ctx.exit() @@ -431,7 +433,9 @@ def _uninstall_shell_completion(ctx: click.Context, param: click.Parameter, elif value == 'fish': cmd = 'rm -f ~/.config/fish/completions/sky.fish' - reload_cmd = _RELOAD_FISH_CMD + # Fish does not need to be reloaded and will automatically pick up + # completions. + reload_cmd = None elif value == 'zsh': cmd = 'sed -i"" -e "/# For SkyPilot shell completion/d" ~/.zshrc && \ @@ -447,8 +451,10 @@ def _uninstall_shell_completion(ctx: click.Context, param: click.Parameter, try: subprocess.run(cmd, shell=True, check=True) click.secho(f'Shell completion uninstalled for {value}', fg='green') - click.echo('Changes will take effect once you restart the terminal: ' + - click.style(f'{reload_cmd}', bold=True)) + if reload_cmd is not None: + click.echo( + 'Changes will take effect once you restart the terminal: ' + + click.style(f'{reload_cmd}', bold=True)) except subprocess.CalledProcessError as e: click.secho(f'> Uninstallation failed with code {e.returncode}', fg='red') From 2df812d6fab2c44ef796f54951f17a21a49b1c10 Mon Sep 17 00:00:00 2001 From: Christopher Cooper Date: Wed, 23 Oct 2024 19:22:52 -0700 Subject: [PATCH 76/93] [k8s] allow use of "k8s" instead of "kubernetes" in the CLI and python API (#4151) * alias sky.K8s for sky.Kubernetes * add ability to alias in the cloud registry and add k8s alias for kubernetes * add tests * add test fixture * use from_str instead of adding a new method, create "canonical_name" * split out sky check test * fix monkeypatch for sky launch --cloud kubernetes * allow @cloud.CLOUD_REGISTRY.register without parens * address review comments Co-authored-by: Romil Bhardwaj --------- Co-authored-by: Romil Bhardwaj --- sky/__init__.py | 2 ++ sky/cli.py | 16 +++++---- sky/clouds/cloud.py | 4 +++ sky/clouds/cloud_registry.py | 65 ++++++++++++++++++++++++++++++------ sky/clouds/kubernetes.py | 2 +- tests/common.py | 3 ++ tests/test_api.py | 13 ++++++++ tests/test_cli.py | 38 ++++++++++++++++++++- 8 files changed, 124 insertions(+), 19 deletions(-) diff --git a/sky/__init__.py b/sky/__init__.py index 37b5a1caf08..b851775dabf 
100644 --- a/sky/__init__.py +++ b/sky/__init__.py @@ -128,6 +128,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]): Lambda = clouds.Lambda SCP = clouds.SCP Kubernetes = clouds.Kubernetes +K8s = Kubernetes OCI = clouds.OCI Paperspace = clouds.Paperspace RunPod = clouds.RunPod @@ -143,6 +144,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]): 'GCP', 'IBM', 'Kubernetes', + 'K8s', 'Lambda', 'OCI', 'Paperspace', diff --git a/sky/cli.py b/sky/cli.py index 087a2be5e5f..7ac26fd0714 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -3062,7 +3062,8 @@ def show_gpus( # This will validate 'cloud' and raise if not found. cloud_obj = sky_clouds.CLOUD_REGISTRY.from_str(cloud) - service_catalog.validate_region_zone(region, None, clouds=cloud) + cloud_name = cloud_obj.canonical_name() if cloud_obj is not None else None + service_catalog.validate_region_zone(region, None, clouds=cloud_name) show_all = all if show_all and accelerator_str is not None: raise click.UsageError('--all is only allowed without a GPU name.') @@ -3148,8 +3149,8 @@ def _output(): # Optimization - do not poll for Kubernetes API for fetching # common GPUs because that will be fetched later for the table after # common GPUs. - clouds_to_list = cloud - if cloud is None: + clouds_to_list = cloud_name + if cloud_name is None: clouds_to_list = [ c for c in service_catalog.ALL_CLOUDS if c != 'kubernetes' ] @@ -3159,7 +3160,8 @@ def _output(): # Collect k8s related messages in k8s_messages and print them at end print_section_titles = False # If cloud is kubernetes, we want to show real-time capacity - if kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes): + if kubernetes_is_enabled and (cloud_name is None or + cloud_is_kubernetes): if region: context = region else: @@ -3269,8 +3271,8 @@ def _output(): name, quantity = accelerator_str, None print_section_titles = False - if (kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes) and - not show_all): + if (kubernetes_is_enabled and + (cloud_name is None or cloud_is_kubernetes) and not show_all): # Print section title if not showing all and instead a specific # accelerator is requested print_section_titles = True @@ -3342,7 +3344,7 @@ def _output(): if len(result) == 0: quantity_str = (f' with requested quantity {quantity}' if quantity else '') - cloud_str = f' on {cloud_obj}.' if cloud else ' in cloud catalogs.' + cloud_str = f' on {cloud_obj}.' if cloud_name else ' in cloud catalogs.' 
yield f'Resources \'{name}\'{quantity_str} not found{cloud_str} ' yield 'To show available accelerators, run: sky show-gpus --all' return diff --git a/sky/clouds/cloud.py b/sky/clouds/cloud.py index dae1d56d309..3e21204f0a3 100644 --- a/sky/clouds/cloud.py +++ b/sky/clouds/cloud.py @@ -819,6 +819,10 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None: # === End of image related methods === + @classmethod + def canonical_name(cls) -> str: + return cls.__name__.lower() + def __repr__(self): return self._REPR diff --git a/sky/clouds/cloud_registry.py b/sky/clouds/cloud_registry.py index 5c4b10b9fd4..52a026aa330 100644 --- a/sky/clouds/cloud_registry.py +++ b/sky/clouds/cloud_registry.py @@ -1,7 +1,7 @@ """Clouds need to be registered in CLOUD_REGISTRY to be discovered""" import typing -from typing import Optional, Type +from typing import Callable, Dict, List, Optional, overload, Type, Union from sky.utils import ux_utils @@ -12,20 +12,65 @@ class _CloudRegistry(dict): """Registry of clouds.""" + def __init__(self) -> None: + super().__init__() + self.aliases: Dict[str, str] = {} + def from_str(self, name: Optional[str]) -> Optional['cloud.Cloud']: + """Returns the cloud instance from the canonical name or alias.""" if name is None: return None - if name.lower() not in self: - with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Cloud {name!r} is not a valid cloud among ' - f'{list(self.keys())}') - return self.get(name.lower()) + search_name = name.lower() + + if search_name in self: + return self[search_name] + + if search_name in self.aliases: + return self[self.aliases[search_name]] + + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Cloud {name!r} is not a valid cloud among ' + f'{[*self.keys(), *self.aliases.keys()]}') + + @overload def register(self, cloud_cls: Type['cloud.Cloud']) -> Type['cloud.Cloud']: - name = cloud_cls.__name__.lower() - assert name not in self, f'{name} already registered' - self[name] = cloud_cls() - return cloud_cls + ... + + @overload + def register( + self, + cloud_cls: None = None, + aliases: Optional[List[str]] = None, + ) -> Callable[[Type['cloud.Cloud']], Type['cloud.Cloud']]: + ... + + def register( + self, + cloud_cls: Optional[Type['cloud.Cloud']] = None, + aliases: Optional[List[str]] = None, + ) -> Union[Type['cloud.Cloud'], Callable[[Type['cloud.Cloud']], + Type['cloud.Cloud']]]: + + def _register(cloud_cls: Type['cloud.Cloud']) -> Type['cloud.Cloud']: + name = cloud_cls.canonical_name() + assert name not in self, f'{name} already registered' + self[name] = cloud_cls() + + for alias in aliases or []: + alias = alias.lower() + assert alias not in self.aliases, ( + f'alias {alias} already registered') + self.aliases[alias] = name + + return cloud_cls + + if cloud_cls is not None: + # invocation without parens (e.g. just `@register`) + return _register(cloud_cls) + + # Invocation with parens (e.g. 
`@register(aliases=['alias'])`) + return _register CLOUD_REGISTRY: _CloudRegistry = _CloudRegistry() diff --git a/sky/clouds/kubernetes.py b/sky/clouds/kubernetes.py index da85246e9ea..8ff4172a5b1 100644 --- a/sky/clouds/kubernetes.py +++ b/sky/clouds/kubernetes.py @@ -33,7 +33,7 @@ _SKYPILOT_SYSTEM_NAMESPACE = 'skypilot-system' -@clouds.CLOUD_REGISTRY.register +@clouds.CLOUD_REGISTRY.register(aliases=['k8s']) class Kubernetes(clouds.Cloud): """Kubernetes.""" diff --git a/tests/common.py b/tests/common.py index c6f08588d99..d41ff3bead0 100644 --- a/tests/common.py +++ b/tests/common.py @@ -70,6 +70,9 @@ def _get_az_mappings(_): lambda *_args, **_kwargs: [True, '']) monkeypatch.setattr('sky.provision.kubernetes.utils.get_spot_label', lambda *_args, **_kwargs: [None, None]) + monkeypatch.setattr( + 'sky.provision.kubernetes.utils.is_kubeconfig_exec_auth', + lambda *_args, **_kwargs: [False, None]) # monkeypatch class Kubernetes. monkeypatch.setattr( diff --git a/tests/test_api.py b/tests/test_api.py index 4d6658fcd05..5a33336dd92 100644 --- a/tests/test_api.py +++ b/tests/test_api.py @@ -1,7 +1,20 @@ import sky +from sky.clouds.cloud import Cloud def test_sky_launch(enable_all_clouds): task = sky.Task() job_id, handle = sky.launch(task, dryrun=True) assert job_id is None and handle is None + + +def test_k8s_alias(enable_all_clouds): + + def dryrun_task_with_cloud(cloud: Cloud): + task = sky.Task() + task.set_resources_override({'cloud': cloud}) + sky.launch(task, dryrun=True) + + dryrun_task_with_cloud(sky.K8s()) + + dryrun_task_with_cloud(sky.Kubernetes()) diff --git a/tests/test_cli.py b/tests/test_cli.py index 3a2417a6cde..36f2a6ea782 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -3,7 +3,6 @@ from click import testing as cli_testing -import sky from sky import exceptions import sky.cli as cli @@ -103,3 +102,40 @@ def test_show_gpus(): result = cli_runner.invoke(cli.show_gpus, ['V100:4', '--cloud', cloud, '--all']) assert isinstance(result.exception, SystemExit) + + +def test_k8s_alias_check(): + cli_runner = cli_testing.CliRunner() + + result = cli_runner.invoke(cli.check, ['k8s']) + assert not result.exit_code + + result = cli_runner.invoke(cli.check, ['kubernetes']) + assert not result.exit_code + + result = cli_runner.invoke(cli.check, ['notarealcloud']) + assert isinstance(result.exception, ValueError) + + +def test_k8s_alias(enable_all_clouds): + cli_runner = cli_testing.CliRunner() + + result = cli_runner.invoke(cli.launch, ['--cloud', 'k8s', '--dryrun']) + assert not result.exit_code + + result = cli_runner.invoke(cli.launch, + ['--cloud', 'kubernetes', '--dryrun']) + assert not result.exit_code + + result = cli_runner.invoke(cli.launch, + ['--cloud', 'notarealcloud', '--dryrun']) + assert isinstance(result.exception, ValueError) + + result = cli_runner.invoke(cli.show_gpus, ['--cloud', 'k8s']) + assert not result.exit_code + + result = cli_runner.invoke(cli.show_gpus, ['--cloud', 'kubernetes']) + assert not result.exit_code + + result = cli_runner.invoke(cli.show_gpus, ['--cloud', 'notarealcloud']) + assert isinstance(result.exception, ValueError) From cbf5c0022ad920edb4f41cfad65a2cf4909d5930 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Wed, 23 Oct 2024 22:05:44 -0700 Subject: [PATCH 77/93] [k8s] Add info on base images used for k8s (#4129) * add images * make links anonymous --- .../reference/kubernetes/kubernetes-getting-started.rst | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git 
a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst index d7313fba3e2..c1874b6b71f 100644 --- a/docs/source/reference/kubernetes/kubernetes-getting-started.rst +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -174,7 +174,12 @@ You can also inspect the real-time GPU usage on the cluster with :code:`sky show Using Custom Images ------------------- -By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. +By default, we maintain and use two SkyPilot container images for use on Kubernetes clusters: + +1. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot``: used for CPU-only clusters (`Dockerfile `__). +2. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu``: used for GPU clusters (`Dockerfile `__). + +These images are pre-installed with SkyPilot dependencies for fast startup. To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. From d6d339d6235faafc281fcb8bd05310ba0ba648b1 Mon Sep 17 00:00:00 2001 From: Andrew Aikawa Date: Thu, 24 Oct 2024 09:48:45 -0700 Subject: [PATCH 78/93] multithreaded ssh setup (#4158) * multithread ssh * parallelize k8s ssh setup * fix hang and imap tuple * patch * lint --------- Co-authored-by: Ubuntu --- sky/provision/kubernetes/instance.py | 6 ++++-- sky/provision/provisioner.py | 24 ++++++++++++++++-------- 2 files changed, 20 insertions(+), 10 deletions(-) diff --git a/sky/provision/kubernetes/instance.py b/sky/provision/kubernetes/instance.py index 6663ed3f657..6ce7b74d18e 100644 --- a/sky/provision/kubernetes/instance.py +++ b/sky/provision/kubernetes/instance.py @@ -18,6 +18,7 @@ from sky.utils import command_runner from sky.utils import common_utils from sky.utils import kubernetes_enums +from sky.utils import subprocess_utils from sky.utils import ux_utils POLL_INTERVAL = 2 @@ -398,8 +399,7 @@ def _setup_ssh_in_pods(namespace: str, context: Optional[str], # See https://www.educative.io/answers/error-mesg-ttyname-failed-inappropriate-ioctl-for-device # pylint: disable=line-too-long '$(prefix_cmd) sed -i "s/mesg n/tty -s \\&\\& mesg n/" ~/.profile;') - # TODO(romilb): Parallelize the setup of SSH in pods for multi-node clusters - for new_node in new_nodes: + def _setup_ssh_thread(new_node): pod_name = new_node.metadata.name runner = command_runner.KubernetesCommandRunner( ((namespace, context), pod_name)) @@ -411,6 +411,8 @@ def _setup_ssh_in_pods(namespace: str, context: Optional[str], stdout) logger.info(f'{"-"*20}End: Set up SSH in pod {pod_name!r} {"-"*20}') + subprocess_utils.run_in_parallel(_setup_ssh_thread, new_nodes) + def _label_pod(namespace: str, context: Optional[str], pod_name: str, label: Dict[str, str]) -> None: diff --git a/sky/provision/provisioner.py b/sky/provision/provisioner.py index 7706a3d489b..b3e965769c9 100644 --- a/sky/provision/provisioner.py +++ b/sky/provision/provisioner.py @@ -28,6 +28,7 @@ from sky.utils import common_utils from sky.utils import resources_utils from sky.utils import rich_utils +from sky.utils import subprocess_utils from sky.utils import ux_utils # Do not use __name__ as we do not want to propagate logs to sky.provision, @@ -365,14 +366,13 @@ def wait_for_ssh(cluster_info: provision_common.ClusterInfo, # use a queue for SSH querying ips = collections.deque(ip_list) ssh_ports = collections.deque(port_list) - while ips: - ip = ips.popleft() - ssh_port = 
ssh_ports.popleft() - success, stderr = waiter(ip, ssh_port, **ssh_credentials) - if not success: - ips.append(ip) - ssh_ports.append(ssh_port) - if time.time() - start > timeout: + + def _retry_ssh_thread(ip_ssh_port: Tuple[str, int]): + ip, ssh_port = ip_ssh_port + success = False + while not success: + success, stderr = waiter(ip, ssh_port, **ssh_credentials) + if not success and time.time() - start > timeout: with ux_utils.print_exception_no_traceback(): raise RuntimeError( f'Failed to SSH to {ip} after timeout {timeout}s, with ' @@ -380,6 +380,14 @@ def wait_for_ssh(cluster_info: provision_common.ClusterInfo, logger.debug('Retrying in 1 second...') time.sleep(1) + # try one node and multiprocess the rest + if ips: + ip = ips.popleft() + ssh_port = ssh_ports.popleft() + _retry_ssh_thread((ip, ssh_port)) + subprocess_utils.run_in_parallel(_retry_ssh_thread, + list(zip(ips, ssh_ports))) + def _post_provision_setup( cloud_name: str, cluster_name: resources_utils.ClusterName, From 6dc386bab615bdb9fd83dc10722947de7c2056c5 Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Thu, 24 Oct 2024 11:25:18 -0700 Subject: [PATCH 79/93] [k8s] Rename show-gpus field to `REQUESTABLE_QTY_PER_NODE` (#4162) Update to REQUESTABLE_QTY_PER_NODE --- .../kubernetes/kubernetes-deployment.rst | 20 ++++++++++++++----- .../kubernetes/kubernetes-getting-started.rst | 6 +++--- .../reference/kubernetes/kubernetes-setup.rst | 6 +++--- .../source/reservations/existing-machines.rst | 6 +++--- examples/k8s_cloud_deploy/README.md | 4 ++-- sky/cli.py | 2 +- .../service_catalog/kubernetes_catalog.py | 8 +++++++- 7 files changed, 34 insertions(+), 18 deletions(-) diff --git a/docs/source/reference/kubernetes/kubernetes-deployment.rst b/docs/source/reference/kubernetes/kubernetes-deployment.rst index e9489e9149e..d3891b3df51 100644 --- a/docs/source/reference/kubernetes/kubernetes-deployment.rst +++ b/docs/source/reference/kubernetes/kubernetes-deployment.rst @@ -147,10 +147,16 @@ Deploying on Google Cloud GKE .. code-block:: console $ sky show-gpus --cloud kubernetes - GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS - L4 1, 2, 3, 4 8 6 - A100 1, 2 4 2 + GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS + L4 1, 2, 4 8 6 + A100 1, 2 4 2 + Kubernetes per node GPU availability + NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS + my-cluster-0 L4 4 4 + my-cluster-1 L4 4 2 + my-cluster-2 A100 2 2 + my-cluster-3 A100 2 0 .. note:: GKE autopilot clusters are currently not supported. Only GKE standard clusters are supported. @@ -196,8 +202,12 @@ Deploying on Amazon EKS .. code-block:: console $ sky show-gpus --cloud kubernetes - GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS - A100 1, 2 4 2 + GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS + A100 1, 2 4 2 + + Kubernetes per node GPU availability + NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS + my-cluster-0 A100 2 2 .. 
_kubernetes-setup-onprem: diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst index c1874b6b71f..9d46acf13c0 100644 --- a/docs/source/reference/kubernetes/kubernetes-getting-started.rst +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -156,9 +156,9 @@ You can also inspect the real-time GPU usage on the cluster with :code:`sky show $ sky show-gpus --cloud kubernetes Kubernetes GPUs - GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS - L4 1, 2, 4 12 12 - H100 1, 2, 4, 8 16 16 + GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS + L4 1, 2, 4 12 12 + H100 1, 2, 4, 8 16 16 Kubernetes per node GPU availability NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS diff --git a/docs/source/reference/kubernetes/kubernetes-setup.rst b/docs/source/reference/kubernetes/kubernetes-setup.rst index a827d49ea19..3621d1b5338 100644 --- a/docs/source/reference/kubernetes/kubernetes-setup.rst +++ b/docs/source/reference/kubernetes/kubernetes-setup.rst @@ -262,9 +262,9 @@ You can also check the GPUs available on your nodes by running: $ sky show-gpus --cloud kubernetes Kubernetes GPUs - GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS - L4 1, 2, 4 12 12 - H100 1, 2, 4, 8 16 16 + GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS + L4 1, 2, 4 12 12 + H100 1, 2, 4, 8 16 16 Kubernetes per node GPU availability NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS diff --git a/docs/source/reservations/existing-machines.rst b/docs/source/reservations/existing-machines.rst index 2f9ac2a2441..d8d3fb81e67 100644 --- a/docs/source/reservations/existing-machines.rst +++ b/docs/source/reservations/existing-machines.rst @@ -108,9 +108,9 @@ Deploying SkyPilot $ sky show-gpus --cloud kubernetes Kubernetes GPUs - GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS - L4 1, 2, 4 12 12 - H100 1, 2, 4, 8 16 16 + GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS + L4 1, 2, 4 12 12 + H100 1, 2, 4, 8 16 16 Kubernetes per node GPU availability NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS diff --git a/examples/k8s_cloud_deploy/README.md b/examples/k8s_cloud_deploy/README.md index 64519e2fa53..5ba42cbe836 100644 --- a/examples/k8s_cloud_deploy/README.md +++ b/examples/k8s_cloud_deploy/README.md @@ -44,8 +44,8 @@ NAME STATUS ROLES AGE VERSION $ sky show-gpus --cloud kubernetes Kubernetes GPUs -GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS -A10 1 2 2 +GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS +A10 1 2 2 Kubernetes per node GPU availability NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS diff --git a/sky/cli.py b/sky/cli.py index 7ac26fd0714..6e0587cc117 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -3085,7 +3085,7 @@ def _get_kubernetes_realtime_gpu_table( qty_header = 'QTY_FILTER' free_header = 'FILTERED_FREE_GPUS' else: - qty_header = 'QTY_PER_NODE' + qty_header = 'REQUESTABLE_QTY_PER_NODE' free_header = 'TOTAL_FREE_GPUS' realtime_gpu_table = log_utils.create_table( ['GPU', qty_header, 'TOTAL_GPUS', free_header]) diff --git a/sky/clouds/service_catalog/kubernetes_catalog.py b/sky/clouds/service_catalog/kubernetes_catalog.py index 24daeabf9d4..2d0cdbf7cf6 100644 --- a/sky/clouds/service_catalog/kubernetes_catalog.py +++ b/sky/clouds/service_catalog/kubernetes_catalog.py @@ -120,8 +120,14 @@ def list_accelerators_realtime( # Generate the GPU quantities for the accelerators if accelerator_name and accelerator_count > 0: - for count in range(1, accelerator_count + 1): + count = 1 + while count <= accelerator_count: accelerators_qtys.add((accelerator_name, 
count)) + count *= 2 + # Add the accelerator count if it's not already in the set + # (e.g., if there's 12 GPUs, we should have qtys 1, 2, 4, 8, 12) + if accelerator_count not in accelerators_qtys: + accelerators_qtys.add((accelerator_name, accelerator_count)) for pod in pods: # Get all the pods running on the node From 4adda4158cca5599537677283c6ebc746ec5c989 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Thu, 24 Oct 2024 15:16:59 -0700 Subject: [PATCH 80/93] Fix type checking issues in jobs/ and serve/ directories (#4161) * fix some linter errors * refactor: use `Sequence` for more accurate typing and for covariance --- sky/exceptions.py | 8 ++++---- sky/jobs/recovery_strategy.py | 6 +++--- sky/usage/usage_lib.py | 5 +++-- sky/utils/common_utils.py | 5 +++-- 4 files changed, 13 insertions(+), 11 deletions(-) diff --git a/sky/exceptions.py b/sky/exceptions.py index 066d36c3cf3..f78c6605261 100644 --- a/sky/exceptions.py +++ b/sky/exceptions.py @@ -1,7 +1,7 @@ """Exceptions.""" import enum import typing -from typing import List, Optional +from typing import List, Optional, Sequence if typing.TYPE_CHECKING: from sky import status_lib @@ -61,12 +61,12 @@ class ProvisionPrechecksError(Exception): the error will be raised. Args: - reasons: (List[Exception]) The reasons why the prechecks failed. + reasons: (Sequence[Exception]) The reasons why the prechecks failed. """ - def __init__(self, reasons: List[Exception]) -> None: + def __init__(self, reasons: Sequence[Exception]) -> None: super().__init__() - self.reasons = list(reasons) + self.reasons = reasons class ManagedJobReachedMaxRetriesError(Exception): diff --git a/sky/jobs/recovery_strategy.py b/sky/jobs/recovery_strategy.py index 2a32aa3b24e..6a931240646 100644 --- a/sky/jobs/recovery_strategy.py +++ b/sky/jobs/recovery_strategy.py @@ -24,6 +24,7 @@ from sky.utils import ux_utils if typing.TYPE_CHECKING: + from sky import resources from sky import task as task_lib logger = sky_logging.init_logger(__name__) @@ -327,8 +328,7 @@ def _launch(self, 'Failure happened before provisioning. Failover ' f'reasons: {reasons_str}') if raise_on_failure: - raise exceptions.ProvisionPrechecksError( - reasons=reasons) + raise exceptions.ProvisionPrechecksError(reasons) return None logger.info('Failed to launch a cluster with error: ' f'{common_utils.format_exception(e)})') @@ -382,7 +382,7 @@ def __init__(self, cluster_name: str, backend: 'backends.Backend', # first retry in the same cloud/region. (Inside recover() we may not # rely on cluster handle, as it can be None if the cluster is # preempted.) 
-        self._launched_resources: Optional['sky.resources.Resources'] = None
+        self._launched_resources: Optional['resources.Resources'] = None

     def _launch(self,
                 max_retry: Optional[int] = 3,
diff --git a/sky/usage/usage_lib.py b/sky/usage/usage_lib.py
index a6c10da5c7a..07867939ee5 100644
--- a/sky/usage/usage_lib.py
+++ b/sky/usage/usage_lib.py
@@ -432,8 +432,9 @@ def entrypoint_context(name: str, fallback: bool = False):
         with ux_utils.enable_traceback():
             trace = traceback.format_exc()
             messages.usage.stacktrace = trace
-        if hasattr(e, 'detailed_reason') and e.detailed_reason is not None:
-            messages.usage.stacktrace += '\nDetails: ' + e.detailed_reason
+        detailed_reason = getattr(e, 'detailed_reason', None)
+        if detailed_reason is not None:
+            messages.usage.stacktrace += '\nDetails: ' + detailed_reason
         messages.usage.exception = common_utils.remove_color(
             common_utils.format_exception(e))
         raise
diff --git a/sky/utils/common_utils.py b/sky/utils/common_utils.py
index 6383ee8af0d..5fce435b770 100644
--- a/sky/utils/common_utils.py
+++ b/sky/utils/common_utils.py
@@ -362,7 +362,6 @@ def _wrapper(f):

         @functools.wraps(f)
         def _record(*args, **kwargs):
-            nonlocal name_or_fn
             with cls(name_or_fn, **ctx_kwargs):
                 return f(*args, **kwargs)

@@ -376,7 +375,6 @@ def _record(*args, **kwargs):

     @functools.wraps(name_or_fn)
     def _record(*args, **kwargs):
-        nonlocal name_or_fn
         f = name_or_fn
         func_name = getattr(f, '__qualname__', f.__name__)
         module_name = getattr(f, '__module__', '')
@@ -579,7 +577,10 @@ def validate_schema(obj, schema, err_msg_prefix='', skip_none=True):
                 e.message)
         else:
             err_msg = err_msg_prefix
+            assert isinstance(e.schema, dict), 'Schema must be a dictionary'
             known_fields = set(e.schema.get('properties', {}).keys())
+            assert isinstance(e.instance,
+                              dict), 'Instance must be a dictionary'
             for field in e.instance:
                 if field not in known_fields:
                     most_similar_field = difflib.get_close_matches(
From e832dde2c5a7f9ba9e141afad874054deb15732c Mon Sep 17 00:00:00 2001
From: Wenjie Ma <55629401+euclidgame@users.noreply.github.com>
Date: Thu, 24 Oct 2024 16:47:58 -0700
Subject: [PATCH 81/93] [Serve] Make controller regions/zones choose from replica resources (#4053)

* Add controller regions.
* Consider regions and zones
* Revert cloud in task.yaml
* Change the format
* Remove one for loop and change the placeholder
* Add unit test for get_controller_resources
* Correct the number of tests
* Use explicit loop for return value
* Add some resources to test
* Change the early return logic in get_controller_resources
* Change some comments
* Change default value of region
* Some nits
* Add types for parameters and return values in test.

---------

Co-authored-by: Wenjie Ma
---
 sky/utils/controller_utils.py             |  87 +++++++++++++----
 tests/unit_tests/test_controller_utils.py | 114 ++++++++++++++++++----
 2 files changed, 165 insertions(+), 36 deletions(-)

diff --git a/sky/utils/controller_utils.py b/sky/utils/controller_utils.py
index 0c71357c856..0ab2fd7e117 100644
--- a/sky/utils/controller_utils.py
+++ b/sky/utils/controller_utils.py
@@ -505,20 +505,17 @@ def get_controller_resources(
     if handle is not None:
         controller_resources_to_use = handle.launched_resources

-    if controller_resources_to_use.cloud is not None:
-        return {controller_resources_to_use}
+    # If the controller and replicas are from the same cloud (and region/zone),
+    # it should provide better connectivity.
We will let the controller choose + # from the clouds (and regions/zones) of the resources if the user does not + # specify the cloud (and region/zone) for the controller. - # If the controller and replicas are from the same cloud, it should - # provide better connectivity. We will let the controller choose from - # the clouds of the resources if the controller does not exist. - # TODO(tian): Consider respecting the regions/zones specified for the - # resources as well. - requested_clouds: Set['clouds.Cloud'] = set() + requested_clouds_with_region_zone: Dict[str, Dict[Optional[str], + Set[Optional[str]]]] = {} for resource in task_resources: - # cloud is an object and will not be able to be distinguished by set. - # Here we manually check if the cloud is in the set. if resource.cloud is not None: - if not clouds.cloud_in_iterable(resource.cloud, requested_clouds): + cloud_name = str(resource.cloud) + if cloud_name not in requested_clouds_with_region_zone: try: resource.cloud.check_features_are_supported( resources.Resources(), @@ -526,7 +523,26 @@ def get_controller_resources( except exceptions.NotSupportedError: # Skip the cloud if it does not support hosting controllers. continue - requested_clouds.add(resource.cloud) + requested_clouds_with_region_zone[cloud_name] = {} + if resource.region is None: + # If one of the resource.region is None, this could represent + # that the user is unsure about which region the resource is + # hosted in. In this case, we allow any region for this cloud. + requested_clouds_with_region_zone[cloud_name] = {None: {None}} + elif None not in requested_clouds_with_region_zone[cloud_name]: + if resource.region not in requested_clouds_with_region_zone[ + cloud_name]: + requested_clouds_with_region_zone[cloud_name][ + resource.region] = set() + # If one of the resource.zone is None, allow any zone in the + # region. + if resource.zone is None: + requested_clouds_with_region_zone[cloud_name][ + resource.region] = {None} + elif None not in requested_clouds_with_region_zone[cloud_name][ + resource.region]: + requested_clouds_with_region_zone[cloud_name][ + resource.region].add(resource.zone) else: # if one of the resource.cloud is None, this could represent user # does not know which cloud is best for the specified resources. @@ -536,14 +552,49 @@ def get_controller_resources( # - cloud: runpod # accelerators: A40 # In this case, we allow the controller to be launched on any cloud. - requested_clouds.clear() + requested_clouds_with_region_zone.clear() break - if not requested_clouds: + + # Extract filtering criteria from the controller resources specified by the + # user. + controller_cloud = str( + controller_resources_to_use.cloud + ) if controller_resources_to_use.cloud is not None else None + controller_region = controller_resources_to_use.region + controller_zone = controller_resources_to_use.zone + + # Filter clouds if controller_resources_to_use.cloud is specified. + filtered_clouds = ({controller_cloud} if controller_cloud is not None else + requested_clouds_with_region_zone.keys()) + + # Filter regions and zones and construct the result. + result: Set[resources.Resources] = set() + for cloud_name in filtered_clouds: + regions = requested_clouds_with_region_zone.get(cloud_name, + {None: {None}}) + + # Filter regions if controller_resources_to_use.region is specified. 
+ filtered_regions = ({controller_region} if controller_region is not None + else regions.keys()) + + for region in filtered_regions: + zones = regions.get(region, {None}) + + # Filter zones if controller_resources_to_use.zone is specified. + filtered_zones = ({controller_zone} + if controller_zone is not None else zones) + + # Create combinations of cloud, region, and zone. + for zone in filtered_zones: + resource_copy = controller_resources_to_use.copy( + cloud=clouds.CLOUD_REGISTRY.from_str(cloud_name), + region=region, + zone=zone) + result.add(resource_copy) + + if not result: return {controller_resources_to_use} - return { - controller_resources_to_use.copy(cloud=controller_cloud) - for controller_cloud in requested_clouds - } + return result def _setup_proxy_command_on_controller( diff --git a/tests/unit_tests/test_controller_utils.py b/tests/unit_tests/test_controller_utils.py index 7465f648385..f41c7413bc1 100644 --- a/tests/unit_tests/test_controller_utils.py +++ b/tests/unit_tests/test_controller_utils.py @@ -1,5 +1,5 @@ """Test the controller_utils module.""" -from typing import Any, Dict +from typing import Any, Dict, Optional, Set, Tuple import pytest @@ -65,6 +65,24 @@ def get_custom_controller_resources(keys, default): controller_resources_config, k, v) +def _check_controller_resources( + controller_resources: Set[sky.Resources], + expected_combinations: Set[Tuple[Optional[str], Optional[str], + Optional[str]]], + default_controller_resources: Dict[str, Any]) -> None: + """Helper function to check that the controller resources match the + expected combinations.""" + for r in controller_resources: + config = r.to_yaml_config() + cloud = config.pop('cloud') + region = config.pop('region', None) + zone = config.pop('zone', None) + assert (cloud, region, zone) in expected_combinations + expected_combinations.remove((cloud, region, zone)) + assert config == default_controller_resources, config + assert not expected_combinations + + @pytest.mark.parametrize(('controller_type', 'default_controller_resources'), [ ('jobs', managed_job_constants.CONTROLLER_RESOURCES), ('serve', serve_constants.CONTROLLER_RESOURCES), @@ -79,17 +97,12 @@ def test_get_controller_resources_with_task_resources( # could host controllers. Return a set, each item has # one cloud specified plus the default resources. all_clouds = {sky.AWS(), sky.GCP(), sky.Azure()} - all_cloud_names = {str(c) for c in all_clouds} + expected_combinations = {(str(c), None, None) for c in all_clouds} controller_resources = controller_utils.get_controller_resources( controller=controller_utils.Controllers.from_type(controller_type), task_resources=[sky.Resources(cloud=c) for c in all_clouds]) - for r in controller_resources: - config = r.to_yaml_config() - cloud = config.pop('cloud') - assert cloud in all_cloud_names - all_cloud_names.remove(cloud) - assert config == default_controller_resources, config - assert not all_cloud_names + _check_controller_resources(controller_resources, expected_combinations, + default_controller_resources) # 2. All resources has cloud specified. Some of them # could NOT host controllers. 
Return a set, only @@ -113,19 +126,14 @@ def _could_host_controllers(cloud: sky.clouds.Cloud) -> bool: return False return True - all_cloud_names_expected = { - str(c) for c in all_clouds if _could_host_controllers(c) + expected_combinations = { + (str(c), None, None) for c in all_clouds if _could_host_controllers(c) } controller_resources = controller_utils.get_controller_resources( controller=controller_utils.Controllers.from_type(controller_type), task_resources=[sky.Resources(cloud=c) for c in all_clouds]) - for r in controller_resources: - config = r.to_yaml_config() - cloud = config.pop('cloud') - assert cloud in all_cloud_names_expected - all_cloud_names_expected.remove(cloud) - assert config == default_controller_resources, config - assert not all_cloud_names_expected + _check_controller_resources(controller_resources, expected_combinations, + default_controller_resources) # 3. Some resources does not have cloud specified. # Return the default resources. @@ -138,3 +146,73 @@ def _could_host_controllers(cloud: sky.clouds.Cloud) -> bool: assert len(controller_resources) == 1 config = list(controller_resources)[0].to_yaml_config() assert config == default_controller_resources, config + + # 4. All resources have clouds, regions, and zones specified. + # Return a set of controller resources for all combinations of clouds, + # regions, and zones. Each combination should contain the default resources + # along with the cloud, region, and zone. + all_cloud_regions_zones = [ + sky.Resources(cloud=sky.AWS(), region='us-east-1', zone='us-east-1a'), + sky.Resources(cloud=sky.AWS(), region='ap-south-1', zone='ap-south-1b'), + sky.Resources(cloud=sky.GCP(), + region='us-central1', + zone='us-central1-a'), + sky.Resources(cloud=sky.GCP(), + region='europe-west1', + zone='europe-west1-b') + ] + expected_combinations = {('AWS', 'us-east-1', 'us-east-1a'), + ('AWS', 'ap-south-1', 'ap-south-1b'), + ('GCP', 'us-central1', 'us-central1-a'), + ('GCP', 'europe-west1', 'europe-west1-b')} + controller_resources = controller_utils.get_controller_resources( + controller=controller_utils.Controllers.from_type(controller_type), + task_resources=all_cloud_regions_zones) + _check_controller_resources(controller_resources, expected_combinations, + default_controller_resources) + + # 5. Clouds and regions are specified, but zones are partially specified. + # Return a set containing combinations where the zone is None when not all + # zones are specified in the input for the given region. The default + # resources should be returned along with the cloud and region, and the + # zone (if specified). + controller_resources = controller_utils.get_controller_resources( + controller=controller_utils.Controllers.from_type(controller_type), + task_resources=[ + sky.Resources(cloud=sky.AWS(), region='us-west-2'), + sky.Resources(cloud=sky.AWS(), + region='us-west-2', + zone='us-west-2b'), + sky.Resources(cloud=sky.GCP(), + region='us-central1', + zone='us-central1-a') + ]) + expected_combinations = {('AWS', 'us-west-2', None), + ('GCP', 'us-central1', 'us-central1-a')} + _check_controller_resources(controller_resources, expected_combinations, + default_controller_resources) + + # 6. Mixed case: Some resources have clouds and regions or zones, others do + # not. For clouds where regions or zones are not specified in the input, + # return None for those fields. The default resources should be returned + # along with the cloud, region (if specified), and zone (if specified). 
+ controller_resources = controller_utils.get_controller_resources( + controller=controller_utils.Controllers.from_type(controller_type), + task_resources=[ + sky.Resources(cloud=sky.GCP(), region='europe-west1'), + sky.Resources(cloud=sky.GCP()), + sky.Resources(cloud=sky.AWS(), + region='eu-north-1', + zone='eu-north-1a'), + sky.Resources(cloud=sky.AWS(), region='eu-north-1'), + sky.Resources(cloud=sky.AWS(), region='ap-south-1'), + sky.Resources(cloud=sky.Azure()), + ]) + expected_combinations = { + ('AWS', 'eu-north-1', None), + ('AWS', 'ap-south-1', None), + ('GCP', None, None), + ('Azure', None, None), + } + _check_controller_resources(controller_resources, expected_combinations, + default_controller_resources) From 13ad9169bd81afa5aab610d7a308a39ddf81d074 Mon Sep 17 00:00:00 2001 From: Yika Date: Thu, 24 Oct 2024 18:48:57 -0700 Subject: [PATCH 82/93] Upload all cloud credentials to sky cluster regardless of sky check (#4165) * Upload all cloud credentials to sky cluster regardless of sky check * address comments * nit Co-authored-by: Zhanghao Wu --------- Co-authored-by: Zhanghao Wu --- sky/check.py | 15 +++++++++++---- sky/clouds/oci.py | 2 +- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/sky/check.py b/sky/check.py index 9ac2848733c..dcaa349d234 100644 --- a/sky/check.py +++ b/sky/check.py @@ -1,4 +1,5 @@ """Credential checks: check cloud credentials and enable clouds.""" +import os import traceback from types import ModuleType from typing import Dict, Iterable, List, Optional, Tuple, Union @@ -194,19 +195,25 @@ def get_cached_enabled_clouds_or_refresh( def get_cloud_credential_file_mounts( excluded_clouds: Optional[Iterable[sky_clouds.Cloud]] ) -> Dict[str, str]: - """Returns the files necessary to access all enabled clouds. + """Returns the files necessary to access all clouds. Returns a dictionary that will be added to a task's file mounts and a list of patterns that will be excluded (used as rsync_exclude). """ - enabled_clouds = get_cached_enabled_clouds_or_refresh() + # Uploading credentials for all clouds instead of only sky check + # enabled clouds because users may have partial credentials for some + # clouds to access their specific resources (e.g. cloud storage) but + # not have the complete credentials to pass sky check. + clouds = sky_clouds.CLOUD_REGISTRY.values() file_mounts = {} - for cloud in enabled_clouds: + for cloud in clouds: if (excluded_clouds is not None and sky_clouds.cloud_in_iterable(cloud, excluded_clouds)): continue cloud_file_mounts = cloud.get_credential_file_mounts() - file_mounts.update(cloud_file_mounts) + for remote_path, local_path in cloud_file_mounts.items(): + if os.path.exists(os.path.expanduser(local_path)): + file_mounts[remote_path] = local_path # Currently, get_cached_enabled_clouds_or_refresh() does not support r2 as # only clouds with computing instances are marked as enabled by skypilot. # This will be removed when cloudflare/r2 is added as a 'cloud'. 
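The check.py hunk above changes get_cloud_credential_file_mounts() to consider every registered cloud and to mount a credential file only when it actually exists on the local machine, rather than consulting the sky check-enabled cloud list. A minimal sketch of that existence-filtering pattern is shown below; the filter_existing_mounts helper and the candidate paths are hypothetical illustrations, not SkyPilot's actual per-cloud credential lists.

```python
import os
from typing import Dict


def filter_existing_mounts(candidate_mounts: Dict[str, str]) -> Dict[str, str]:
    """Keep only the mounts whose local source path exists.

    `candidate_mounts` maps remote paths to local paths, mirroring the
    dictionaries returned by each cloud's get_credential_file_mounts()
    in the hunk above.
    """
    return {
        remote: local
        for remote, local in candidate_mounts.items()
        if os.path.exists(os.path.expanduser(local))
    }


# Hypothetical candidate files: only the ones present locally are kept,
# so partial credentials still result in a usable set of file mounts.
print(
    filter_existing_mounts({
        '~/.aws/credentials': '~/.aws/credentials',
        '~/.config/gcloud': '~/.config/gcloud',
    }))
```

This is also why the hunk drops the get_cached_enabled_clouds_or_refresh() call: a cloud that fails sky check may still hold partial credentials (for example, storage-only access) that are worth uploading to the cluster.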
diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index 810e43fe3b5..c6451a73a1f 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -468,7 +468,7 @@ def get_credential_file_mounts(self) -> Dict[str, str]: api_key_file = oci_cfg[ 'key_file'] if 'key_file' in oci_cfg else 'BadConf' sky_cfg_file = oci_utils.oci_config.get_sky_user_config_file() - except ImportError: + except (ImportError, oci_adaptor.oci.exceptions.ConfigFileNotFound): return {} # OCI config and API key file are mandatory From 149713e9acde77df6d40ee17cd2cd6bd349d5edc Mon Sep 17 00:00:00 2001 From: Yika Date: Thu, 24 Oct 2024 20:06:02 -0700 Subject: [PATCH 83/93] [Performance] Allow users to pass in Azure community images at --image-id (#4145) * Allow users to pass in community image as image-id * Add image fallback * Address comments * address comments * Resolve region failover * address comments --- sky/clouds/azure.py | 136 +++++++++++++------- sky/clouds/service_catalog/azure_catalog.py | 15 +++ sky/clouds/utils/azure_utils.py | 91 +++++++++++++ sky/resources.py | 1 + tests/unit_tests/test_azure_utils.py | 21 +++ 5 files changed, 214 insertions(+), 50 deletions(-) create mode 100644 sky/clouds/utils/azure_utils.py create mode 100644 tests/unit_tests/test_azure_utils.py diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py index adffd32ad88..d91f589ca8f 100644 --- a/sky/clouds/azure.py +++ b/sky/clouds/azure.py @@ -15,6 +15,7 @@ from sky import sky_logging from sky.adaptors import azure from sky.clouds import service_catalog +from sky.clouds.utils import azure_utils from sky.utils import common_utils from sky.utils import resources_utils from sky.utils import ux_utils @@ -36,6 +37,15 @@ _DEFAULT_AZURE_UBUNTU_HPC_IMAGE_GB = 30 _DEFAULT_AZURE_UBUNTU_2004_IMAGE_GB = 150 +_DEFAULT_SKYPILOT_IMAGE_GB = 30 + +_DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' +_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' +_DEFAULT_V1_IMAGE_ID = 'skypilot:v1-ubuntu-2004' +_DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-ubuntu-2004' +_FALLBACK_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' + +_COMMUNITY_IMAGE_PREFIX = '/CommunityGalleries' def _run_output(cmd): @@ -132,29 +142,56 @@ def get_egress_cost(self, num_gigabytes: float): cost += 0.0 return cost + @classmethod + def get_default_instance_type( + cls, + cpus: Optional[str] = None, + memory: Optional[str] = None, + disk_tier: Optional[resources_utils.DiskTier] = None + ) -> Optional[str]: + return service_catalog.get_default_instance_type(cpus=cpus, + memory=memory, + disk_tier=disk_tier, + clouds='azure') + @classmethod def get_image_size(cls, image_id: str, region: Optional[str]) -> float: - if region is None: - # The region used here is only for where to send the query, - # not the image location. Azure's image is globally available. - region = 'eastus' - is_skypilot_image_tag = False + # Process skypilot images. if image_id.startswith('skypilot:'): - is_skypilot_image_tag = True image_id = service_catalog.get_image_id_from_tag(image_id, clouds='azure') - image_id_splitted = image_id.split(':') - if len(image_id_splitted) != 4: - with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Invalid image id: {image_id}. Expected ' - 'format: :::') - publisher, offer, sku, version = image_id_splitted - if is_skypilot_image_tag: - if offer == 'ubuntu-hpc': - return _DEFAULT_AZURE_UBUNTU_HPC_IMAGE_GB + if image_id.startswith(_COMMUNITY_IMAGE_PREFIX): + # Avoid querying the image size from Azure as + # all skypilot custom images have the same size. 
+ return _DEFAULT_SKYPILOT_IMAGE_GB else: - return _DEFAULT_AZURE_UBUNTU_2004_IMAGE_GB + publisher, offer, sku, version = image_id.split(':') + if offer == 'ubuntu-hpc': + return _DEFAULT_AZURE_UBUNTU_HPC_IMAGE_GB + else: + return _DEFAULT_AZURE_UBUNTU_2004_IMAGE_GB + + # Process user-specified images. + azure_utils.validate_image_id(image_id) compute_client = azure.get_client('compute', cls.get_project_id()) + + # Community gallery image. + if image_id.startswith(_COMMUNITY_IMAGE_PREFIX): + if region is None: + return 0.0 + _, _, gallery_name, _, image_name = image_id.split('/') + try: + return azure_utils.get_community_image_size( + compute_client, gallery_name, image_name, region) + except exceptions.ResourcesUnavailableError: + return 0.0 + + # Marketplace image + if region is None: + # The region used here is only for where to send the query, + # not the image location. Marketplace image is globally available. + region = 'eastus' + publisher, offer, sku, version = image_id.split(':') try: image = compute_client.virtual_machine_images.get( region, publisher, offer, sku, version) @@ -176,40 +213,23 @@ def get_image_size(cls, image_id: str, region: Optional[str]) -> float: size_in_gb = size_in_bytes / (1024**3) return size_in_gb - @classmethod - def get_default_instance_type( - cls, - cpus: Optional[str] = None, - memory: Optional[str] = None, - disk_tier: Optional[resources_utils.DiskTier] = None - ) -> Optional[str]: - return service_catalog.get_default_instance_type(cpus=cpus, - memory=memory, - disk_tier=disk_tier, - clouds='azure') - def _get_default_image_tag(self, gen_version, instance_type) -> str: # ubuntu-2004 v21.08.30, K80 requires image with old NVIDIA driver version acc = self.get_accelerators_from_instance_type(instance_type) if acc is not None: acc_name = list(acc.keys())[0] if acc_name == 'K80': - return 'skypilot:k80-ubuntu-2004' - - # ubuntu-2004 v21.11.04, the previous image we used in the past for - # V1 HyperV instance before we change default image to ubuntu-hpc. + return _DEFAULT_GPU_K80_IMAGE_ID + # About Gen V1 vs V2: # In Azure, all instances with K80 (Standard_NC series), some # instances with M60 (Standard_NV series) and some cpu instances - # (Basic_A, Standard_D, ...) are V1 instance. For these instances, - # we use the previous image. + # (Basic_A, Standard_D, ...) are V1 instance. + # All A100 instances are V2. if gen_version == 'V1': - return 'skypilot:v1-ubuntu-2004' - - # nvidia-driver: 535.54.03, cuda: 12.2 - # see: https://github.com/Azure/azhpc-images/releases/tag/ubuntu-hpc-20230803 - # All A100 instances is of gen2, so it will always use - # the latest ubuntu-hpc:2204 image. - return 'skypilot:gpu-ubuntu-2204' + return _DEFAULT_V1_IMAGE_ID + if acc is None: + return _DEFAULT_CPU_IMAGE_ID + return _DEFAULT_GPU_IMAGE_ID @classmethod def regions_with_offering(cls, instance_type: str, @@ -302,17 +322,34 @@ def make_deploy_resources_variables( else: assert region_name in resources.image_id, resources.image_id image_id = resources.image_id[region_name] + + # Checked basic image syntax in resources.py if image_id.startswith('skypilot:'): image_id = service_catalog.get_image_id_from_tag(image_id, clouds='azure') - # Already checked in resources.py - publisher, offer, sku, version = image_id.split(':') - image_config = { - 'image_publisher': publisher, - 'image_offer': offer, - 'image_sku': sku, - 'image_version': version, - } + # Fallback if image does not exist in the specified region. 
+        # Putting fallback here instead of at image validation
+        # when creating the resource because community images are
+        # regional so we need the correct region when we check whether
+        # the image exists.
+        if image_id.startswith(
+                _COMMUNITY_IMAGE_PREFIX
+        ) and region_name not in azure_catalog.COMMUNITY_IMAGE_AVAILABLE_REGIONS:
+            logger.info(f'Azure image {image_id} does not exist in region '
+                        f'{region_name}; using the fallback image instead.')
+            image_id = service_catalog.get_image_id_from_tag(
+                _FALLBACK_IMAGE_ID, clouds='azure')
+
+        if image_id.startswith(_COMMUNITY_IMAGE_PREFIX):
+            image_config = {'community_gallery_image_id': image_id}
+        else:
+            publisher, offer, sku, version = image_id.split(':')
+            image_config = {
+                'image_publisher': publisher,
+                'image_offer': offer,
+                'image_sku': sku,
+                'image_version': version,
+            }
 
         # Setup the A10 nvidia driver.
         need_nvidia_driver_extension = (acc_dict is not None and
@@ -380,7 +417,6 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]:
         # Setting disk performance tier for high disk tier.
         if disk_tier == resources_utils.DiskTier.HIGH:
             resources_vars['disk_performance_tier'] = 'P50'
-
         return resources_vars
 
     def _get_feasible_launchable_resources(
diff --git a/sky/clouds/service_catalog/azure_catalog.py b/sky/clouds/service_catalog/azure_catalog.py
index 2d323cbac5f..c71285fe9a3 100644
--- a/sky/clouds/service_catalog/azure_catalog.py
+++ b/sky/clouds/service_catalog/azure_catalog.py
@@ -12,6 +12,21 @@
 from sky.utils import resources_utils
 from sky.utils import ux_utils
 
+# This list should match the list of regions in
+# skypilot image generation Packer script's replication_regions
+# sky/clouds/service_catalog/images/skypilot-azure-cpu-ubuntu.pkr.hcl
+COMMUNITY_IMAGE_AVAILABLE_REGIONS = {
+    'centralus',
+    'eastus',
+    'eastus2',
+    'northcentralus',
+    'southcentralus',
+    'westcentralus',
+    'westus',
+    'westus2',
+    'westus3',
+}
+
 # The frequency of pulling the latest catalog from the cloud provider.
 # Though the catalog update is manual in our skypilot-catalog repo, we
 # still want to pull the latest catalog periodically to make sure the
diff --git a/sky/clouds/utils/azure_utils.py b/sky/clouds/utils/azure_utils.py
new file mode 100644
index 00000000000..83b86f4d54f
--- /dev/null
+++ b/sky/clouds/utils/azure_utils.py
@@ -0,0 +1,91 @@
+"""Utilities for Azure."""
+
+import typing
+
+from sky import exceptions
+from sky.adaptors import azure
+from sky.utils import ux_utils
+
+if typing.TYPE_CHECKING:
+    from azure.mgmt import compute as azure_compute
+    from azure.mgmt.compute import models as azure_compute_models
+
+
+def validate_image_id(image_id: str):
+    """Check if the image ID has a valid format.
+
+    Raises:
+        ValueError: If the image ID is invalid.
+    """
+    image_id_colon_splitted = image_id.split(':')
+    image_id_slash_splitted = image_id.split('/')
+    if len(image_id_slash_splitted) != 5 and len(image_id_colon_splitted) != 4:
+        with ux_utils.print_exception_no_traceback():
+            raise ValueError(
+                f'Invalid image id for Azure: {image_id}. Expected format: \n'
+                '* Marketplace image ID: <publisher>:<offer>:<sku>:<version>\n'
+                '* Community image ID: '
+                '/CommunityGalleries/<gallery-name>/Images/<image-name>')
+    if len(image_id_slash_splitted) == 5:
+        _, gallery_type, _, image_type, _ = image_id.split('/')
+        if gallery_type != 'CommunityGalleries' or image_type != 'Images':
+            with ux_utils.print_exception_no_traceback():
+                raise ValueError(
+                    f'Invalid community image id for Azure: {image_id}.\n'
+                    'Expected format: '
+                    '/CommunityGalleries/<gallery-name>/Images/<image-name>')
+
+
+def get_community_image(
+        compute_client: 'azure_compute.ComputeManagementClient', image_id: str,
+        region: str) -> 'azure_compute_models.CommunityGalleryImage':
+    """Get community image from cloud.
+
+    Args:
+        image_id: /CommunityGalleries/<gallery-name>/Images/<image-name>
+    Raises:
+        ResourcesUnavailableError
+    """
+    try:
+        _, _, gallery_name, _, image_name = image_id.split('/')
+        return compute_client.community_gallery_images.get(
+            location=region,
+            public_gallery_name=gallery_name,
+            gallery_image_name=image_name)
+    except azure.exceptions().AzureError as e:
+        raise exceptions.ResourcesUnavailableError(
+            f'Community image {image_id} does not exist in region {region}.'
+        ) from e
+
+
+def get_community_image_size(
+        compute_client: 'azure_compute.ComputeManagementClient',
+        gallery_name: str, image_name: str, region: str) -> float:
+    """Get the size of the community image from cloud.
+
+    Args:
+        gallery_name: Name of the community image gallery.
+        image_name: Name of the image within the gallery.
+        region: Region to query for the image.
+    Raises:
+        ResourcesUnavailableError
+    """
+    try:
+        image_versions = compute_client.community_gallery_image_versions.list(
+            location=region,
+            public_gallery_name=gallery_name,
+            gallery_image_name=image_name,
+        )
+        image_versions = list(image_versions)
+        if not image_versions:
+            raise exceptions.ResourcesUnavailableError(
+                f'No versions available for Azure community image {image_name}')
+        latest_version = image_versions[-1].name
+
+        image_details = compute_client.community_gallery_image_versions.get(
+            location=region,
+            public_gallery_name=gallery_name,
+            gallery_image_name=image_name,
+            gallery_image_version_name=latest_version)
+        return image_details.storage_profile.os_disk_image.disk_size_gb
+    except azure.exceptions().AzureError as e:
+        raise exceptions.ResourcesUnavailableError(
+            f'Failed to get community image size: {e}.') from e
diff --git a/sky/resources.py b/sky/resources.py
index 384f2b6a548..540cbfb703c 100644
--- a/sky/resources.py
+++ b/sky/resources.py
@@ -225,6 +225,7 @@ def __init__(
 
         self._set_memory(memory)
         self._set_accelerators(accelerators, accelerator_args)
+        # TODO: move these out of init to prevent repeated calls.
self._try_validate_instance_type() self._try_validate_cpus_mem() self._try_validate_managed_job_attributes() diff --git a/tests/unit_tests/test_azure_utils.py b/tests/unit_tests/test_azure_utils.py new file mode 100644 index 00000000000..93ef5caadb0 --- /dev/null +++ b/tests/unit_tests/test_azure_utils.py @@ -0,0 +1,21 @@ +import pytest + +from sky.clouds.utils import azure_utils + + +def test_validate_image_id(): + # Valid marketplace image ID + azure_utils.validate_image_id("publisher:offer:sku:version") + + # Valid community image ID + azure_utils.validate_image_id( + "/CommunityGalleries/gallery-name/Images/image-name") + + # Invalid format (neither marketplace nor community) + with pytest.raises(ValueError): + azure_utils.validate_image_id( + "CommunityGalleries/gallery-name/Images/image-name") + + # Invalid marketplace image ID (too few parts) + with pytest.raises(ValueError): + azure_utils.validate_image_id("publisher:offer:sku") From 7c5b7e0baf593080c526387e200407c89f4459e4 Mon Sep 17 00:00:00 2001 From: Yika Date: Thu, 24 Oct 2024 20:11:11 -0700 Subject: [PATCH 84/93] [Performance] Add Azure packer scripts for custom images (#4142) * [Performance] Add Azure packer scripts for custom images * address comment --- sky/clouds/service_catalog/images/README.md | 55 ++++++++----- .../images/skypilot-aws-cpu-ubuntu.pkr.hcl | 6 +- .../images/skypilot-aws-gpu-ubuntu.pkr.hcl | 6 +- .../images/skypilot-azure-cpu-ubuntu.pkr.hcl | 72 +++++++++++++++++ .../images/skypilot-azure-gpu-ubuntu.pkr.hcl | 78 +++++++++++++++++++ 5 files changed, 193 insertions(+), 24 deletions(-) create mode 100644 sky/clouds/service_catalog/images/skypilot-azure-cpu-ubuntu.pkr.hcl create mode 100644 sky/clouds/service_catalog/images/skypilot-azure-gpu-ubuntu.pkr.hcl diff --git a/sky/clouds/service_catalog/images/README.md b/sky/clouds/service_catalog/images/README.md index 31ce7c6d9ce..3bdcbf86560 100644 --- a/sky/clouds/service_catalog/images/README.md +++ b/sky/clouds/service_catalog/images/README.md @@ -10,42 +10,58 @@ packer init plugins.pkr.hcl 3. Setup cloud credentials ## Generate Images -```bash -export CLOUD=gcp # Update this -export TYPE=gpu # Update this -export IMAGE=skypilot-${CLOUD}-${TYPE}-ubuntu -packer build ${IMAGE}.pkr.hcl -``` -You will see the image ID after the build is complete. - -FYI time to packer build an image: - +FYI time to packer build images: | Cloud | Type | Approx. Time | |-------|------|------------------------| | AWS | GPU | 15 min | | AWS | CPU | 10 min | | GCP | GPU | 16 min | | GCP | CPU | 5 min | +| Azure | GPU | 35 min | +| Azure | CPU | 25 min | ### GCP +1. Build a single global image. +```bash +export TYPE=gpu # Update this +export IMAGE=skypilot-gcp-${TYPE}-ubuntu +packer build ${IMAGE}.pkr.hcl +``` +2. Make the image public ```bash -export IMAGE_NAME=skypilot-gcp-cpu-ubuntu-20241011003407 # Update this - # Make image public +export IMAGE_NAME=skypilot-gcp-cpu-ubuntu-xxx # Update this export IMAGE_ID=projects/sky-dev-465/global/images/${IMAGE_NAME} gcloud compute images add-iam-policy-binding ${IMAGE_NAME} --member='allAuthenticatedUsers' --role='roles/compute.imageUser' ``` ### AWS -1. Generate images for all regions +1. Generate the source image for a single region. +```bash +export TYPE=gpu # Update this +export IMAGE=skypilot-aws-${TYPE}-ubuntu +packer build ${IMAGE}.pkr.hcl +``` +2. Copy images to all regions ```bash export IMAGE_ID=ami-0b31b24524afa8e47 # Update this - python aws_utils/image_gen.py --image-id ${IMAGE_ID} --processor ${TYPE} ``` -2. 
Add fallback images if any region failed \ +3. Add fallback images if any region failed \ Look for "NEED_FALLBACK" in the output `images.csv` and edit. (You can use public [ubuntu images](https://cloud-images.ubuntu.com/locator/ec2/) as fallback.) +### Azure +1. Generate a client secret for packer [here](https://portal.azure.com/?feature.msaljs=true#view/Microsoft_AAD_RegisteredApps/ApplicationMenuBlade/~/Credentials/appId/1d249f23-c22e-4d02-b62b-a6827bd113fe/isMSAApp~/false). +```bash +export SECRET=xxxxxx # Update this +``` +2. Build and copy images for all regions and both VM generations (1 and 2). +```bash +export VM_GENERATION=2 # Update this +packer build -force --var vm_generation=${VM_GENERATION} --var client_secret=${SECRET} skypilot-azure-cpu-ubuntu.pkr.hcl +packer build --var client_secret=${SECRET} skypilot-azure-gpu-ubuntu.pkr.hcl +``` + ## Test Images 1. Minimal GPU test: `sky launch --image ${IMAGE_ID} --gpus=L4:1 --cloud ${CLOUD}` then run `nvidia-smi` in the launched instance. 2. Update the image ID in `sky/clouds/gcp.py` and run the test: @@ -60,13 +76,16 @@ pytest tests/test_smoke.py::test_cancel_gcp Submit a PR to update [`SkyPilot Catalog`](https://github.com/skypilot-org/skypilot-catalog/tree/master/catalogs) then clean up the old images to avoid extra iamge storage fees. ### GCP -1. Example PR: [#86](https://github.com/skypilot-org/skypilot-catalog/pull/86) -2. Go to console and delete old images. +1. Update Catalog with new images: [example PR](https://github.com/skypilot-org/skypilot-catalog/pull/86) +2. Go to [GCP console](https://console.cloud.google.com/compute/images?tab=images&project=sky-dev-465) and delete old images. ### AWS 1. Copy the old custom image rows from Catalog's existing `images.csv` to a local `images.csv` in this folder. -2. Update Catalog with new images. Example PR: [#89](https://github.com/skypilot-org/skypilot-catalog/pull/89) +2. Update Catalog with new images: [example PR](https://github.com/skypilot-org/skypilot-catalog/pull/89) 3. Delete AMIs across regions by running ```bash python aws_utils/image_delete.py --tag ${TAG} ``` + +### Azure +1. 
Update Catalog with new images: [example PR](https://github.com/skypilot-org/skypilot-catalog/pull/92) diff --git a/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl index c21fbf51b20..5b049cf35ec 100644 --- a/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl +++ b/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl @@ -22,9 +22,9 @@ source "amazon-ebs" "cpu-ubuntu" { owners = ["099720109477"] } launch_block_device_mappings { - device_name = "/dev/sda1" - volume_size = 8 - volume_type = "gp2" + device_name = "/dev/sda1" + volume_size = 8 + volume_type = "gp2" delete_on_termination = true } } diff --git a/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl index c4a8efac4dc..4579987768a 100644 --- a/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl +++ b/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl @@ -22,9 +22,9 @@ source "amazon-ebs" "gpu-ubuntu" { owners = ["099720109477"] } launch_block_device_mappings { - device_name = "/dev/sda1" - volume_size = 30 - volume_type = "gp2" + device_name = "/dev/sda1" + volume_size = 30 + volume_type = "gp2" delete_on_termination = true } } diff --git a/sky/clouds/service_catalog/images/skypilot-azure-cpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-azure-cpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..2a07c41d136 --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-azure-cpu-ubuntu.pkr.hcl @@ -0,0 +1,72 @@ +variable "client_secret" { + type = string + description = "The client secret for the packer client registered in Azure (see Azure app registration)" +} + +variable "vm_generation" { + type = number + description = "Azure's VM generation, currently support 1 or 2" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") + version = formatdate("YY.MM.DD", timestamp()) +} + +source "azure-arm" "cpu-ubuntu" { + managed_image_resource_group_name = "skypilot-images" + managed_image_name = "skypilot-azure-cpu-ubuntu-${local.timestamp}" + + subscription_id = "59d8c23c-7ef5-42c7-b2f3-a919ad8026a7" + tenant_id = "7c81f068-46f8-4b26-9a46-2fbec2287e3d" + client_id = "1d249f23-c22e-4d02-b62b-a6827bd113fe" + client_secret = var.client_secret + + os_type = "Linux" + image_publisher = "Canonical" + image_offer = "0001-com-ubuntu-server-jammy" + image_sku = var.vm_generation == 1 ? "22_04-lts" : "22_04-lts-gen2" + location = "centralus" + vm_size = var.vm_generation == 1 ? 
"Standard_D1_v2" : "Standard_B2s" + ssh_username = "azureuser" + azure_tags = { + Created_by = "packer" + Purpose = "skypilot" + } + + shared_image_gallery_destination { + subscription = "59d8c23c-7ef5-42c7-b2f3-a919ad8026a7" + resource_group = "skypilot-images" + gallery_name = "skypilot_image_gallery" + image_name = "skypilot-cpu-gen${var.vm_generation}" + image_version = "${local.version}" + replication_regions = [ + "centralus", + "eastus", + "eastus2", + "northcentralus", + "southcentralus", + "westcentralus", + "westus", + "westus2", + "westus3" + ] + } +} + +build { + name = "azure-cpu-ubuntu-build" + sources = ["sources.azure-arm.cpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=azure", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-azure-gpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-azure-gpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..97c99b2431e --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-azure-gpu-ubuntu.pkr.hcl @@ -0,0 +1,78 @@ +variable "client_secret" { + type = string + description = "The client secret for the packer client registered in Azure (see Azure app registration)" +} + +variable "vm_generation" { + type = number + description = "Azure's VM generation, currently support 1 or 2" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") + version = formatdate("YY.MM.DD", timestamp()) +} + +source "azure-arm" "gpu-ubuntu" { + managed_image_resource_group_name = "skypilot-images" + managed_image_name = "skypilot-azure-gpu-ubuntu-${local.timestamp}" + + subscription_id = "59d8c23c-7ef5-42c7-b2f3-a919ad8026a7" + tenant_id = "7c81f068-46f8-4b26-9a46-2fbec2287e3d" + client_id = "1d249f23-c22e-4d02-b62b-a6827bd113fe" + client_secret = var.client_secret + + os_type = "Linux" + image_publisher = "Canonical" + image_offer = "0001-com-ubuntu-server-jammy" + image_sku = var.vm_generation == 1 ? "22_04-lts" : "22_04-lts-gen2" + location = var.vm_generation == 1 ? "eastus" : "centralus" + vm_size = var.vm_generation == 1 ? "Standard_NC4as_T4_v3" : "Standard_NC24ads_A100_v4" + ssh_username = "azureuser" + azure_tags = { + Created_by = "packer" + Purpose = "skypilot" + } + + shared_image_gallery_destination { + subscription = "59d8c23c-7ef5-42c7-b2f3-a919ad8026a7" + resource_group = "skypilot-images" + gallery_name = var.vm_generation == 1 ? 
"skypilot_images": "skypilot_image_gallery" + image_name = "skypilot-gpu-gen${var.vm_generation}" + image_version = "${local.version}" + replication_regions = [ + "centralus", + "eastus", + "eastus2", + "northcentralus", + "southcentralus", + "westcentralus", + "westus", + "westus2", + "westus3" + ] + } +} + +build { + name = "azure-gpu-ubuntu-build" + sources = ["sources.azure-arm.gpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/cuda.sh" + } + provisioner "shell" { + script = "./provisioners/nvidia-container-toolkit.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=azure", + ] + script = "./provisioners/cloud.sh" + } +} From 057bc4b44755ac1e9dadc680e022c369e8ddff52 Mon Sep 17 00:00:00 2001 From: landscapepainter <34902420+landscapepainter@users.noreply.github.com> Date: Thu, 24 Oct 2024 21:58:31 -0700 Subject: [PATCH 85/93] [Azure] Fix to sync NSG status while opening ports (#3844) * fix to update NSG status while opening ports * nit * format * refactor check for nsg creation * format * nit * format * Update sky/provision/azure/config.py Co-authored-by: Zhanghao Wu * Update sky/provision/azure/instance.py Co-authored-by: Zhanghao Wu * Update sky/provision/azure/config.py Co-authored-by: Zhanghao Wu * Update sky/provision/azure/config.py Co-authored-by: Zhanghao Wu * format * additional TODO comments --------- Co-authored-by: Zhanghao Wu --- .../azure/azure-config-template.json | 8 +- sky/provision/azure/config.py | 31 +++- sky/provision/azure/instance.py | 141 +++++++++++------- 3 files changed, 121 insertions(+), 59 deletions(-) diff --git a/sky/provision/azure/azure-config-template.json b/sky/provision/azure/azure-config-template.json index 489783faf98..c743dd40215 100644 --- a/sky/provision/azure/azure-config-template.json +++ b/sky/provision/azure/azure-config-template.json @@ -13,6 +13,12 @@ "metadata": { "description": "Subnet parameters." } + }, + "nsgName": { + "type": "string", + "metadata": { + "description": "Name of the Network Security Group associated with the SkyPilot cluster." 
+ } } }, "variables": { @@ -20,7 +26,7 @@ "location": "[resourceGroup().location]", "msiName": "[concat('sky-', parameters('clusterId'), '-msi')]", "roleAssignmentName": "[concat('sky-', parameters('clusterId'), '-ra')]", - "nsgName": "[concat('sky-', parameters('clusterId'), '-nsg')]", + "nsgName": "[parameters('nsgName')]", "nsg": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('nsgName'))]", "vnetName": "[concat('sky-', parameters('clusterId'), '-vnet')]", "subnetName": "[concat('sky-', parameters('clusterId'), '-subnet')]" diff --git a/sky/provision/azure/config.py b/sky/provision/azure/config.py index 22982a99075..afa94b4adbe 100644 --- a/sky/provision/azure/config.py +++ b/sky/provision/azure/config.py @@ -8,7 +8,7 @@ from pathlib import Path import random import time -from typing import Any, Callable +from typing import Any, Callable, Tuple from sky import exceptions from sky import sky_logging @@ -22,6 +22,7 @@ _DEPLOYMENT_NAME = 'skypilot-config' _LEGACY_DEPLOYMENT_NAME = 'ray-config' _RESOURCE_GROUP_WAIT_FOR_DELETION_TIMEOUT = 480 # 8 minutes +_CLUSTER_ID = '{cluster_name_on_cloud}-{unique_id}' def get_azure_sdk_function(client: Any, function_name: str) -> Callable: @@ -41,6 +42,19 @@ def get_azure_sdk_function(client: Any, function_name: str) -> Callable: return func +def get_cluster_id_and_nsg_name(resource_group: str, + cluster_name_on_cloud: str) -> Tuple[str, str]: + hasher = hashlib.md5(resource_group.encode('utf-8')) + unique_id = hasher.hexdigest()[:UNIQUE_ID_LEN] + # We use the cluster name + resource group hash as the + # unique ID for the cluster, as we need to make sure that + # the deployments have unique names during failover. + cluster_id = _CLUSTER_ID.format(cluster_name_on_cloud=cluster_name_on_cloud, + unique_id=unique_id) + nsg_name = f'sky-{cluster_id}-nsg' + return cluster_id, nsg_name + + @common.log_function_start_end def bootstrap_instances( region: str, cluster_name_on_cloud: str, @@ -117,12 +131,13 @@ def bootstrap_instances( logger.info(f'Using cluster name: {cluster_name_on_cloud}') - hasher = hashlib.md5(provider_config['resource_group'].encode('utf-8')) - unique_id = hasher.hexdigest()[:UNIQUE_ID_LEN] + cluster_id, nsg_name = get_cluster_id_and_nsg_name( + resource_group=provider_config['resource_group'], + cluster_name_on_cloud=cluster_name_on_cloud) subnet_mask = provider_config.get('subnet_mask') if subnet_mask is None: # choose a random subnet, skipping most common value of 0 - random.seed(unique_id) + random.seed(cluster_id) subnet_mask = f'10.{random.randint(1, 254)}.0.0/16' logger.info(f'Using subnet mask: {subnet_mask}') @@ -135,10 +150,10 @@ def bootstrap_instances( 'value': subnet_mask }, 'clusterId': { - # We use the cluster name + resource group hash as the - # unique ID for the cluster, as we need to make sure that - # the deployments have unique names during failover. 
- 'value': f'{cluster_name_on_cloud}-{unique_id}' + 'value': cluster_id + }, + 'nsgName': { + 'value': nsg_name }, }, } diff --git a/sky/provision/azure/instance.py b/sky/provision/azure/instance.py index f6c865e29c8..cc2dc692dec 100644 --- a/sky/provision/azure/instance.py +++ b/sky/provision/azure/instance.py @@ -15,6 +15,7 @@ from sky.adaptors import azure from sky.provision import common from sky.provision import constants +from sky.provision.azure import config as config_lib from sky.utils import common_utils from sky.utils import subprocess_utils from sky.utils import ux_utils @@ -31,6 +32,8 @@ # https://github.com/Azure/azure-sdk-for-python/issues/9422 azure_logger = logging.getLogger('azure') azure_logger.setLevel(logging.WARNING) +Client = Any +NetworkSecurityGroup = Any _RESUME_INSTANCE_TIMEOUT = 480 # 8 minutes _RESUME_PER_INSTANCE_TIMEOUT = 120 # 2 minutes @@ -40,6 +43,10 @@ _RESOURCE_GROUP_NOT_FOUND_ERROR_MESSAGE = 'ResourceGroupNotFound' _POLL_INTERVAL = 1 +# TODO(Doyoung): _LEGACY_NSG_NAME can be remove this after 0.8.0 to ignore +# legacy nsg names. +_LEGACY_NSG_NAME = 'ray-{cluster_name_on_cloud}-nsg' +_SECOND_LEGACY_NSG_NAME = 'sky-{cluster_name_on_cloud}-nsg' class AzureInstanceStatus(enum.Enum): @@ -795,6 +802,32 @@ def _fetch_and_map_status(node, resource_group: str) -> None: return statuses +# TODO(Doyoung): _get_cluster_nsg can be remove this after 0.8.0 to ignore +# legacy nsg names. +def _get_cluster_nsg(network_client: Client, resource_group: str, + cluster_name_on_cloud: str) -> NetworkSecurityGroup: + """Retrieve the NSG associated with the given name of the cluster.""" + list_network_security_groups = _get_azure_sdk_function( + client=network_client.network_security_groups, function_name='list') + legacy_nsg_name = _LEGACY_NSG_NAME.format( + cluster_name_on_cloud=cluster_name_on_cloud) + second_legacy_nsg_name = _SECOND_LEGACY_NSG_NAME.format( + cluster_name_on_cloud=cluster_name_on_cloud) + _, nsg_name = config_lib.get_cluster_id_and_nsg_name( + resource_group=resource_group, + cluster_name_on_cloud=cluster_name_on_cloud) + possible_nsg_names = [nsg_name, legacy_nsg_name, second_legacy_nsg_name] + for nsg in list_network_security_groups(resource_group): + if nsg.name in possible_nsg_names: + return nsg + + # Raise an error if no matching NSG is found + raise ValueError('Failed to find a matching NSG for cluster ' + f'{cluster_name_on_cloud!r} in resource group ' + f'{resource_group!r}. Expected NSG names were: ' + f'{possible_nsg_names}.') + + def open_ports( cluster_name_on_cloud: str, ports: List[str], @@ -809,58 +842,66 @@ def open_ports( update_network_security_groups = _get_azure_sdk_function( client=network_client.network_security_groups, function_name='create_or_update') - list_network_security_groups = _get_azure_sdk_function( - client=network_client.network_security_groups, function_name='list') - for nsg in list_network_security_groups(resource_group): - try: - # Wait the NSG creation to be finished before opening a port. The - # cluster provisioning triggers the NSG creation, but it may not be - # finished yet. - backoff = common_utils.Backoff(max_backoff_factor=1) - start_time = time.time() - while True: - if nsg.provisioning_state not in ['Creating', 'Updating']: - break - if time.time() - start_time > _WAIT_CREATION_TIMEOUT_SECONDS: - logger.warning( - f'Fails to wait for the creation of NSG {nsg.name} in ' - f'{resource_group} within ' - f'{_WAIT_CREATION_TIMEOUT_SECONDS} seconds. 
' - 'Skip this NSG.') - backoff_time = backoff.current_backoff() - logger.info(f'NSG {nsg.name} is not created yet. Waiting for ' - f'{backoff_time} seconds before checking again.') - time.sleep(backoff_time) - - # Azure NSG rules have a priority field that determines the order - # in which they are applied. The priority must be unique across - # all inbound rules in one NSG. - priority = max(rule.priority - for rule in nsg.security_rules - if rule.direction == 'Inbound') + 1 - nsg.security_rules.append( - azure.create_security_rule( - name=f'sky-ports-{cluster_name_on_cloud}-{priority}', - priority=priority, - protocol='Tcp', - access='Allow', - direction='Inbound', - source_address_prefix='*', - source_port_range='*', - destination_address_prefix='*', - destination_port_ranges=ports, - )) - poller = update_network_security_groups(resource_group, nsg.name, - nsg) - poller.wait() - if poller.status() != 'Succeeded': + + try: + # Wait for the NSG creation to be finished before opening a port. The + # cluster provisioning triggers the NSG creation, but it may not be + # finished yet. + backoff = common_utils.Backoff(max_backoff_factor=1) + start_time = time.time() + while True: + nsg = _get_cluster_nsg(network_client, resource_group, + cluster_name_on_cloud) + if nsg.provisioning_state not in ['Creating', 'Updating']: + break + if time.time() - start_time > _WAIT_CREATION_TIMEOUT_SECONDS: with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Failed to open ports {ports} in NSG ' - f'{nsg.name}: {poller.status()}') - except azure.exceptions().HttpResponseError as e: + raise TimeoutError( + f'Timed out while waiting for the Network ' + f'Security Group {nsg.name!r} to be ready for ' + f'cluster {cluster_name_on_cloud!r} in ' + f'resource group {resource_group!r}. The NSG ' + f'did not reach a stable state ' + '(Creating/Updating) within the allocated ' + f'{_WAIT_CREATION_TIMEOUT_SECONDS} seconds. ' + 'Consequently, the operation to open ports ' + f'{ports} failed.') + + backoff_time = backoff.current_backoff() + logger.info(f'NSG {nsg.name} is not created yet. Waiting for ' + f'{backoff_time} seconds before checking again.') + time.sleep(backoff_time) + + # Azure NSG rules have a priority field that determines the order + # in which they are applied. The priority must be unique across + # all inbound rules in one NSG. 
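+        # For example (hypothetical values): if the NSG already has inbound
+        # rules at priorities 1000 and 1001, the rule appended below is
+        # assigned priority 1002.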
+ priority = max(rule.priority + for rule in nsg.security_rules + if rule.direction == 'Inbound') + 1 + nsg.security_rules.append( + azure.create_security_rule( + name=f'sky-ports-{cluster_name_on_cloud}-{priority}', + priority=priority, + protocol='Tcp', + access='Allow', + direction='Inbound', + source_address_prefix='*', + source_port_range='*', + destination_address_prefix='*', + destination_port_ranges=ports, + )) + poller = update_network_security_groups(resource_group, nsg.name, nsg) + poller.wait() + if poller.status() != 'Succeeded': with ux_utils.print_exception_no_traceback(): - raise ValueError( - f'Failed to open ports {ports} in NSG {nsg.name}.') from e + raise ValueError(f'Failed to open ports {ports} in NSG ' + f'{nsg.name}: {poller.status()}') + + except azure.exceptions().HttpResponseError as e: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Failed to open ports {ports} in NSG for cluster ' + f'{cluster_name_on_cloud!r} within resource group ' + f'{resource_group!r}.') from e def cleanup_ports( From ee708e7aad508fab1b48b1f6f297ebe6fb5dbe63 Mon Sep 17 00:00:00 2001 From: Yika Date: Fri, 25 Oct 2024 09:51:45 -0700 Subject: [PATCH 86/93] [Performance] Use new Azure custom images (#4167) * use sky images for azure * auto refresh local images.csv * format --- sky/clouds/azure.py | 6 +++--- sky/clouds/service_catalog/azure_catalog.py | 16 +++++++++++++--- 2 files changed, 16 insertions(+), 6 deletions(-) diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py index d91f589ca8f..0852c993ed3 100644 --- a/sky/clouds/azure.py +++ b/sky/clouds/azure.py @@ -39,9 +39,9 @@ _DEFAULT_AZURE_UBUNTU_2004_IMAGE_GB = 150 _DEFAULT_SKYPILOT_IMAGE_GB = 30 -_DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' -_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' -_DEFAULT_V1_IMAGE_ID = 'skypilot:v1-ubuntu-2004' +_DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu-v2' +_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2' +_DEFAULT_V1_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v1' _DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-ubuntu-2004' _FALLBACK_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' diff --git a/sky/clouds/service_catalog/azure_catalog.py b/sky/clouds/service_catalog/azure_catalog.py index c71285fe9a3..867141f7899 100644 --- a/sky/clouds/service_catalog/azure_catalog.py +++ b/sky/clouds/service_catalog/azure_catalog.py @@ -7,11 +7,14 @@ from typing import Dict, List, Optional, Tuple from sky import clouds as cloud_lib +from sky import sky_logging from sky.clouds import Azure from sky.clouds.service_catalog import common from sky.utils import resources_utils from sky.utils import ux_utils +logger = sky_logging.init_logger(__name__) + # This list should match the list of regions in # skypilot image generation Packer script's replication_regions # sky/clouds/service_catalog/images/skypilot-azure-cpu-ubuntu.pkr.hcl @@ -191,9 +194,16 @@ def list_accelerators( def get_image_id_from_tag(tag: str, region: Optional[str]) -> Optional[str]: """Returns the image id from the tag.""" - # Azure images are not region-specific. - del region # Unused. - return common.get_image_id_from_tag_impl(_image_df, tag, None) + global _image_df + image_id = common.get_image_id_from_tag_impl(_image_df, tag, region) + if image_id is None: + # Refresh the image catalog and try again, if the image tag is not + # found. 
+ logger.debug('Refreshing the image catalog and trying again.') + _image_df = common.read_catalog('azure/images.csv', + pull_frequency_hours=0) + image_id = common.get_image_id_from_tag_impl(_image_df, tag, region) + return image_id def is_image_tag_valid(tag: str, region: Optional[str]) -> bool: From b8a9a5716e746e88e65af397d8a2d84522445cb3 Mon Sep 17 00:00:00 2001 From: Yika Date: Fri, 25 Oct 2024 14:08:06 -0700 Subject: [PATCH 87/93] Fix OCI import issue (#4178) * Fix OCI import issue * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu * edit comments --------- Co-authored-by: Zhanghao Wu --- sky/clouds/oci.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index c6451a73a1f..0feda467bbf 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -468,7 +468,11 @@ def get_credential_file_mounts(self) -> Dict[str, str]: api_key_file = oci_cfg[ 'key_file'] if 'key_file' in oci_cfg else 'BadConf' sky_cfg_file = oci_utils.oci_config.get_sky_user_config_file() - except (ImportError, oci_adaptor.oci.exceptions.ConfigFileNotFound): + # Must catch ImportError before any oci_adaptor.oci.exceptions + # because oci_adaptor.oci.exceptions can throw ImportError. + except ImportError: + return {} + except oci_adaptor.oci.exceptions.ConfigFileNotFound: return {} # OCI config and API key file are mandatory From bbf14d5a52b21387cb2bed086a96f7363bc5750b Mon Sep 17 00:00:00 2001 From: Romil Bhardwaj Date: Fri, 25 Oct 2024 15:08:27 -0700 Subject: [PATCH 88/93] [k8s] Add retry for apparmor failures (#4176) * Add retry for apparmor failures * add comment --- sky/provision/kubernetes/instance.py | 68 +++++++++++++++++++++++++++- 1 file changed, 66 insertions(+), 2 deletions(-) diff --git a/sky/provision/kubernetes/instance.py b/sky/provision/kubernetes/instance.py index 6ce7b74d18e..26ed5f51a43 100644 --- a/sky/provision/kubernetes/instance.py +++ b/sky/provision/kubernetes/instance.py @@ -1,5 +1,6 @@ """Kubernetes instance provisioning.""" import copy +import json import time from typing import Any, Dict, List, Optional import uuid @@ -425,6 +426,70 @@ def _label_pod(namespace: str, context: Optional[str], pod_name: str, _request_timeout=kubernetes.API_TIMEOUT) +def _create_namespaced_pod_with_retries(namespace: str, pod_spec: dict, + context: Optional[str]) -> Any: + """Attempts to create a Kubernetes Pod and handle any errors. + + Currently, we handle errors due to the AppArmor annotation and retry if + it fails due to the `FieldValueForbidden` error. + See https://github.com/skypilot-org/skypilot/issues/4174 for details. + + Returns: The created Pod object. + """ + try: + # Attempt to create the Pod with the AppArmor annotation + pod = kubernetes.core_api(context).create_namespaced_pod( + namespace, pod_spec) + return pod + except kubernetes.api_exception() as e: + try: + error_body = json.loads(e.body) + error_message = error_body.get('message', '') + except json.JSONDecodeError: + error_message = str(e.body) + # Check if the error is due to the AppArmor annotation and retry. + # We add an AppArmor annotation to set it as unconfined in our + # base template in kubernetes-ray.yml.j2. This is required for + # FUSE to work in the pod on most Kubernetes distributions. + # However, some distributions do not support the AppArmor annotation + # and will fail to create the pod. In this case, we retry without + # the annotation. 
+ if (e.status == 422 and 'FieldValueForbidden' in error_message and + 'AppArmorProfile: nil' in error_message): + logger.warning('AppArmor annotation caused pod creation to fail. ' + 'Retrying without the annotation. ' + 'Note: this may cause bucket mounting to fail.') + + # Remove the AppArmor annotation + annotations = pod_spec.get('metadata', {}).get('annotations', {}) + if ('container.apparmor.security.beta.kubernetes.io/ray-node' + in annotations): + del annotations[ + 'container.apparmor.security.beta.kubernetes.io/ray-node'] + pod_spec['metadata']['annotations'] = annotations + logger.info('AppArmor annotation removed from Pod spec.') + else: + logger.warning('AppArmor annotation not found in pod spec, ' + 'retrying will not help. ' + f'Current annotations: {annotations}') + raise e + + # Retry Pod creation without the AppArmor annotation + try: + pod = kubernetes.core_api(context).create_namespaced_pod( + namespace, pod_spec) + logger.info(f'Pod {pod.metadata.name} created successfully ' + 'without AppArmor annotation.') + return pod + except kubernetes.api_exception() as retry_exception: + logger.info('Failed to create Pod without AppArmor annotation: ' + f'{retry_exception}') + raise retry_exception + else: + # Re-raise the exception if it's a different error + raise e + + def _create_pods(region: str, cluster_name_on_cloud: str, config: common.ProvisionConfig) -> common.ProvisionRecord: """Create pods based on the config.""" @@ -546,8 +611,7 @@ def _create_pods(region: str, cluster_name_on_cloud: str, } } - pod = kubernetes.core_api(context).create_namespaced_pod( - namespace, pod_spec) + pod = _create_namespaced_pod_with_retries(namespace, pod_spec, context) created_pods[pod.metadata.name] = pod if head_pod_name is None: head_pod_name = pod.metadata.name From c8ceaf6d7af98def987c98d41fcb502341cf1ed4 Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Fri, 25 Oct 2024 16:51:33 -0700 Subject: [PATCH 89/93] [Docs] Update Managed Jobs page. (#4177) * [Docs] Update Managed Jobs page. * Lint * Updates --- docs/source/examples/managed-jobs.rst | 89 +++++++++++++++------------ 1 file changed, 51 insertions(+), 38 deletions(-) diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst index a47b4345b9f..d85356c936a 100644 --- a/docs/source/examples/managed-jobs.rst +++ b/docs/source/examples/managed-jobs.rst @@ -5,14 +5,20 @@ Managed Jobs .. tip:: - This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines). + This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel. -SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures. -It can be used in three modes: +SkyPilot supports **managed jobs** (:code:`sky jobs`), where "managed" means +if a job's underlying compute experienced any spot preemptions or hardware failures, +SkyPilot will automatically recover the job. -#. :ref:`Managed Spot Jobs `: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs. -#. :ref:`On-demand `: Jobs run on auto-recovering on-demand instances. This is useful for jobs that require guaranteed resources. -#. :ref:`Pipelines `: Run pipelines that contain multiple tasks (which can have different resource requirements and ``setup``/``run`` commands). 
This is useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it. +Managed jobs can be used in three modes: + +#. :ref:`Managed spot jobs `: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs. +#. :ref:`Managed on-demand/reserved jobs `: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources. +#. :ref:`Managed pipelines `: Run pipelines that contain multiple tasks (which + can have different resource requirements and ``setup``/``run`` commands). + Useful for running a sequence of tasks that depend on each other, e.g., data + processing, training a model, and then running inference on it. .. _spot-jobs: @@ -20,28 +26,12 @@ It can be used in three modes: Managed Spot Jobs ----------------- -In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability. -Any spot preemptions are automatically handled by SkyPilot without user intervention. - +In this mode, jobs run on spot instances, and preemptions are auto-recovered by SkyPilot. -Quick comparison between *unmanaged spot clusters* vs. *managed spot jobs*: - -.. list-table:: - :widths: 30 18 12 35 - :header-rows: 1 +To launch a managed spot job, use :code:`sky jobs launch --use-spot`. +SkyPilot automatically finds available spot instances across regions and clouds to maximize availability. +Any spot preemptions are automatically handled by SkyPilot without user intervention. - * - Command - - Managed? - - SSH-able? - - Best for - * - :code:`sky launch --use-spot` - - Unmanaged spot cluster - - Yes - - Interactive dev on spot instances (especially for hardware with low preemption rates) - * - :code:`sky jobs launch --use-spot` - - Managed spot job (auto-recovery) - - No - - Scaling out long-running jobs (e.g., data processing, training, batch inference) Here is an example of a BERT training job failing over different regions across AWS and GCP. @@ -59,6 +49,25 @@ To use managed spot jobs, there are two requirements: #. :ref:`Checkpointing ` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket `. The program can reload the latest checkpoint when restarted. +Quick comparison between *managed spot jobs* vs. *launching spot clusters*: + +.. list-table:: + :widths: 30 18 12 35 + :header-rows: 1 + + * - Command + - Managed? + - SSH-able? + - Best for + * - :code:`sky jobs launch --use-spot` + - Yes, preemptions are auto-recovered + - No + - Scaling out long-running jobs (e.g., data processing, training, batch inference) + * - :code:`sky launch --use-spot` + - No, preemptions are not handled + - Yes + - Interactive dev on spot instances (especially for hardware with low preemption rates) + .. _job-yaml: Job YAML @@ -245,11 +254,11 @@ Real-World Examples .. _on-demand: -Using On-Demand Instances --------------------------------- +Managed On-Demand/Reserved Jobs +------------------------------- The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering -on-demand instances. This is useful to have SkyPilot monitor any underlying +on-demand or reserved instances. 
This is useful to have SkyPilot monitor any underlying machine failures and transparently recover the job. To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI. @@ -264,10 +273,10 @@ To do so, simply set :code:`use_spot: false` in the :code:`resources` section, o interface, while ``sky launch`` is a cluster interface (that you can launch tasks on, albeit not managed). -Either Spot Or On-Demand -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Either Spot or On-Demand/Reserved +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can use ``any_of`` to specify either spot or on-demand instances as +You can use ``any_of`` to specify either spot or on-demand/reserved instances as candidate resources for a job. See documentation :ref:`here ` for more details. @@ -280,12 +289,17 @@ candidate resources for a job. See documentation :ref:`here - use_spot: false In this example, SkyPilot will perform cost optimizations to select the resource to use, which almost certainly -will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand instances. +will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand/reserved instances. More advanced policies for resource selection, such as the `Can't Be Late `__ (NSDI'24) paper, may be supported in the future. +Running Many Parallel Jobs +-------------------------- + +For batch jobs such as **data processing** or **hyperparameter sweeps**, you can launch many jobs in parallel. See :ref:`many-jobs`. + Useful CLIs ----------- @@ -323,11 +337,10 @@ Cancel a managed job: If any failure happens for a managed job, you can check :code:`sky jobs queue -a` for the brief reason of the failure. For more details, it would be helpful to check :code:`sky jobs logs --controller `. - .. _pipeline: -Job Pipelines -------------- +Managed Pipelines +----------------- A pipeline is a managed job that contains a sequence of tasks running one after another. @@ -414,8 +427,8 @@ To submit the pipeline, the same command :code:`sky jobs launch` is used. The pi -Dashboard ---------- +Job Dashboard +------------- Use ``sky jobs dashboard`` to open a dashboard to see all jobs: From df80daedea2eefd52c2e681b8da370305fcda262 Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Fri, 25 Oct 2024 17:32:25 -0700 Subject: [PATCH 90/93] Minor: Jobs docs fix. (#4183) * [Docs] Update Managed Jobs page. * Lint * Updates * reword --- docs/source/examples/managed-jobs.rst | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst index d85356c936a..993ad361d66 100644 --- a/docs/source/examples/managed-jobs.rst +++ b/docs/source/examples/managed-jobs.rst @@ -7,10 +7,7 @@ Managed Jobs This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel. -SkyPilot supports **managed jobs** (:code:`sky jobs`), where "managed" means -if a job's underlying compute experienced any spot preemptions or hardware failures, -SkyPilot will automatically recover the job. - +SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any underlying spot preemptions or hardware failures. Managed jobs can be used in three modes: #. :ref:`Managed spot jobs `: Jobs run on auto-recovering spot instances. 
This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs. From 0e915d3430d8027aa40b766605bb13c889ffc62f Mon Sep 17 00:00:00 2001 From: Christopher Cooper Date: Fri, 25 Oct 2024 18:54:11 -0700 Subject: [PATCH 91/93] [UX] remove all uses of deprecated `sky jobs` (#4173) * [UX] remove all uses of deprecated `sky jobs` * Apply suggestions from code review Co-authored-by: Romil Bhardwaj * fix other mentions of "spot jobs" --------- Co-authored-by: Romil Bhardwaj --- docs/source/examples/managed-jobs.rst | 2 +- docs/source/reference/faq.rst | 2 +- examples/managed_job_with_storage.yaml | 2 +- llm/axolotl/axolotl-spot.yaml | 2 +- llm/axolotl/readme.md | 2 +- llm/falcon/README.md | 12 ++++++------ llm/vicuna-llama-2/README.md | 6 +++--- llm/vicuna/README.md | 4 ++-- sky/cli.py | 2 +- sky/jobs/controller.py | 2 +- tests/backward_compatibility_tests.sh | 4 ++-- 11 files changed, 20 insertions(+), 20 deletions(-) diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst index 993ad361d66..8e329adaa81 100644 --- a/docs/source/examples/managed-jobs.rst +++ b/docs/source/examples/managed-jobs.rst @@ -99,7 +99,7 @@ We can launch it with the following: setup: | # Fill in your wandb key: copy from https://wandb.ai/authorize # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY` - # to pass the key in the command line, during `sky spot launch`. + # to pass the key in the command line, during `sky jobs launch`. echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc pip install -e . diff --git a/docs/source/reference/faq.rst b/docs/source/reference/faq.rst index 5a966a0014f..1ade656b44b 100644 --- a/docs/source/reference/faq.rst +++ b/docs/source/reference/faq.rst @@ -38,7 +38,7 @@ How to ensure my workdir's ``.git`` is synced up for managed spot jobs? Currently, there is a difference in whether ``.git`` is synced up depending on the command used: - For regular ``sky launch``, the workdir's ``.git`` is synced up by default. -- For managed spot jobs ``sky spot launch``, the workdir's ``.git`` is excluded by default. +- For managed jobs ``sky jobs launch``, the workdir's ``.git`` is excluded by default. In the second case, to ensure the workdir's ``.git`` is synced up for managed spot jobs, you can explicitly add a file mount to sync it up: diff --git a/examples/managed_job_with_storage.yaml b/examples/managed_job_with_storage.yaml index 61244c16ba0..677e2c8ed6d 100644 --- a/examples/managed_job_with_storage.yaml +++ b/examples/managed_job_with_storage.yaml @@ -3,7 +3,7 @@ # Runs a task that uses cloud buckets for uploading and accessing files. 
# # Usage: -# sky spot launch -c spot-storage examples/managed_job_with_storage.yaml +# sky jobs launch -c spot-storage examples/managed_job_with_storage.yaml # sky down spot-storage resources: diff --git a/llm/axolotl/axolotl-spot.yaml b/llm/axolotl/axolotl-spot.yaml index b22a8ae3fce..0e04ba11992 100644 --- a/llm/axolotl/axolotl-spot.yaml +++ b/llm/axolotl/axolotl-spot.yaml @@ -4,7 +4,7 @@ # HF_TOKEN=abc BUCKET= sky launch -c axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET -i30 --down # # Managed spot (auto-recovery; for full runs): -# HF_TOKEN=abc BUCKET= sky spot launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET +# HF_TOKEN=abc BUCKET= sky jobs launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET name: axolotl diff --git a/llm/axolotl/readme.md b/llm/axolotl/readme.md index 0cc06b98723..eb80231aa93 100644 --- a/llm/axolotl/readme.md +++ b/llm/axolotl/readme.md @@ -22,5 +22,5 @@ ssh -L 8888:localhost:8888 axolotl-spot Launch managed spot instances (auto-recovery; for full runs): ``` -HF_TOKEN=abc BUCKET= sky spot launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET +HF_TOKEN=abc BUCKET= sky jobs launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET ``` diff --git a/llm/falcon/README.md b/llm/falcon/README.md index 6eb480d9ea8..1f40dc9f524 100644 --- a/llm/falcon/README.md +++ b/llm/falcon/README.md @@ -1,6 +1,6 @@ # Finetuning Falcon with SkyPilot -This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, an open-source LLM that rivals many current closed-source models, including ChatGPT. +This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, an open-source LLM that rivals many current closed-source models, including ChatGPT. * [Blog post](https://huggingface.co/blog/falcon) * [Repo](https://huggingface.co/tiiuae/falcon-40b) @@ -16,10 +16,10 @@ sky check See the Falcon SkyPilot YAML for [training](train.yaml). Serving is currently a work in progress and a YAML will be provided for that soon! We are also working on adding an evaluation step to evaluate the model you finetuned compared to the base model. ## Running Falcon on SkyPilot -Finetuning `Falcon-7B` and `Falcon-40B` require GPUs with 80GB memory, +Finetuning `Falcon-7B` and `Falcon-40B` require GPUs with 80GB memory, but `Falcon-7b-sharded` requires only 40GB memory. Thus, * If your GPU has 40 GB memory or less (e.g., Nvidia A100): use `ybelkada/falcon-7b-sharded-bf16`. -* If your GPU has 80 GB memory (e.g., Nvidia A100-80GB): you can also use `tiiuae/falcon-7b` and `tiiuae/falcon-40b`. +* If your GPU has 80 GB memory (e.g., Nvidia A100-80GB): you can also use `tiiuae/falcon-7b` and `tiiuae/falcon-40b`. Try `sky show-gpus --all` for supported GPUs. @@ -32,13 +32,13 @@ Steps for training on your cloud(s): 1. In [train.yaml](train.yaml), set the following variables in `envs`: - Replace the `OUTPUT_BUCKET_NAME` with a unique name. SkyPilot will create this bucket for you to store the model weights. - - Replace the `WANDB_API_KEY` to your own key. - - Replace the `MODEL_NAME` with your desired base model. + - Replace the `WANDB_API_KEY` to your own key. + - Replace the `MODEL_NAME` with your desired base model. 2. **Training the Falcon model using spot instances**: ```bash -sky spot launch -n falcon falcon.yaml +sky jobs launch --use-spot -n falcon falcon.yaml ``` Currently, such `A100-80GB:1` spot instances are only available on AWS and GCP. 
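As a programmatic counterpart to the `sky jobs launch` command above, a managed job can also be submitted through SkyPilot's Python API. The sketch below is illustrative only and is not part of this patch: it assumes the `sky.jobs.launch` API of this era of the codebase, and the YAML file name and `WANDB_API_KEY` value are placeholders taken from the README.

```python
# Illustrative sketch (assumed API, not from the patch): programmatic
# analogue of `sky jobs launch --use-spot -n falcon falcon.yaml`.
import sky
from sky import jobs

task = sky.Task.from_yaml('falcon.yaml')  # YAML name as used in the README
task.update_envs({'WANDB_API_KEY': '<your-key>'})  # as `--env` would do
# Spot usage itself comes from the task's resources (`use_spot: true` in the
# YAML, or the `--use-spot` CLI flag); the managed-jobs controller then
# auto-recovers the job on preemption.
jobs.launch(task, name='falcon')
```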
diff --git a/llm/vicuna-llama-2/README.md b/llm/vicuna-llama-2/README.md index 24caa525a56..e392b231e64 100644 --- a/llm/vicuna-llama-2/README.md +++ b/llm/vicuna-llama-2/README.md @@ -120,12 +120,12 @@ sky launch --no-use-spot ... ### Reducing costs by 3x with spot instances -[SkyPilot Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) is a library built on top of SkyPilot that helps users run jobs on spot instances without worrying about interruptions. That is the tool used by the LMSYS organization to train the first version of Vicuna (more details can be found in their [launch blog post](https://lmsys.org/blog/2023-03-30-vicuna/) and [example](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna)). With this, the training cost can be reduced from $1000 to **\$300**. +[SkyPilot Managed Jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html) is a library built on top of SkyPilot that helps users run jobs on spot instances without worrying about interruptions. That is the tool used by the LMSYS organization to train the first version of Vicuna (more details can be found in their [launch blog post](https://lmsys.org/blog/2023-03-30-vicuna/) and [example](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna)). With this, the training cost can be reduced from $1000 to **\$300**. -To use SkyPilot Managed Spot, you can simply replace `sky launch` with `sky spot launch` in the above command: +To use SkyPilot Managed Spot Jobs, you can simply replace `sky launch` with `sky jobs launch` in the above command: ```bash -sky spot launch -n vicuna train.yaml \ +sky jobs launch -n vicuna train.yaml \ --env ARTIFACT_BUCKET_NAME= \ --env WANDB_API_KEY= ``` diff --git a/llm/vicuna/README.md b/llm/vicuna/README.md index b511eb7f4b0..6d9f46127d4 100644 --- a/llm/vicuna/README.md +++ b/llm/vicuna/README.md @@ -63,14 +63,14 @@ Steps for training on your cloud(s): 2. **Training the Vicuna-7B model on 8 A100 GPUs (80GB memory) using spot instances**: ```bash # Launch it on managed spot to save 3x cost -sky spot launch -n vicuna train.yaml +sky jobs launch -n vicuna train.yaml ``` Note: if you would like to see the training curve on W&B, you can add `--env WANDB_API_KEY` to the above command, which will propagate your local W&B API key in the environment variable to the job. [Optional] Train a larger 13B model ``` # Train a 13B model instead of the default 7B -sky spot launch -n vicuna-7b train.yaml --env MODEL_SIZE=13 +sky jobs launch -n vicuna-7b train.yaml --env MODEL_SIZE=13 # Use *unmanaged* spot instances (i.e., preemptions won't get auto-recovered). # Unmanaged spot provides a better interactive development experience but is vulnerable to spot preemptions. diff --git a/sky/cli.py b/sky/cli.py index 6e0587cc117..db1befb04a3 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -3519,7 +3519,7 @@ def jobs(): default=None, type=str, hidden=True, - help=('Alias for --name, the name of the spot job.')) + help=('Alias for --name, the name of the managed job.')) @click.option('--job-recovery', default=None, type=str, diff --git a/sky/jobs/controller.py b/sky/jobs/controller.py index f3cd81576e2..1faa5dfbe31 100644 --- a/sky/jobs/controller.py +++ b/sky/jobs/controller.py @@ -215,7 +215,7 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: end_time=end_time, callback_func=callback_func) logger.info( - f'Spot job {self._job_id} (task: {task_id}) SUCCEEDED. ' + f'Managed job {self._job_id} (task: {task_id}) SUCCEEDED. 
' f'Cleaning up the cluster {cluster_name}.') # Only clean up the cluster, not the storages, because tasks may # share storages. diff --git a/tests/backward_compatibility_tests.sh b/tests/backward_compatibility_tests.sh index 4f83c379ccf..276fda899dd 100644 --- a/tests/backward_compatibility_tests.sh +++ b/tests/backward_compatibility_tests.sh @@ -167,8 +167,8 @@ MANAGED_JOB_JOB_NAME=${CLUSTER_NAME}-${uuid:0:4} if [ "$start_from" -le 7 ]; then conda activate sky-back-compat-master rm -r ~/.sky/wheels || true -sky spot launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-0 "echo hi; sleep 1000" -sky spot launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-1 "echo hi; sleep 400" +sky jobs launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-0 "echo hi; sleep 1000" +sky jobs launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-1 "echo hi; sleep 400" conda activate sky-back-compat-current rm -r ~/.sky/wheels || true s=$(sky jobs queue | grep ${MANAGED_JOB_JOB_NAME}-7 | grep "RUNNING" | wc -l) From 647fcea335dec9f180421342d6c41cd67a3c8674 Mon Sep 17 00:00:00 2001 From: Tian Xia Date: Sat, 26 Oct 2024 13:34:01 -0700 Subject: [PATCH 92/93] [Azure] Support fractional A10 instance types (#3877) * fix * change catalog to float gpu num * support print float point gpu in sky launch. TODO: test if the ray deployment group works for fractional one * fix unittest * format * patch ray resources to ceil value * support launch from --gpus A10 * only allow strictly match fractional gpu counts * address comment * change back condition * fix * apply suggestions from code review * fix * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zhanghao Wu * format * fix display of fuzzy candidates * fix precision issue * fix num gpu required * refactor in check_resources_fit_cluster * change type annotation of acc_count * enable fuzzy fp acc count * fix k8s * Update sky/clouds/service_catalog/common.py Co-authored-by: Zhanghao Wu * fix integer gpus * format --------- Co-authored-by: Zhanghao Wu --- sky/backends/cloud_vm_ray_backend.py | 15 +++++++++ sky/clouds/aws.py | 11 +++---- sky/clouds/azure.py | 10 +++--- sky/clouds/cloud.py | 18 +++++++---- sky/clouds/cudo.py | 11 +++---- sky/clouds/fluidstack.py | 11 +++---- sky/clouds/gcp.py | 4 +-- sky/clouds/ibm.py | 11 +++---- sky/clouds/kubernetes.py | 11 +++---- sky/clouds/lambda_cloud.py | 11 +++---- sky/clouds/oci.py | 11 +++---- sky/clouds/paperspace.py | 11 +++---- sky/clouds/runpod.py | 11 +++---- sky/clouds/scp.py | 11 +++---- sky/clouds/service_catalog/__init__.py | 2 +- sky/clouds/service_catalog/aws_catalog.py | 4 +-- sky/clouds/service_catalog/azure_catalog.py | 5 +-- sky/clouds/service_catalog/common.py | 21 ++++++++---- sky/clouds/service_catalog/cudo_catalog.py | 4 +-- .../data_fetchers/fetch_azure.py | 32 ++++++++++++------- .../service_catalog/fluidstack_catalog.py | 4 +-- sky/clouds/service_catalog/ibm_catalog.py | 4 +-- sky/clouds/service_catalog/lambda_catalog.py | 4 +-- sky/clouds/service_catalog/oci_catalog.py | 4 +-- .../service_catalog/paperspace_catalog.py | 4 +-- sky/clouds/service_catalog/runpod_catalog.py | 4 +-- sky/clouds/service_catalog/scp_catalog.py | 4 +-- sky/clouds/service_catalog/vsphere_catalog.py | 4 +-- sky/clouds/vsphere.py | 11 +++---- sky/resources.py | 2 +- sky/utils/resources_utils.py | 14 +++++++- 31 files changed, 150 insertions(+), 134 deletions(-) diff --git a/sky/backends/cloud_vm_ray_backend.py 
b/sky/backends/cloud_vm_ray_backend.py index f0fb4d97ba1..918848b045b 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -2713,6 +2713,21 @@ def check_resources_fit_cluster( f' Existing:\t{handle.launched_nodes}x ' f'{handle.launched_resources}\n' f'{mismatch_str}') + else: + # For fractional acc count clusters, we round up the number of accs + # to 1 (sky/utils/resources_utils.py::make_ray_custom_resources_str) + # Here we scale the required acc count to (required / launched) * 1 + # so the total number of accs is the same as the requested number. + launched_accs = launched_resources.accelerators + if (launched_accs is not None and + valid_resource.accelerators is not None): + for _, count in launched_accs.items(): + if isinstance(count, float) and not count.is_integer(): + valid_resource = valid_resource.copy( + accelerators={ + k: v / count + for k, v in valid_resource.accelerators.items() + }) return valid_resource def _provision( diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py index a0962b17cac..43062ebf393 100644 --- a/sky/clouds/aws.py +++ b/sky/clouds/aws.py @@ -2,13 +2,12 @@ import enum import fnmatch import functools -import json import os import re import subprocess import time import typing -from typing import Any, Dict, Iterator, List, Optional, Set, Tuple +from typing import Any, Dict, Iterator, List, Optional, Set, Tuple, Union from sky import clouds from sky import exceptions @@ -383,7 +382,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='aws') @@ -411,10 +410,8 @@ def make_deploy_resources_variables( r = resources # r.accelerators is cleared but .instance_type encodes the info. 
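         # (e.g., 'p3.2xlarge' maps back to {'V100': 1}; with this change the
         # counts may also be fractional floats, such as {'A10': 0.5} on
         # Azure's NV-series.)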
         acc_dict = self.get_accelerators_from_instance_type(r.instance_type)
-        if acc_dict is not None:
-            custom_resources = json.dumps(acc_dict, separators=(',', ':'))
-        else:
-            custom_resources = None
+        custom_resources = resources_utils.make_ray_custom_resources_str(
+            acc_dict)
 
         if r.extract_docker_image() is not None:
             image_id_to_use = None
diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py
index 0852c993ed3..fc9579d17c0 100644
--- a/sky/clouds/azure.py
+++ b/sky/clouds/azure.py
@@ -1,12 +1,11 @@
 """Azure."""
 import functools
-import json
 import os
 import re
 import subprocess
 import textwrap
 import typing
-from typing import Any, Dict, Iterator, List, Optional, Tuple
+from typing import Any, Dict, Iterator, List, Optional, Tuple, Union
 
 import colorama
 
@@ -272,7 +271,7 @@ def zones_provision_loop(
     def get_accelerators_from_instance_type(
         cls,
         instance_type: str,
-    ) -> Optional[Dict[str, int]]:
+    ) -> Optional[Dict[str, Union[int, float]]]:
         return service_catalog.get_accelerators_from_instance_type(
             instance_type, clouds='azure')
 
@@ -304,10 +303,9 @@ def make_deploy_resources_variables(
         acc_dict = self.get_accelerators_from_instance_type(r.instance_type)
         acc_count = None
         if acc_dict is not None:
-            custom_resources = json.dumps(acc_dict, separators=(',', ':'))
             acc_count = str(sum(acc_dict.values()))
-        else:
-            custom_resources = None
+        custom_resources = resources_utils.make_ray_custom_resources_str(
+            acc_dict)
 
         if (resources.image_id is None or
                 resources.extract_docker_image() is not None):
diff --git a/sky/clouds/cloud.py b/sky/clouds/cloud.py
index 3e21204f0a3..4028c1fef59 100644
--- a/sky/clouds/cloud.py
+++ b/sky/clouds/cloud.py
@@ -9,8 +9,9 @@
 """
 import collections
 import enum
+import math
 import typing
-from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple
+from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple, Union
 
 from sky import exceptions
 from sky import skypilot_config
@@ -306,7 +307,7 @@ def get_vcpus_mem_from_instance_type(
     def get_accelerators_from_instance_type(
         cls,
         instance_type: str,
-    ) -> Optional[Dict[str, int]]:
+    ) -> Optional[Dict[str, Union[int, float]]]:
         """Returns {acc: acc_count} held by 'instance_type', if any."""
         raise NotImplementedError
 
@@ -673,8 +674,9 @@ def _check_instance_type_accelerators_combination(
         assert resources.is_launchable(), resources
 
         def _equal_accelerators(
-                acc_requested: Optional[Dict[str, int]],
-                acc_from_instance_type: Optional[Dict[str, int]]) -> bool:
+                acc_requested: Optional[Dict[str, Union[int, float]]],
+                acc_from_instance_type: Optional[Dict[str, Union[int,
+                                                                 float]]]) -> bool:
             """Check the requested accelerators equals to the instance type
 
             Check the requested accelerators equals to the accelerators
@@ -689,12 +691,14 @@ def _equal_accelerators(
             for acc in acc_requested:
                 if acc not in acc_from_instance_type:
                     return False
-                if acc_requested[acc] != acc_from_instance_type[acc]:
+                # Avoid floating point precision issues.
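+                # e.g., a request for {'A10': 0.5} should match a catalog
+                # value of 0.5 even if the two floats differ in their last
+                # binary digit.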
+ if not math.isclose(acc_requested[acc], + acc_from_instance_type[acc]): return False return True - acc_from_instance_type = (cls.get_accelerators_from_instance_type( - resources.instance_type)) + acc_from_instance_type = cls.get_accelerators_from_instance_type( + resources.instance_type) if not _equal_accelerators(resources.accelerators, acc_from_instance_type): with ux_utils.print_exception_no_traceback(): diff --git a/sky/clouds/cudo.py b/sky/clouds/cudo.py index 4dca442fa01..6f02e007049 100644 --- a/sky/clouds/cudo.py +++ b/sky/clouds/cudo.py @@ -1,8 +1,7 @@ """Cudo Compute""" -import json import subprocess import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union from sky import clouds from sky.clouds import service_catalog @@ -183,7 +182,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='cudo') @@ -202,10 +201,8 @@ def make_deploy_resources_variables( del zones, cluster_name # unused r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) return { 'instance_type': resources.instance_type, diff --git a/sky/clouds/fluidstack.py b/sky/clouds/fluidstack.py index 473fceabbe3..31e2112f8f7 100644 --- a/sky/clouds/fluidstack.py +++ b/sky/clouds/fluidstack.py @@ -1,8 +1,7 @@ """Fluidstack Cloud.""" -import json import os import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union import requests @@ -155,7 +154,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='fluidstack') @@ -184,10 +183,8 @@ def make_deploy_resources_variables( r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) return { 'instance_type': resources.instance_type, diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index 1b70abf914d..0e20fdc9789 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -7,7 +7,7 @@ import subprocess import time import typing -from typing import Any, Dict, Iterator, List, Optional, Set, Tuple +from typing import Any, Dict, Iterator, List, Optional, Set, Tuple, Union import colorama @@ -669,7 +669,7 @@ def _get_feasible_launchable_resources( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: # GCP handles accelerators separately from regular instance types, # hence return none here. 
return None diff --git a/sky/clouds/ibm.py b/sky/clouds/ibm.py index b78cc4287c0..0ac3c36cc48 100644 --- a/sky/clouds/ibm.py +++ b/sky/clouds/ibm.py @@ -1,8 +1,7 @@ """IBM Web Services.""" -import json import os import typing -from typing import Any, Dict, Iterator, List, Optional, Tuple +from typing import Any, Dict, Iterator, List, Optional, Tuple, Union import colorama @@ -206,10 +205,8 @@ def _get_profile_resources(instance_profile): 'IBM does not currently support spot instances in this framework' acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) instance_resources = _get_profile_resources(r.instance_type) @@ -247,7 +244,7 @@ def get_vcpus_mem_from_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: """Returns {acc: acc_count} held by 'instance_type', if any.""" return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='ibm') diff --git a/sky/clouds/kubernetes.py b/sky/clouds/kubernetes.py index 8ff4172a5b1..39ddbe30577 100644 --- a/sky/clouds/kubernetes.py +++ b/sky/clouds/kubernetes.py @@ -1,10 +1,9 @@ """Kubernetes.""" import functools -import json import os import re import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union from sky import clouds from sky import sky_logging @@ -271,7 +270,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: inst = kubernetes_utils.KubernetesInstanceType.from_instance_type( instance_type) return { @@ -328,10 +327,8 @@ def make_deploy_resources_variables( r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) # resources.memory and cpus are None if they are not explicitly set. # We fetch the default values for the instance type in that case. 
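Note: the substitution above, replacing the inline `json.dumps` branch with `resources_utils.make_ray_custom_resources_str`, repeats verbatim for every remaining cloud in this patch. As a rough standalone sketch of what the helper does (its actual definition appears in the `sky/utils/resources_utils.py` hunk at the end of this patch), using only the standard library:

```python
import json
import math
from typing import Dict, Optional, Union


def make_ray_custom_resources_str(
        resource_dict: Optional[Dict[str, Union[int, float]]]) -> Optional[str]:
    # Mirrors the helper added in this patch: per its own comment, Ray custom
    # resources cannot be fractional here, so fractional counts are ceiled.
    if resource_dict is None:
        return None
    ceiled_dict = {k: math.ceil(v) for k, v in resource_dict.items()}
    return json.dumps(ceiled_dict, separators=(',', ':'))


print(make_ray_custom_resources_str(None))            # None
print(make_ray_custom_resources_str({'V100': 4}))     # {"V100":4}
print(make_ray_custom_resources_str({'A10': 0.167}))  # {"A10":1}
```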
diff --git a/sky/clouds/lambda_cloud.py b/sky/clouds/lambda_cloud.py index 0201f4f76ad..055a5338750 100644 --- a/sky/clouds/lambda_cloud.py +++ b/sky/clouds/lambda_cloud.py @@ -1,7 +1,6 @@ """Lambda Cloud.""" -import json import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union import requests @@ -136,7 +135,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='lambda') @@ -164,10 +163,8 @@ def make_deploy_resources_variables( r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) resources_vars = { 'instance_type': resources.instance_type, diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index 0feda467bbf..93a70c5ac37 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -20,11 +20,10 @@ - Hysun He (hysun.he@oracle.com) @ Oct 13, 2024: Support more OS types additional to ubuntu for OCI resources. """ -import json import logging import os import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union from sky import clouds from sky import exceptions @@ -193,7 +192,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='oci') @@ -213,10 +212,8 @@ def make_deploy_resources_variables( acc_dict = self.get_accelerators_from_instance_type( resources.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) image_str = self._get_image_id(resources.image_id, region.name, resources.instance_type) diff --git a/sky/clouds/paperspace.py b/sky/clouds/paperspace.py index 4c4fa1d695a..4047a2f5926 100644 --- a/sky/clouds/paperspace.py +++ b/sky/clouds/paperspace.py @@ -1,8 +1,7 @@ """ Paperspace Cloud. 
""" -import json import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union import requests @@ -162,7 +161,7 @@ def get_default_instance_type( @classmethod def get_accelerators_from_instance_type( - cls, instance_type: str) -> Optional[Dict[str, int]]: + cls, instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='paperspace') @@ -181,10 +180,8 @@ def make_deploy_resources_variables( r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) return { 'instance_type': resources.instance_type, diff --git a/sky/clouds/runpod.py b/sky/clouds/runpod.py index 6cfdf11c6b4..0d693fd9f60 100644 --- a/sky/clouds/runpod.py +++ b/sky/clouds/runpod.py @@ -1,8 +1,7 @@ """ RunPod Cloud. """ -import json import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union from sky import clouds from sky.clouds import service_catalog @@ -147,7 +146,7 @@ def get_default_instance_type( @classmethod def get_accelerators_from_instance_type( - cls, instance_type: str) -> Optional[Dict[str, int]]: + cls, instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='runpod') @@ -166,10 +165,8 @@ def make_deploy_resources_variables( r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) if r.image_id is None: image_id = 'runpod/base:0.0.2' diff --git a/sky/clouds/scp.py b/sky/clouds/scp.py index 17a54ce1607..d0ad611bf0c 100644 --- a/sky/clouds/scp.py +++ b/sky/clouds/scp.py @@ -4,9 +4,8 @@ to access the SCP catalog and check credentials for the SCP access. 
""" -import json import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union from sky import clouds from sky import exceptions @@ -160,7 +159,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds='scp') @@ -188,11 +187,9 @@ def make_deploy_resources_variables( r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None image_id = self._get_image_id(r.image_id, region.name, r.instance_type) return { 'instance_type': resources.instance_type, diff --git a/sky/clouds/service_catalog/__init__.py b/sky/clouds/service_catalog/__init__.py index f2301bac466..4deab8ac204 100644 --- a/sky/clouds/service_catalog/__init__.py +++ b/sky/clouds/service_catalog/__init__.py @@ -238,7 +238,7 @@ def get_default_instance_type(cpus: Optional[str] = None, def get_accelerators_from_instance_type( instance_type: str, - clouds: CloudFilter = None) -> Optional[Dict[str, int]]: + clouds: CloudFilter = None) -> Optional[Dict[str, Union[int, float]]]: """Returns the accelerators from a instance type.""" return _map_clouds_catalog(clouds, 'get_accelerators_from_instance_type', instance_type) diff --git a/sky/clouds/service_catalog/aws_catalog.py b/sky/clouds/service_catalog/aws_catalog.py index d156135047b..918a4070414 100644 --- a/sky/clouds/service_catalog/aws_catalog.py +++ b/sky/clouds/service_catalog/aws_catalog.py @@ -8,7 +8,7 @@ import os import threading import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky import exceptions from sky import sky_logging @@ -243,7 +243,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl( _get_df(), instance_type) diff --git a/sky/clouds/service_catalog/azure_catalog.py b/sky/clouds/service_catalog/azure_catalog.py index 867141f7899..62cb422bf83 100644 --- a/sky/clouds/service_catalog/azure_catalog.py +++ b/sky/clouds/service_catalog/azure_catalog.py @@ -4,7 +4,7 @@ instance types and pricing information for Azure. 
""" import re -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky import clouds as cloud_lib from sky import sky_logging @@ -137,7 +137,7 @@ def _filter_disk_type(instance_type: str) -> bool: def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) @@ -157,6 +157,7 @@ def get_instance_type_for_accelerator( if zone is not None: with ux_utils.print_exception_no_traceback(): raise ValueError('Azure does not support zones.') + return common.get_instance_type_for_accelerator_impl(df=_df, acc_name=acc_name, acc_count=acc_count, diff --git a/sky/clouds/service_catalog/common.py b/sky/clouds/service_catalog/common.py index 4df72824027..1082b4e9efd 100644 --- a/sky/clouds/service_catalog/common.py +++ b/sky/clouds/service_catalog/common.py @@ -5,7 +5,7 @@ import os import time import typing -from typing import Callable, Dict, List, NamedTuple, Optional, Tuple +from typing import Callable, Dict, List, NamedTuple, Optional, Tuple, Union import filelock import requests @@ -481,7 +481,7 @@ def get_instance_type_for_cpus_mem_impl( def get_accelerators_from_instance_type_impl( df: 'pd.DataFrame', instance_type: str, -) -> Optional[Dict[str, int]]: +) -> Optional[Dict[str, Union[int, float]]]: df = _get_instance_type(df, instance_type, None) if len(df) == 0: with ux_utils.print_exception_no_traceback(): @@ -490,13 +490,19 @@ def get_accelerators_from_instance_type_impl( acc_name, acc_count = row['AcceleratorName'], row['AcceleratorCount'] if pd.isnull(acc_name): return None - return {acc_name: int(acc_count)} + + def _convert(value): + if int(value) == value: + return int(value) + return float(value) + + return {acc_name: _convert(acc_count)} def get_instance_type_for_accelerator_impl( df: 'pd.DataFrame', acc_name: str, - acc_count: int, + acc_count: Union[int, float], cpus: Optional[str] = None, memory: Optional[str] = None, use_spot: bool = False, @@ -509,7 +515,7 @@ def get_instance_type_for_accelerator_impl( accelerators with sorted prices and a list of candidates with fuzzy search. 
""" result = df[(df['AcceleratorName'].str.fullmatch(acc_name, case=False)) & - (df['AcceleratorCount'] == acc_count)] + (abs(df['AcceleratorCount'] - acc_count) <= 0.01)] result = _filter_region_zone(result, region, zone) if len(result) == 0: fuzzy_result = df[ @@ -522,8 +528,11 @@ def get_instance_type_for_accelerator_impl( fuzzy_candidate_list = [] if len(fuzzy_result) > 0: for _, row in fuzzy_result.iterrows(): + acc_cnt = float(row['AcceleratorCount']) + acc_count_display = (int(acc_cnt) if acc_cnt.is_integer() else + f'{acc_cnt:.2f}') fuzzy_candidate_list.append(f'{row["AcceleratorName"]}:' - f'{int(row["AcceleratorCount"])}') + f'{acc_count_display}') return (None, fuzzy_candidate_list) result = _filter_with_cpus(result, cpus) diff --git a/sky/clouds/service_catalog/cudo_catalog.py b/sky/clouds/service_catalog/cudo_catalog.py index 62832cba5bf..d4adc5baea5 100644 --- a/sky/clouds/service_catalog/cudo_catalog.py +++ b/sky/clouds/service_catalog/cudo_catalog.py @@ -1,7 +1,7 @@ """Cudo Compute Offerings Catalog.""" import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.clouds.service_catalog import common import sky.provision.cudo.cudo_machine_type as cudo_mt @@ -66,7 +66,7 @@ def get_default_instance_type(cpus: Optional[str] = None, def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_azure.py b/sky/clouds/service_catalog/data_fetchers/fetch_azure.py index bbd337e23aa..f646cac339a 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_azure.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_azure.py @@ -93,14 +93,15 @@ def get_regions() -> List[str]: # We have to manually remove it. DEPRECATED_FAMILIES = ['standardNVSv2Family'] -# Some A10 instance types only contains a fractional of GPU. We temporarily -# filter them out here to avoid using it as a whole A10 GPU. -# TODO(zhwu,tian): support fractional GPUs, which can be done on -# kubernetes as well. +# Azure has those fractional A10 instance types, which still shows has 1 A10 GPU +# in the API response. We manually changing the number of GPUs to a float here. # Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/nva10v5-series -FILTERED_A10_INSTANCE_TYPES = [ - f'Standard_NV{vcpu}ads_A10_v5' for vcpu in [6, 12, 18] -] +# TODO(zhwu,tian): Support fractional GPUs on k8s as well. +# TODO(tian): Maybe we should support literally fractional count, i.e. A10:1/6 +# instead of float point count (A10:0.167). +AZURE_FRACTIONAL_A10_INS_TYPE_TO_NUM_GPUS = { + f'Standard_NV{vcpu}ads_A10_v5': round(vcpu / 36, 3) for vcpu in [6, 12, 18] +} USEFUL_COLUMNS = [ 'InstanceType', 'AcceleratorName', 'AcceleratorCount', 'vCPUs', 'MemoryGiB', @@ -274,6 +275,19 @@ def get_additional_columns(row): axis='columns', ) + def _upd_a10_gpu_count(row): + new_gpu_cnt = AZURE_FRACTIONAL_A10_INS_TYPE_TO_NUM_GPUS.get( + row['InstanceType']) + if new_gpu_cnt is not None: + return new_gpu_cnt + return row['AcceleratorCount'] + + # Manually update the GPU count for fractional A10 instance types. + # Those instance types have fractional GPU count, but Azure API returns + # 1 GPU count for them. We manually update the GPU count here. 
+ df_ret['AcceleratorCount'] = df_ret.apply(_upd_a10_gpu_count, + axis='columns') + # As of Dec 2023, a few H100 instance types fetched from Azure APIs do not # have pricing: # @@ -299,10 +313,6 @@ def get_additional_columns(row): after_drop_len = len(df_ret) print(f'Dropped {before_drop_len - after_drop_len} duplicated rows') - # Filter out instance types that only contain a fractional of GPU. - df_ret = df_ret.loc[~df_ret['InstanceType'].isin(FILTERED_A10_INSTANCE_TYPES - )] - # Filter out deprecated families df_ret = df_ret.loc[~df_ret['family'].isin(DEPRECATED_FAMILIES)] df_ret = df_ret[USEFUL_COLUMNS] diff --git a/sky/clouds/service_catalog/fluidstack_catalog.py b/sky/clouds/service_catalog/fluidstack_catalog.py index 2f47a38df43..7a28ac8174a 100644 --- a/sky/clouds/service_catalog/fluidstack_catalog.py +++ b/sky/clouds/service_catalog/fluidstack_catalog.py @@ -4,7 +4,7 @@ instance types and pricing information for FluidStack. """ import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.clouds.service_catalog import common from sky.utils import ux_utils @@ -65,7 +65,7 @@ def get_default_instance_type(cpus: Optional[str] = None, def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git a/sky/clouds/service_catalog/ibm_catalog.py b/sky/clouds/service_catalog/ibm_catalog.py index 51b4e14f569..5cec86fbb65 100644 --- a/sky/clouds/service_catalog/ibm_catalog.py +++ b/sky/clouds/service_catalog/ibm_catalog.py @@ -4,7 +4,7 @@ instance types and pricing information for IBM. """ -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky import sky_logging from sky.adaptors import ibm @@ -43,7 +43,7 @@ def get_vcpus_mem_from_instance_type( def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git a/sky/clouds/service_catalog/lambda_catalog.py b/sky/clouds/service_catalog/lambda_catalog.py index e843ab72cc0..24cb4064d54 100644 --- a/sky/clouds/service_catalog/lambda_catalog.py +++ b/sky/clouds/service_catalog/lambda_catalog.py @@ -4,7 +4,7 @@ instance types and pricing information for Lambda. 
""" import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.clouds.service_catalog import common from sky.utils import resources_utils @@ -72,7 +72,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git a/sky/clouds/service_catalog/oci_catalog.py b/sky/clouds/service_catalog/oci_catalog.py index 47d0489f6ab..c8e475df871 100644 --- a/sky/clouds/service_catalog/oci_catalog.py +++ b/sky/clouds/service_catalog/oci_catalog.py @@ -14,7 +14,7 @@ import logging import threading import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.adaptors import oci as oci_adaptor from sky.clouds import OCI @@ -131,7 +131,7 @@ def _filter_disk_type(instance_type: str) -> bool: def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl( _get_df(), instance_type) diff --git a/sky/clouds/service_catalog/paperspace_catalog.py b/sky/clouds/service_catalog/paperspace_catalog.py index 1eb635c93e5..49948b219a1 100644 --- a/sky/clouds/service_catalog/paperspace_catalog.py +++ b/sky/clouds/service_catalog/paperspace_catalog.py @@ -5,7 +5,7 @@ """ import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.clouds.service_catalog import common from sky.utils import ux_utils @@ -60,7 +60,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git a/sky/clouds/service_catalog/runpod_catalog.py b/sky/clouds/service_catalog/runpod_catalog.py index 2d3ed44307b..7fbc46206ed 100644 --- a/sky/clouds/service_catalog/runpod_catalog.py +++ b/sky/clouds/service_catalog/runpod_catalog.py @@ -5,7 +5,7 @@ """ import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.clouds.service_catalog import common from sky.utils import ux_utils @@ -56,7 +56,7 @@ def get_default_instance_type(cpus: Optional[str] = None, def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git a/sky/clouds/service_catalog/scp_catalog.py b/sky/clouds/service_catalog/scp_catalog.py index 209bb4cf631..e4773ab3250 100644 --- a/sky/clouds/service_catalog/scp_catalog.py +++ b/sky/clouds/service_catalog/scp_catalog.py @@ -5,7 +5,7 @@ """ import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.clouds.service_catalog import common from sky.utils import resources_utils @@ -67,7 +67,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl(_df, instance_type) diff --git 
a/sky/clouds/service_catalog/vsphere_catalog.py b/sky/clouds/service_catalog/vsphere_catalog.py index e1199d3d266..74fb2fbe60d 100644 --- a/sky/clouds/service_catalog/vsphere_catalog.py +++ b/sky/clouds/service_catalog/vsphere_catalog.py @@ -2,7 +2,7 @@ import io import os import typing -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Tuple, Union from sky.adaptors import common as adaptors_common from sky.clouds.service_catalog import common @@ -85,7 +85,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( - instance_type: str) -> Optional[Dict[str, int]]: + instance_type: str) -> Optional[Dict[str, Union[int, float]]]: return common.get_accelerators_from_instance_type_impl( _get_df(), instance_type) diff --git a/sky/clouds/vsphere.py b/sky/clouds/vsphere.py index 7cf56b46a8d..88d5df3232a 100644 --- a/sky/clouds/vsphere.py +++ b/sky/clouds/vsphere.py @@ -1,8 +1,7 @@ """Vsphere cloud implementation.""" -import json import subprocess import typing -from typing import Dict, Iterator, List, Optional, Tuple +from typing import Dict, Iterator, List, Optional, Tuple, Union import requests @@ -152,7 +151,7 @@ def get_default_instance_type( def get_accelerators_from_instance_type( cls, instance_type: str, - ) -> Optional[Dict[str, int]]: + ) -> Optional[Dict[str, Union[int, float]]]: return service_catalog.get_accelerators_from_instance_type( instance_type, clouds=_CLOUD_VSPHERE) @@ -182,10 +181,8 @@ def make_deploy_resources_variables( zone_names = [zone.name for zone in zones] r = resources acc_dict = self.get_accelerators_from_instance_type(r.instance_type) - if acc_dict is not None: - custom_resources = json.dumps(acc_dict, separators=(',', ':')) - else: - custom_resources = None + custom_resources = resources_utils.make_ray_custom_resources_str( + acc_dict) return { 'instance_type': resources.instance_type, diff --git a/sky/resources.py b/sky/resources.py index 540cbfb703c..164ef312ba1 100644 --- a/sky/resources.py +++ b/sky/resources.py @@ -392,7 +392,7 @@ def memory(self) -> Optional[str]: @property @functools.lru_cache(maxsize=1) - def accelerators(self) -> Optional[Dict[str, int]]: + def accelerators(self) -> Optional[Dict[str, Union[int, float]]]: """Returns the accelerators field directly or by inferring. For example, Resources(AWS, 'p3.2xlarge') has its accelerators field diff --git a/sky/utils/resources_utils.py b/sky/utils/resources_utils.py index 72aa5ac05d3..653bb109ac0 100644 --- a/sky/utils/resources_utils.py +++ b/sky/utils/resources_utils.py @@ -2,9 +2,11 @@ import dataclasses import enum import itertools +import json +import math import re import typing -from typing import List, Optional, Set +from typing import Dict, List, Optional, Set, Union from sky import skypilot_config from sky.clouds import cloud_registry @@ -163,6 +165,16 @@ def get_readable_resources_repr(handle: 'backends.CloudVmRayResourceHandle', return _DEFAULT_MESSAGE_HANDLE_INITIALIZING +def make_ray_custom_resources_str( + resource_dict: Optional[Dict[str, Union[int, float]]]) -> Optional[str]: + """Convert resources to Ray custom resources format.""" + if resource_dict is None: + return None + # Ray does not allow fractional resources, so we need to ceil the values. + ceiled_dict = {k: math.ceil(v) for k, v in resource_dict.items()} + return json.dumps(ceiled_dict, separators=(',', ':')) + + @dataclasses.dataclass class FeasibleResources: """Feasible resources returned by cloud. 
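Taken together, the pieces of this patch implement the following accounting for fractional GPUs: the catalog stores a float count, the cluster registers a ceiled Ray custom resource, and `check_resources_fit_cluster` scales the request to match. A minimal standalone sketch, using a hypothetical fractional A10 slice from the Azure catalog:

```python
import math

# Catalog entry for a fractional Azure A10 slice: round(6 / 36, 3).
launched = {'A10': 0.167}

# The Ray custom resource registered on the cluster is the ceiled count
# (make_ray_custom_resources_str above).
ray_resource = {k: math.ceil(v) for k, v in launched.items()}

# A task requesting the same slice is scaled by (1 / launched count), so its
# demand matches the rounded-up unit registered above.
requested = {'A10': 0.167}
for _, count in launched.items():
    if isinstance(count, float) and not count.is_integer():
        requested = {k: v / count for k, v in requested.items()}

print(ray_resource)  # {'A10': 1}
print(requested)     # {'A10': 1.0}

# The catalog lookup tolerates small drift: abs(catalog - requested) <= 0.01,
# so the stored 0.167 still matches a request computed as 6 / 36.
print(abs(0.167 - 6 / 36) <= 0.01)  # True
```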
From c0c17483d1f692ad639144050f5f6fa0966e47a5 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sat, 26 Oct 2024 16:30:52 -0700 Subject: [PATCH 93/93] [Jobs] Refactor: Extract task failure state update helper (#4185) refactor: a unified exception handling utility --- sky/jobs/controller.py | 61 +++++++++++++++++++----------------------- 1 file changed, 28 insertions(+), 33 deletions(-) diff --git a/sky/jobs/controller.py b/sky/jobs/controller.py index 1faa5dfbe31..73d509be9ef 100644 --- a/sky/jobs/controller.py +++ b/sky/jobs/controller.py @@ -340,48 +340,28 @@ def run(self): common_utils.format_exception(reason, use_bracket=True) for reason in e.reasons)) logger.error(failure_reason) - managed_job_state.set_failed( - self._job_id, - task_id=task_id, - failure_type=managed_job_state.ManagedJobStatus. - FAILED_PRECHECKS, - failure_reason=failure_reason, - callback_func=managed_job_utils.event_callback_func( - job_id=self._job_id, - task_id=task_id, - task=self._dag.tasks[task_id])) + self._update_failed_task_state( + task_id, managed_job_state.ManagedJobStatus.FAILED_PRECHECKS, + failure_reason) except exceptions.ManagedJobReachedMaxRetriesError as e: # Please refer to the docstring of self._run for the cases when # this exception can occur. - logger.error(common_utils.format_exception(e)) + failure_reason = common_utils.format_exception(e) + logger.error(failure_reason) # The managed job should be marked as FAILED_NO_RESOURCE, as the # managed job may be able to launch next time. - managed_job_state.set_failed( - self._job_id, - task_id=task_id, - failure_type=managed_job_state.ManagedJobStatus. - FAILED_NO_RESOURCE, - failure_reason=common_utils.format_exception(e), - callback_func=managed_job_utils.event_callback_func( - job_id=self._job_id, - task_id=task_id, - task=self._dag.tasks[task_id])) + self._update_failed_task_state( + task_id, managed_job_state.ManagedJobStatus.FAILED_NO_RESOURCE, + failure_reason) except (Exception, SystemExit) as e: # pylint: disable=broad-except with ux_utils.enable_traceback(): logger.error(traceback.format_exc()) - msg = ('Unexpected error occurred: ' - f'{common_utils.format_exception(e, use_bracket=True)}') + msg = ('Unexpected error occurred: ' + + common_utils.format_exception(e, use_bracket=True)) logger.error(msg) - managed_job_state.set_failed( - self._job_id, - task_id=task_id, - failure_type=managed_job_state.ManagedJobStatus. - FAILED_CONTROLLER, - failure_reason=msg, - callback_func=managed_job_utils.event_callback_func( - job_id=self._job_id, - task_id=task_id, - task=self._dag.tasks[task_id])) + self._update_failed_task_state( + task_id, managed_job_state.ManagedJobStatus.FAILED_CONTROLLER, + msg) finally: # This will set all unfinished tasks to CANCELLING, and will not # affect the jobs in terminal states. @@ -396,6 +376,21 @@ def run(self): managed_job_state.set_cancelled(job_id=self._job_id, callback_func=callback_func) + def _update_failed_task_state( + self, task_id: int, + failure_type: managed_job_state.ManagedJobStatus, + failure_reason: str): + """Update the state of the failed task.""" + managed_job_state.set_failed( + self._job_id, + task_id=task_id, + failure_type=failure_type, + failure_reason=failure_reason, + callback_func=managed_job_utils.event_callback_func( + job_id=self._job_id, + task_id=task_id, + task=self._dag.tasks[task_id])) + def _run_controller(job_id: int, dag_yaml: str, retry_until_up: bool): """Runs the controller in a remote process for interruption."""
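The refactor above is a plain extract-method cleanup: the three except blocks previously duplicated the same `managed_job_state.set_failed` call and differed only in failure type and reason. A heavily simplified sketch of the resulting shape (hypothetical class body; the real helper also wires up the event callback and job state):

```python
class JobsController:
    """Simplified sketch of the consolidated failure handling."""

    def __init__(self, job_id: int):
        self._job_id = job_id

    def _update_failed_task_state(self, task_id: int, failure_type: str,
                                  failure_reason: str) -> None:
        # Single choke point: the job_id and callback wiring live here
        # instead of being repeated in every except block.
        print(f'job {self._job_id}, task {task_id} -> '
              f'{failure_type}: {failure_reason}')

    def run(self, task_id: int) -> None:
        try:
            raise RuntimeError('launch failed')  # stand-in for the real work
        except RuntimeError as e:
            self._update_failed_task_state(task_id, 'FAILED_CONTROLLER',
                                           str(e))


JobsController(job_id=42).run(task_id=0)
```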