Merge branch 'kueue-multipod' of https://github.com/asaiacai/skypilot into kueue-multipod
asaiacai committed May 25, 2024
2 parents e6de9d6 + bb06626 commit edf10de
Showing 101 changed files with 1,197 additions and 477 deletions.
24 changes: 22 additions & 2 deletions docs/source/examples/managed-jobs.rst
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@ Managed Jobs

This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines).

SkyPilot supports **managed jobs**, which can automatically recover from any spot preemptions or hardware failures.
SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
It can be used in three modes:

#. :ref:`Managed Spot Jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
@@ -20,9 +20,29 @@ It can be used in three modes:
Managed Spot Jobs
-----------------

SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
In this mode, use :code:`sky jobs launch --use-spot` to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.


Quick comparison between *unmanaged spot clusters* vs. *managed spot jobs*:

.. list-table::
:widths: 30 18 12 35
:header-rows: 1

* - Command
- Managed?
- SSH-able?
- Best for
* - :code:`sky launch --use-spot`
- Unmanaged spot cluster
- Yes
- Interactive dev on spot instances (especially for hardware with low preemption rates)
* - :code:`sky jobs launch --use-spot`
- Managed spot job (auto-recovery)
- No
- Scaling out long-running jobs (e.g., data processing, training, batch inference)

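The two entry points in the table above can be sketched as follows (``task.yaml`` is an illustrative placeholder for any task YAML):

.. code-block:: console

   $ # Unmanaged spot cluster: SSH-able, but preemptions are not recovered.
   $ sky launch --use-spot task.yaml

   $ # Managed spot job: preemptions are automatically recovered.
   $ sky jobs launch --use-spot task.yaml
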
Here is an example of a BERT training job failing over different regions across AWS and GCP.

.. image:: https://i.imgur.com/Vteg3fK.gif
4 changes: 4 additions & 0 deletions docs/source/getting-started/installation.rst
@@ -164,6 +164,10 @@ section :ref:`below <cloud-account-setup>`.
If your clouds show ``enabled`` --- |:tada:| |:tada:| **Congratulations!** |:tada:| |:tada:| You can now head over to
:ref:`Quickstart <quickstart>` to get started with SkyPilot.

.. tip::

To check credentials only for specific clouds, pass the clouds as arguments: :code:`sky check aws gcp`

.. _cloud-account-setup:

Cloud account setup
13 changes: 13 additions & 0 deletions docs/source/reference/config.rst
@@ -27,6 +27,19 @@ Available fields and semantics:
cpus: 4+ # number of vCPUs, max concurrent spot jobs = 2 * cpus
disk_size: 100
# Allow list for clouds to be used in `sky check`
#
# This field is used to restrict the clouds that SkyPilot will check and use
# when running `sky check`. Any cloud already enabled but not specified here
# will be disabled on the next `sky check` run.
# If this field is not set, SkyPilot will check and use all supported clouds.
#
# Default: null (use all supported clouds).
allowed_clouds:
- aws
- gcp
- kubernetes
# Advanced AWS configurations (optional).
# Apply to all new instances but not existing ones.
aws:
14 changes: 14 additions & 0 deletions docs/source/running-jobs/environment-variables.rst
@@ -12,6 +12,20 @@ You can specify environment variables to be made available to a task in two ways
- The ``envs`` field (dict) in a :ref:`task YAML <yaml-spec>`
- The ``--env`` flag in the ``sky launch/exec`` :ref:`CLI <cli>` (takes precedence over the above)

.. tip::

   If an environment variable must be specified with ``--env`` during ``sky
   launch/exec``, you can set it to ``null`` in the task YAML so that an error
   is raised when it is accidentally omitted. For example, ``WANDB_API_KEY``
   and ``HF_TOKEN`` in the following task YAML:

   .. code-block:: yaml

      envs:
        WANDB_API_KEY:
        HF_TOKEN: null
        MYVAR: val

The ``file_mounts``, ``setup``, and ``run`` sections of a task YAML can access the variables via the ``${MYVAR}`` syntax.
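For instance, a variable declared in ``envs`` can be read directly in ``run`` (an illustrative sketch; the variable name and value are placeholders):

.. code-block:: yaml

   envs:
     MYVAR: val

   run: |
     echo "MYVAR is ${MYVAR}"
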

Using in ``file_mounts``
22 changes: 8 additions & 14 deletions docs/source/serving/sky-serve.rst
@@ -22,7 +22,7 @@ Why SkyServe?
How it works:

- Each service gets an endpoint that automatically redirects requests to its replicas.
- Each service gets an endpoint that automatically distributes requests to its replicas.
- Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
- SkyServe handles the load balancing, recovery, and autoscaling of the replicas.
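To give a concrete flavor of the above, a minimal service YAML looks roughly like this (an illustrative sketch; field values and the served program are placeholders, not from this commit):

.. code-block:: yaml

   # service.yaml (illustrative)
   service:
     readiness_probe: /health
     replicas: 2

   resources:
     ports: 8080

   run: python -m http.server 8080

Running :code:`sky serve up service.yaml` would then expose a single endpoint in front of the two replicas.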

@@ -127,7 +127,7 @@ Run :code:`sky serve up service.yaml` to deploy the service with automatic price

Once the :code:`STATUS` column shows :code:`READY`, the service is ready to accept traffic!

Simply ``curl -L`` the service endpoint, which automatically load-balances across the two replicas:
Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas:

.. tab-set::

@@ -136,7 +136,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros

.. code-block:: console
$ curl -L 3.84.15.251:30001/v1/chat/completions \
$ curl 3.84.15.251:30001/v1/chat/completions \
-X POST \
-d '{"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "messages": [{"role": "user", "content": "Who are you?"}]}' \
-H 'Content-Type: application/json'
@@ -149,7 +149,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros

.. code-block:: console
$ curl -L 44.211.131.51:30001/generate \
$ curl 44.211.131.51:30001/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
@@ -240,7 +240,7 @@ Under the hood, :code:`sky serve up`:
#. Launches a controller which handles autoscaling, monitoring and load balancing;
#. Returns a Service Endpoint which will be used to accept traffic;
#. Meanwhile, the controller provisions replica VMs which later run the services;
#. Once any replica is ready, the requests sent to the Service Endpoint will be **HTTP-redirect** to one of the endpoint replicas.
#. Once any replica is ready, the requests sent to the Service Endpoint will be distributed to one of the endpoint replicas.

After the controller is provisioned, you'll see the following in :code:`sky serve status` output:

@@ -264,7 +264,7 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):

.. code-block:: console
$ curl -L <endpoint-url>
$ curl <endpoint-url>
<html>
<head>
<title>My First SkyServe Service</title>
@@ -274,12 +274,6 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):
</body>
</html>
.. note::

Since we are using HTTP-redirect, we need to use :code:`curl -L
<endpoint-url>`. The :code:`curl` command by default won't follow the
redirect.

Tutorial: Serve a Chatbot LLM!
------------------------------

@@ -368,7 +362,7 @@ Send a request using the following cURL command:

.. code-block:: console
$ curl -L http://<endpoint-url>/v1/chat/completions \
$ curl http://<endpoint-url>/v1/chat/completions \
-X POST \
-d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \
-H 'Content-Type: application/json'
@@ -468,7 +462,7 @@ SkyServe has a centralized controller VM that manages the deployment of your ser
It is composed of the following components:

#. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec <service-yaml-spec>` for more information).
#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas.
#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and distributes the requests to one of the replicas.

All of these processes share a single controller VM. The controller VM will be launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

2 changes: 1 addition & 1 deletion examples/cog/README.md
@@ -28,7 +28,7 @@ After the service is launched, access the deployment with the following:
```console
ENDPOINT=$(sky serve status --endpoint cog)

curl -L http://$ENDPOINT/predictions -X POST \
curl http://$ENDPOINT/predictions -X POST \
-H 'Content-Type: application/json' \
-d '{"input": {"image": "https://blog.skypilot.co/introducing-sky-serve/images/sky-serve-thumbnail.png"}}' \
| jq -r '.output | split(",")[1]' | base64 --decode > output.png
2 changes: 1 addition & 1 deletion examples/serve/llama2/llama2.yaml
@@ -25,7 +25,7 @@ resources:

envs:
MODEL_SIZE: 7
HF_TOKEN: <your-huggingface-token> # TODO: Replace with huggingface token
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

setup: |
conda activate chatbot
4 changes: 2 additions & 2 deletions examples/serve/misc/cancel/README.md
@@ -1,6 +1,6 @@
# SkyServe cancel example

This example demonstrates the redirect support canceling a request.
This example demonstrates that the SkyServe load balancer supports canceling an in-flight request.

## Running the example

@@ -33,7 +33,7 @@ Client disconnected, stopping computation.
You can also run

```bash
curl -L http://<endpoint>/
curl http://<endpoint>/
```

and manually Ctrl + C to cancel the request and see logs.
2 changes: 1 addition & 1 deletion examples/serve/stable_diffusion_service.yaml
@@ -18,7 +18,7 @@ file_mounts:
/stable_diffusion: examples/stable_diffusion

setup: |
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
cd stable-diffusion-webui-docker
sudo rm -r stable-diffusion-webui-docker
4 changes: 2 additions & 2 deletions examples/spot_pipeline/bert_qa_train_eval.yaml
@@ -42,7 +42,7 @@ run: |
echo Model saved to /checkpoint/bert_qa/$SKYPILOT_TASK_ID
envs:
WANDB_API_KEY: # NOTE: Fill in your wandb key
WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass.

---

@@ -84,4 +84,4 @@ run: |
--save_steps 1000
envs:
WANDB_API_KEY: # NOTE: Fill in your wandb key
WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass.
2 changes: 1 addition & 1 deletion examples/stable_diffusion/stable_diffusion_docker.yaml
@@ -7,7 +7,7 @@ file_mounts:
/stable_diffusion: .

setup: |
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
cd stable-diffusion-webui-docker
sudo rm -r stable-diffusion-webui-docker
29 changes: 29 additions & 0 deletions llm/axolotl/axolotl-docker.yaml
@@ -0,0 +1,29 @@
# Usage:
# HF_TOKEN=abc sky launch -c axolotl axolotl.yaml --env HF_TOKEN -y -i30 --down

name: axolotl

resources:
accelerators: L4:1
cloud: gcp # optional

workdir: mistral

setup: |
docker pull winglian/axolotl:main-py3.10-cu118-2.0.1
run: |
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
huggingface-cli login --token ${HF_TOKEN}
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml
envs:
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
27 changes: 6 additions & 21 deletions llm/axolotl/axolotl-spot.yaml
@@ -12,6 +12,7 @@ resources:
accelerators: A100:1
cloud: gcp # optional
use_spot: True
image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1

workdir: mistral

@@ -20,29 +21,13 @@ file_mounts:
name: ${BUCKET}
mode: MOUNT

setup: |
docker pull winglian/axolotl:main-py3.10-cu118-2.0.1
run: |
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
huggingface-cli login --token ${HF_TOKEN}
huggingface-cli login --token ${HF_TOKEN}
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
-v /sky-notebook:/sky-notebook \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
accelerate launch -m axolotl.cli.train /sky_workdir/qlora-checkpoint.yaml
accelerate launch -m axolotl.cli.train qlora-checkpoint.yaml
envs:
HF_TOKEN: <your-huggingface-token> # TODO: Replace with huggingface token
BUCKET: <a-unique-bucket-name-to-use>





HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
BUCKET: # TODO: Fill with your unique bucket name, or use --env to pass.

25 changes: 4 additions & 21 deletions llm/axolotl/axolotl.yaml
@@ -5,31 +5,14 @@ name: axolotl

resources:
accelerators: L4:1
cloud: gcp # optional
image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1

workdir: mistral

setup: |
docker pull winglian/axolotl:main-py3.10-cu118-2.0.1
run: |
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
huggingface-cli login --token ${HF_TOKEN}
huggingface-cli login --token ${HF_TOKEN}
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml
accelerate launch -m axolotl.cli.train qlora.yaml
envs:
HF_TOKEN: <your-huggingface-token> # TODO: Replace with huggingface token






HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
3 changes: 2 additions & 1 deletion llm/axolotl/mistral/qlora-checkpoint.yaml
@@ -71,6 +71,7 @@ warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
save_steps: 2 ## increase based on your dataset
save_strategy: steps
debug:
@@ -81,4 +82,4 @@ fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
unk_token: "<unk>"
3 changes: 2 additions & 1 deletion llm/axolotl/mistral/qlora.yaml
@@ -69,6 +69,7 @@ warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
save_steps:
debug:
deepspeed:
@@ -78,4 +79,4 @@ fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
unk_token: "<unk>"