[Docs] Add docs for run many jobs #3847

Merged
merged 49 commits into from
Aug 29, 2024
50ad6ef
Add docs for running N jobs
Michaelvll Aug 20, 2024
1 change: 1 addition & 0 deletions docs/source/_static/custom.js
@@ -29,6 +29,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.toctree-l1 > a', text: 'Managed Jobs' },
{ selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
{ selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' },
{ selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
3 changes: 2 additions & 1 deletion docs/source/docs/index.rst
@@ -129,8 +129,8 @@ Read the research:

../getting-started/installation
../getting-started/quickstart
../getting-started/tutorial
../examples/interactive-development
../getting-started/tutorial


.. toctree::
@@ -143,6 +143,7 @@ Read the research:
../examples/auto-failover
../reference/kubernetes/index
../running-jobs/distributed-jobs
../running-jobs/many-jobs

.. toctree::
:hidden:
2 changes: 1 addition & 1 deletion docs/source/getting-started/quickstart.rst
@@ -219,7 +219,7 @@ Congratulations! In this quickstart, you have launched a cluster, run a task, a

Next steps:

-- Adapt :ref:`Tutorial: DNN Training <dnn-training>` to start running your own project on SkyPilot!
+- Adapt :ref:`Tutorial: AI Training <ai-training>` to start running your own project on SkyPilot!
- See the :ref:`Task YAML reference <yaml-spec>`, :ref:`CLI reference <cli>`, and `more examples <https://github.com/skypilot-org/skypilot/tree/master/examples>`_
- To learn more, try out `SkyPilot Tutorials <https://github.com/skypilot-org/skypilot-tutorial>`_ in Jupyter notebooks

4 changes: 2 additions & 2 deletions docs/source/getting-started/tutorial.rst
@@ -1,6 +1,6 @@
-.. _dnn-training:
+.. _ai-training:

-Tutorial: DNN Training
+Tutorial: AI Training
======================
This example uses SkyPilot to train a Transformer-based language model from HuggingFace.

2 changes: 1 addition & 1 deletion docs/source/reference/job-queue.rst
@@ -160,7 +160,7 @@ SkyPilot's scheduler serves two goals:
2. **Minimizing resource idleness**: If a resource is idle, SkyPilot will schedule a
queued job that can utilize that resource.

-We illustrate the scheduling behavior by revisiting :ref:`Tutorial: DNN Training <dnn-training>`.
+We illustrate the scheduling behavior by revisiting :ref:`Tutorial: AI Training <ai-training>`.
In that tutorial, we have a task YAML that specifies these resource requirements:

.. code-block:: yaml
345 changes: 345 additions & 0 deletions docs/source/running-jobs/many-jobs.rst
@@ -0,0 +1,345 @@

.. _many-jobs:

Many Parallel Jobs
======================

SkyPilot allows you to easily **run many jobs in parallel** and manage them in a single system. This is useful for hyperparameter tuning sweeps, data processing, and other batch jobs.

This guide shows a typical workflow for running many jobs with SkyPilot.


.. image:: https://i.imgur.com/x1aFtVn.png
  :width: 90%
  :align: center

.. TODO: Show the components in a GIF.

Reasons to use SkyPilot for running many jobs:

- **Seamless**: Utilize resources on any of your infrastructure (Kubernetes, cloud VMs, reservations, etc.)
- **Observable**: See and manage all jobs in a single pane of glass
- **Elastic**: Scale up and down based on demands
- **Cost-effective**: Only pay for the cheapest resources
- **Robust**: Automatically recover jobs from failures

Develop a YAML for One Job
-----------------------------------

Before scaling up to many jobs, first develop a SkyPilot YAML for a single job and make sure it runs correctly. This saves time, because you avoid debugging many jobs at once.

Here is the same example YAML as in :ref:`Tutorial: AI Training <ai-training>`:

.. raw:: html

  <details>
  <summary>Click to expand: <code>train.yaml</code></summary>

.. code-block:: yaml

  # train.yaml
  name: huggingface

  resources:
    accelerators: V100:4

  setup: |
    set -e  # Exit if any command failed.
    git clone https://github.com/huggingface/transformers/ || true
    cd transformers
    pip install .
    cd examples/pytorch/text-classification
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

  run: |
    set -e  # Exit if any command failed.
    cd transformers/examples/pytorch/text-classification
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --dataset_name imdb \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate 2e-5 \
      --max_steps 50 \
      --output_dir /tmp/imdb/ --overwrite_output_dir \
      --fp16


.. raw:: html

  </details>


First, launch the job to check that it starts and runs correctly:

.. code-block:: bash

  sky launch -c train train.yaml


If there is any error, you can fix the code and/or the YAML, and launch the job again on the same cluster:

.. code-block:: bash

  # Cancel the latest job.
  sky cancel train -y
  # Run the job again on the same cluster.
  sky launch -c train train.yaml


Sometimes, it may be more efficient to log into the cluster and interactively debug the job. You can do so by directly :ref:`ssh'ing into the cluster or using VSCode's remote ssh <dev-connect>`.

.. code-block:: bash

  # Log into the cluster.
  ssh train



Next, after confirming the job is working correctly, **add (hyper)parameters** to the job YAML so that all job variants can be specified.

1. Add Hyperparameters
~~~~~~~~~~~~~~~~~~~~~~

To launch jobs with different hyperparameters, add them as :ref:`environment variables <env-vars>` to the SkyPilot YAML, and make your main program read these environment variables:

.. raw:: html

  <details>
  <summary>Updated SkyPilot YAML: <code>train-template.yaml</code></summary>

.. code-block:: yaml
  :emphasize-lines: 4-6,28-29

  # train-template.yaml
  name: huggingface

  envs:
    LR: 2e-5
    MAX_STEPS: 50

  resources:
    accelerators: V100:4

  setup: |
    set -e  # Exit if any command failed.
    git clone https://github.com/huggingface/transformers/ || true
    cd transformers
    pip install .
    cd examples/pytorch/text-classification
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

  run: |
    set -e  # Exit if any command failed.
    cd transformers/examples/pytorch/text-classification
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --dataset_name imdb \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate ${LR} \
      --max_steps ${MAX_STEPS} \
      --output_dir /tmp/imdb/ --overwrite_output_dir \
      --fp16

.. raw:: html

  </details>

You can now use ``--env`` to launch a job with different hyperparameters:

.. code-block:: bash

  sky launch -c train train-template.yaml \
    --env LR=1e-5 \
    --env MAX_STEPS=100

Alternatively, store the environment variable values in a dotenv file and pass it with ``--env-file``:

.. code-block:: bash

  # configs/job1
  LR=1e-5
  MAX_STEPS=100

.. code-block:: bash

  sky launch -c train train-template.yaml \
    --env-file configs/job1
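
``run_glue.py`` takes its hyperparameters as command-line flags, which is why the YAML expands ``${LR}`` and ``${MAX_STEPS}`` in the shell. If your own entry point reads the environment variables directly, it could look like the following minimal sketch. The helper name ``read_hyperparams`` and the fallback defaults are assumptions for illustration, not part of SkyPilot:

.. code-block:: python

  import os


  def read_hyperparams(environ=None):
      """Read LR and MAX_STEPS from environment variables (hypothetical helper)."""
      environ = os.environ if environ is None else environ
      # Environment variables are always strings, so convert them explicitly.
      # The fallbacks mirror the defaults in the ``envs`` section of the YAML.
      lr = float(environ.get('LR', '2e-5'))
      max_steps = int(environ.get('MAX_STEPS', '50'))
      return lr, max_steps


  lr, max_steps = read_hyperparams({'LR': '1e-5', 'MAX_STEPS': '100'})
  print(f'learning_rate={lr} max_steps={max_steps}')

Passing a plain dictionary instead of ``os.environ`` keeps the parsing logic easy to test locally before any cluster is launched.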



2. Logging Job Outputs
~~~~~~~~~~~~~~~~~~~~~~~

When running many jobs, it is useful to log the outputs of all jobs. You can use tools like `W&B <https://wandb.ai>`__ for this purpose:

.. raw:: html

  <details>
  <summary>SkyPilot YAML with W&B: <code>train-template.yaml</code></summary>

.. code-block:: yaml
  :emphasize-lines: 7-7,19-19,34-34

  # train-template.yaml
  name: huggingface

  envs:
    LR: 2e-5
    MAX_STEPS: 50
    WANDB_API_KEY: # Empty field means this field is required when launching the job.

  resources:
    accelerators: V100:4

  setup: |
    set -e  # Exit if any command failed.
    git clone https://github.com/huggingface/transformers/ || true
    cd transformers
    pip install .
    cd examples/pytorch/text-classification
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install wandb

  run: |
    set -e  # Exit if any command failed.
    cd transformers/examples/pytorch/text-classification
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --dataset_name imdb \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate ${LR} \
      --max_steps ${MAX_STEPS} \
      --output_dir /tmp/imdb/ --overwrite_output_dir \
      --fp16 \
      --report_to wandb

.. raw:: html

  </details>

You can now launch the job with the following command (``WANDB_API_KEY`` must exist in your local environment variables):

.. code-block:: bash

  sky launch -c train train-template.yaml \
    --env-file configs/job1 \
    --env WANDB_API_KEY
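
Since ``--env WANDB_API_KEY`` is passed without a value, the key has to be set in the shell that runs ``sky launch``. A small pre-flight check can catch a missing variable before anything is submitted. This is only a sketch; ``missing_env_vars`` is a hypothetical helper, not a SkyPilot API:

.. code-block:: python

  import os


  def missing_env_vars(required, environ=None):
      """Return the names in `required` that are unset or empty."""
      environ = os.environ if environ is None else environ
      return [name for name in required if not environ.get(name)]


  missing = missing_env_vars(['WANDB_API_KEY'])
  if missing:
      print('Set these variables before launching: ' + ', '.join(missing))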



Scale Out to Many Jobs
-----------------------

With the above setup, you can now scale out to run many jobs in parallel. You
can use either the SkyPilot CLI with many config files or the SkyPilot Python API.

With Config Files
~~~~~~~~~~~~~~~~~~~~

You can run many jobs in parallel by (1) creating multiple config files and (2)
submitting them with :ref:`SkyPilot managed jobs <managed-jobs>`.

First, create a config file for each job (for example, in a ``configs`` directory):

.. code-block:: bash

  # configs/job1
  LR=1e-5
  MAX_STEPS=100

  # configs/job2
  LR=2e-5
  MAX_STEPS=200

  ...

.. raw:: html

  <details>
  <summary>An example Python script to generate config files</summary>

.. code-block:: python

  import os

  CONFIG_PATH = 'configs'
  LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
  MAX_STEPS_CANDIDATES = [100, 300, 1000]

  os.makedirs(CONFIG_PATH, exist_ok=True)

  # Write one dotenv-style file per (LR, MAX_STEPS) combination.
  job_idx = 1
  for lr in LR_CANDIDATES:
      for max_steps in MAX_STEPS_CANDIDATES:
          config_file = os.path.join(CONFIG_PATH, f'job{job_idx}')
          with open(config_file, 'w') as f:
              print(f'LR={lr}', file=f)
              print(f'MAX_STEPS={max_steps}', file=f)
          job_idx += 1

.. raw:: html

  </details>

Then, submit all jobs by iterating over the config files and calling ``sky jobs launch`` on each:

.. code-block:: bash

  for config_file in configs/*; do
    job_name=$(basename "$config_file")
    # -y: yes to all prompts.
    # -d: detach from the job's logging, so the next job can be submitted
    #     without waiting for the previous job to finish.
    sky jobs launch -n "train-$job_name" -y -d train-template.yaml \
      --env-file "$config_file" \
      --env WANDB_API_KEY
  done
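
The same loop can be driven from Python when the submission logic needs more than a shell one-liner. The sketch below builds the ``sky jobs launch`` command lines with the same flags as the loop above and leaves execution to ``subprocess.run``. The helper names ``build_launch_commands`` and ``submit_all`` are assumptions for illustration:

.. code-block:: python

  import pathlib
  import subprocess


  def build_launch_commands(config_dir='configs', template='train-template.yaml'):
      """Build one `sky jobs launch` command per config file (hypothetical helper)."""
      commands = []
      # Sort the entries for a deterministic submission order.
      for config_file in sorted(pathlib.Path(config_dir).iterdir()):
          commands.append([
              'sky', 'jobs', 'launch',
              '-n', f'train-{config_file.name}',
              '-y',  # Yes to all prompts.
              '-d',  # Detach so the next job can be submitted immediately.
              template,
              '--env-file', str(config_file),
              '--env', 'WANDB_API_KEY',
          ])
      return commands


  def submit_all(commands):
      """Run each command; check=True stops at the first failed submission."""
      for command in commands:
          subprocess.run(command, check=True)

Keeping command construction separate from execution makes it possible to print and inspect the commands before any job is actually submitted.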


Job statuses can be checked via ``sky jobs queue``:

.. code-block:: console

  $ sky jobs queue

  Fetching managed job statuses...
  Managed jobs
  In progress tasks: 10 RUNNING
  ID  TASK  NAME         RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
  10  -     train-job10  1x[V100:4]  5 mins ago  5m 5s          1m 12s        0            RUNNING
  9   -     train-job9   1x[V100:4]  6 mins ago  6m 11s         2m 23s        0            RUNNING
  8   -     train-job8   1x[V100:4]  7 mins ago  7m 15s         3m 31s        0            RUNNING
  ...


With Python API
~~~~~~~~~~~~~~~

To have more customized control over the jobs, you can also use the SkyPilot Python
API to launch them.

.. code-block:: python

  import sky

  LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
  MAX_STEPS_CANDIDATES = [100, 300, 1000]
  task = sky.Task.from_yaml('train-template.yaml')

  job_idx = 1
  for lr in LR_CANDIDATES:
      for max_steps in MAX_STEPS_CANDIDATES:
          # Override the hyperparameters for this job variant.
          task.update_envs({'LR': lr, 'MAX_STEPS': max_steps})
          sky.jobs.launch(
              task,
              name=f'train-job{job_idx}',
              detach_run=True,
              retry_until_up=True,
          )
          job_idx += 1
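
The nested loops enumerate a full grid of 5 × 3 = 15 hyperparameter combinations. As more hyperparameters are added, ``itertools.product`` keeps the enumeration flat. The sketch below only builds job names and environment dictionaries; each pair could then be passed to ``task.update_envs`` and ``sky.jobs.launch`` as in the example above. ``job_specs`` is a hypothetical helper, and converting values to strings is an assumption made here because environment variables are strings:

.. code-block:: python

  import itertools

  LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
  MAX_STEPS_CANDIDATES = [100, 300, 1000]


  def job_specs(lrs, max_steps_list):
      """Yield (job_name, envs) for every combination (hypothetical helper)."""
      grid = itertools.product(lrs, max_steps_list)
      for job_idx, (lr, max_steps) in enumerate(grid, start=1):
          yield f'train-job{job_idx}', {'LR': str(lr), 'MAX_STEPS': str(max_steps)}


  for name, envs in job_specs(LR_CANDIDATES, MAX_STEPS_CANDIDATES):
      print(name, envs)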