[Docs] Add docs for run many jobs #3847

Merged
merged 49 commits into from
Aug 29, 2024
Changes from 4 commits
Commits
50ad6ef
Add docs for running N jobs
Michaelvll Aug 20, 2024
4254044
Fix language
Michaelvll Aug 20, 2024
19f294f
Fix job queue
Michaelvll Aug 20, 2024
64446e3
Add link to managed jobs
Michaelvll Aug 20, 2024
3e80dab
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
c21084a
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
2eeb7d3
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
772dbc3
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
7dace8c
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
8059c83
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
d580119
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
013d436
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
5e6b644
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
0144b96
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
23c4674
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
3c5b8ac
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
71ab53f
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
d01d5ad
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
ab5110e
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
2de0066
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
ee7218e
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
b096d90
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 28, 2024
4a70534
Add script for generating config files
Michaelvll Aug 28, 2024
4f90f24
Fix comments
Michaelvll Aug 28, 2024
1af226d
fix title
Michaelvll Aug 28, 2024
2837cad
fix title
Michaelvll Aug 29, 2024
e31d814
fix
Michaelvll Aug 29, 2024
86f7292
reduce image size
Michaelvll Aug 29, 2024
acb5690
restructure
Michaelvll Aug 29, 2024
a720ab2
rename
Michaelvll Aug 29, 2024
30b558f
adopt comments
Michaelvll Aug 29, 2024
512627c
Add benefits
Michaelvll Aug 29, 2024
eb50f26
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
f8f3eb7
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
48d6f80
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
d16bab6
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
e13f72e
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
ce8730b
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
8ba7a88
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
5afc752
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
516fdb1
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
39ca239
update
Michaelvll Aug 29, 2024
089ac31
Merge branch 'docs-scale-up-jobs' of github.com:skypilot-org/skypilot…
Michaelvll Aug 29, 2024
2440d90
rename
Michaelvll Aug 29, 2024
9b36c6b
fix
Michaelvll Aug 29, 2024
8651fac
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
0f02e4b
Update docs/source/running-jobs/many-jobs.rst
Michaelvll Aug 29, 2024
83e2b5c
Minor fix for comments
Michaelvll Aug 29, 2024
c071f68
Merge branch 'docs-scale-up-jobs' of github.com:skypilot-org/skypilot…
Michaelvll Aug 29, 2024
3 changes: 2 additions & 1 deletion docs/source/docs/index.rst
@@ -129,8 +129,8 @@ Read the research:

../getting-started/installation
../getting-started/quickstart
../getting-started/tutorial
../examples/interactive-development
../getting-started/tutorial


.. toctree::
@@ -143,6 +143,7 @@ Read the research:
../examples/auto-failover
../reference/kubernetes/index
../running-jobs/distributed-jobs
../running-jobs/many-jobs

.. toctree::
:hidden:
2 changes: 1 addition & 1 deletion docs/source/running-jobs/distributed-jobs.rst
@@ -1,6 +1,6 @@
.. _dist-jobs:

Distributed Jobs on Many Nodes
Guide: Distributed Jobs
================================================

SkyPilot supports multi-node cluster
273 changes: 273 additions & 0 deletions docs/source/running-jobs/many-jobs.rst
@@ -0,0 +1,273 @@

.. _many-jobs:

Guide: Run Many Jobs
====================

Hyperparameter tuning, data processing, and similar workloads commonly require running many jobs in parallel.
However, managing all those jobs can be a pain. SkyPilot allows us to **run many jobs in parallel** and **manage them in
a single pane of glass**.

We show a typical workflow for running many jobs with SkyPilot.

Develop a YAML for One Job
-----------------------------------

Before scaling up to many jobs, developing and debugging a SkyPilot YAML for a single job can save a lot of time, since it avoids debugging many jobs at once.

We use the same example of AI model training as in :ref:`Tutorial: DNN Training <dnn-training>`.

.. raw:: html

  <details>
  <summary>Click to expand: <code>dnn.yaml</code></summary>

.. code-block:: yaml

  # dnn.yaml
  name: huggingface

  resources:
    accelerators: V100:4

  setup: |
    set -e  # Exit if any command failed.
    git clone https://github.com/huggingface/transformers/ || true
    cd transformers
    pip install .
    cd examples/pytorch/text-classification
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

  run: |
    set -e  # Exit if any command failed.
    cd transformers/examples/pytorch/text-classification
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --dataset_name imdb \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate 2e-5 \
      --max_steps 50 \
      --output_dir /tmp/imdb/ --overwrite_output_dir \
      --fp16


.. raw:: html

  </details>


We can launch the job with the following command to check the correctness of the YAML.

.. code-block:: bash

  sky launch -c dnn dnn.yaml


If there is any error, we can fix the YAML and launch the job again on the same cluster.

.. code-block:: bash

  # Cancel the latest job.
  sky cancel dnn -y
  # Run the job again on the same cluster.
  sky launch -c dnn dnn.yaml


Sometimes, it is more efficient to log into the cluster and debug the job interactively. We can do so by :ref:`SSHing into the cluster directly or using VSCode's remote SSH <dev-connect>`.

.. code-block:: bash

  # Log into the cluster.
  ssh dnn



After confirming the job is working correctly, we now start **adding additional fields** to make the job YAML more configurable.

1. Add Hyperparameters
~~~~~~~~~~~~~~~~~~~~~~

To launch many jobs with different hyperparameters, we turn the SkyPilot YAML into a template, by
adding :ref:`environment variables <env-vars>` as arguments for the job.

.. raw:: html

  <details>
  <summary>Updated SkyPilot YAML: <code>dnn-template.yaml</code></summary>

.. code-block:: yaml
  :emphasize-lines: 4-6,28-29

  # dnn-template.yaml
  name: huggingface

  envs:
    LR: 2e-5
    MAX_STEPS: 50

  resources:
    accelerators: V100:4

  setup: |
    set -e  # Exit if any command failed.
    git clone https://github.com/huggingface/transformers/ || true
    cd transformers
    pip install .
    cd examples/pytorch/text-classification
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

  run: |
    set -e  # Exit if any command failed.
    cd transformers/examples/pytorch/text-classification
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --dataset_name imdb \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate ${LR} \
      --max_steps ${MAX_STEPS} \
      --output_dir /tmp/imdb/ --overwrite_output_dir \
      --fp16

.. raw:: html

  </details>

We can now launch a job with different hyperparameters by specifying the envs.

.. code-block:: bash

  sky launch -c dnn dnn-template.yaml \
    --env LR=1e-5 \
    --env MAX_STEPS=100

Alternatively, we can store the envs in a dotenv file, e.g. ``configs/job1.env``, and launch the job with that file.

.. code-block:: bash

  # configs/job1.env
  LR=1e-5
  MAX_STEPS=100

.. code-block:: bash

  sky launch -c dnn dnn-template.yaml \
    --env-file configs/job1.env
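
To make the correspondence between the two launch styles concrete, the following Python sketch converts a dotenv file into the equivalent list of ``--env KEY=VALUE`` flags. The helper name ``env_file_to_flags`` is hypothetical and not part of SkyPilot, and the parsing assumes simple unquoted ``KEY=VALUE`` lines as in ``configs/job1.env``; SkyPilot's own dotenv handling may treat quoting and comments differently.

```python
from pathlib import Path


def env_file_to_flags(path):
    """Turn KEY=VALUE lines into the equivalent list of --env flags.

    Hypothetical helper for illustration only; not a SkyPilot API.
    """
    flags = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as "# configs/job1.env".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Expected KEY=VALUE, got: {line!r}")
        flags += ["--env", f"{key.strip()}={value.strip()}"]
    return flags
```

For the ``configs/job1.env`` file above, this yields the same ``--env LR=1e-5 --env MAX_STEPS=100`` arguments used in the earlier command.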



2. Track Job Output
~~~~~~~~~~~~~~~~~~~

When running many jobs, it is useful to track live outputs of each job. We recommend using WandB to track the outputs of all jobs.

.. raw:: html

  <details>
  <summary>SkyPilot YAML with WandB: <code>dnn-template.yaml</code></summary>

.. code-block:: yaml
  :emphasize-lines: 7-7,19-19,34-34

  # dnn-template.yaml
  name: huggingface

  envs:
    LR: 2e-5
    MAX_STEPS: 50
    WANDB_API_KEY: # Empty field means this field is required when launching the job.

  resources:
    accelerators: V100:4

  setup: |
    set -e  # Exit if any command failed.
    git clone https://github.com/huggingface/transformers/ || true
    cd transformers
    pip install .
    cd examples/pytorch/text-classification
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install wandb

  run: |
    set -e  # Exit if any command failed.
    cd transformers/examples/pytorch/text-classification
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --dataset_name imdb \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate ${LR} \
      --max_steps ${MAX_STEPS} \
      --output_dir /tmp/imdb/ --overwrite_output_dir \
      --fp16 \
      --report_to wandb

.. raw:: html

  </details>

We can now launch the job with the following command (``WANDB_API_KEY`` should exist in your local environment variables).

.. code-block:: bash

  sky launch -c dnn dnn-template.yaml \
    --env-file configs/job1.env \
    --env WANDB_API_KEY
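
Because ``--env WANDB_API_KEY`` relies on the variable being set locally, it can help to check for it before launching. A minimal Python sketch of such a pre-flight check follows; the helper name ``missing_env_vars`` is hypothetical and not part of SkyPilot.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration only; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

A launch script could call ``missing_env_vars(["WANDB_API_KEY"])`` and abort with a clear message if the returned list is non-empty.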



Scale up the Job
-----------------

With the above setup, we can now scale up a job to many in-parallel jobs by creating multiple config files and
submitting them with :ref:`SkyPilot managed jobs <managed-jobs>`.

We create a config file for each job in the ``configs`` directory.

.. code-block:: bash

  # configs/job1.env
  LR=1e-5
  MAX_STEPS=100

  # configs/job2.env
  LR=2e-5
  MAX_STEPS=200

  ...
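
Writing these files by hand gets tedious as the number of combinations grows. The sketch below generates one ``job<N>.env`` file per combination of a hyperparameter grid. It is an illustration under stated assumptions, not the script added in this PR: the function name ``generate_configs``, the ``job<N>.env`` naming, and the full Cartesian product over the grid are all choices made for this example.

```python
import itertools
from pathlib import Path


def generate_configs(grid, out_dir="configs"):
    """Write one job<N>.env file per combination of values in `grid`.

    `grid` maps an env var name to the list of values to sweep over.
    Hypothetical helper for illustration only; not a SkyPilot API.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = list(grid)
    paths = []
    # Cartesian product: every value of each variable with every other.
    combos = itertools.product(*(grid[name] for name in names))
    for index, combo in enumerate(combos, start=1):
        path = out / f"job{index}.env"
        lines = [f"{name}={value}" for name, value in zip(names, combo)]
        path.write_text("\n".join(lines) + "\n")
        paths.append(path)
    return paths
```

For example, ``generate_configs({"LR": ["1e-5", "2e-5"], "MAX_STEPS": [100, 200]})`` writes four files, ``configs/job1.env`` through ``configs/job4.env``, one per learning-rate and step-count pair.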

We can then submit all jobs by iterating over the config files.

.. code-block:: bash

  for config_file in configs/*.env; do
    job_name=$(basename ${config_file%.env})
    # -y means yes to all prompts.
    # -d means detach from the job's logging, so the next job can be submitted
    # without waiting for the previous job to finish.
    sky jobs launch -n dnn-$job_name -y -d dnn-template.yaml \
      --env-file $config_file \
      --env WANDB_API_KEY
  done

Review comment (Member) on the ``for config_file in configs/*.env; do`` line:

  One question in mind was "how to write out a config file for each param combo". Is it better to show a representative Python program that iterates over the params and then either (1) writes out the env files, or (2) directly calls the jobs launch Python API / CLI from within the Python script?

Reply (Collaborator Author):

  Added a Python script for generating the configs.
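
The same submission loop can be driven from Python by shelling out to the CLI. The sketch below only uses the ``sky jobs launch`` flags already shown in the bash loop; the function names ``build_launch_command`` and ``launch_all`` and the ``dry_run`` switch are hypothetical additions for this example, and it does not use SkyPilot's Python API.

```python
import subprocess
from pathlib import Path


def build_launch_command(config_file, template="dnn-template.yaml"):
    """Build the `sky jobs launch` argument list for one config file."""
    job_name = Path(config_file).stem  # configs/job1.env -> job1
    return [
        "sky", "jobs", "launch",
        "-n", f"dnn-{job_name}",
        "-y",  # Yes to all prompts.
        "-d",  # Detach so the next job can be submitted immediately.
        template,
        "--env-file", str(config_file),
        "--env", "WANDB_API_KEY",
    ]


def launch_all(config_dir="configs", dry_run=True):
    """Submit (or, with dry_run, only print) one managed job per .env file."""
    config_files = sorted(Path(config_dir).glob("*.env"))
    commands = [build_launch_command(path) for path in config_files]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a submission fails.
            subprocess.run(command, check=True)
    return commands
```

Calling ``launch_all()`` with the default ``dry_run=True`` prints the commands without submitting anything, which is a cheap way to review job names before running ``launch_all(dry_run=False)``.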


All job statuses can be checked with the following command.

.. code-block:: console

  $ sky jobs queue

  Fetching managed job statuses...
  Managed jobs
  In progress tasks: 3 RUNNING
  ID  TASK  NAME       RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
  10  -     dnn-job10  1x[V100:4]  5 mins ago  5m 5s          1m 12s        0            RUNNING
  9   -     dnn-job9   1x[V100:4]  6 mins ago  6m 11s         2m 23s        0            RUNNING
  8   -     dnn-job8   1x[V100:4]  7 mins ago  7m 15s         3m 31s        0            RUNNING
  ...
