From 50ad6ef716cd06279f5a2580e16a27646560551c Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Tue, 20 Aug 2024 04:25:01 +0000 Subject: [PATCH 01/47] Add docs for running N jobs --- docs/source/docs/index.rst | 3 +- docs/source/running-jobs/distributed-jobs.rst | 2 +- docs/source/running-jobs/many-jobs.rst | 240 ++++++++++++++++++ 3 files changed, 243 insertions(+), 2 deletions(-) create mode 100644 docs/source/running-jobs/many-jobs.rst diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index f8d329a1431..59d8f54cc77 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -129,8 +129,8 @@ Read the research: ../getting-started/installation ../getting-started/quickstart - ../getting-started/tutorial ../examples/interactive-development + ../getting-started/tutorial .. toctree:: @@ -143,6 +143,7 @@ Read the research: ../examples/auto-failover ../reference/kubernetes/index ../running-jobs/distributed-jobs + ../running-jobs/many-jobs .. toctree:: :hidden: diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 9eb590c10bc..9249e9faca0 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -1,6 +1,6 @@ .. _dist-jobs: -Distributed Jobs on Many Nodes +Guide: Distributed Jobs ================================================ SkyPilot supports multi-node cluster diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst new file mode 100644 index 00000000000..b84a28d3209 --- /dev/null +++ b/docs/source/running-jobs/many-jobs.rst @@ -0,0 +1,240 @@ + +.. _many-jobs: + +Guide: Run Many Jobs +==================== + +Hyperparameter tuning, data processing, etc, are common cases where many jobs are run in parallel. +However, managing all those jobs can be a pain. SkyPilot's allows us to run many jobs in parallel and manage them in +a single pane of glass. 
+ +We show a typical workflow for running many jobs with SkyPilot. + +Develop a YAML for One Job +----------------------------------- + +Before scaling up to many jobs, develeoping and debugging a SkyPilot YAML for a single job can save a lot of time by avoiding debugging many jobs at once. + +We use the same example of AI model training as in :ref:`Tutorial: DNN Training `. + +.. raw:: html + +
+ Click to expand: dnn.yaml + +.. code-block:: yaml + + # dnn.yaml + name: huggingface + + resources: + accelerators: V100:4 + + setup: | + set -e # Exit if any command failed. + git clone https://github.com/huggingface/transformers/ || true + cd transformers + pip install . + cd examples/pytorch/text-classification + pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 + + run: | + set -e # Exit if any command failed. + cd transformers/examples/pytorch/text-classification + python run_glue.py \ + --model_name_or_path bert-base-cased \ + --dataset_name imdb \ + --do_train \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 2e-5 \ + --max_steps 50 \ + --output_dir /tmp/imdb/ --overwrite_output_dir \ + --fp16 + + +.. raw:: html + +
+ + +We can launch the job with the following command to check the correctness of the YAML. + +.. code-block:: bash + + sky launch -c dnn dnn.yaml + + +If there is any error, we can fix the YAML and launch the job again on the same cluster. + +.. code-block:: bash + + # Cancel the latest job. + sky cancel dnn -y + # Run the job again on the same cluster. + sky launch -c dnn dnn.yaml + + +Sometimes, we may find it more efficient to interactively log into the cluster and debug the job. We can do so by directly :ref:`ssh into the cluster or use VSCode's remote ssh `. + +.. code-block:: bash + + # Log into the cluster. + ssh dnn + + + +After confirming the job is working correctly, we now start adding additional fields to make the job more scalable. + +1. Add Hyperparameters +~~~~~~~~~~~~~~~~~~~~~~ + +To launch many jobs with different hyperparameters, we turn the SkyPilot YAML into a template, by +adding :ref:`environment variables ` as arguments for the job. + +.. raw:: html + +
+ Updated SkyPilot YAML: dnn-template.yaml + +.. code-block:: yaml + :emphasize-lines: 4-6,28-29 + + # dnn-template.yaml + name: huggingface + + envs: + LR: 2e-5 + MAX_STEPS: 50 + + resources: + accelerators: V100:4 + + setup: | + set -e # Exit if any command failed. + git clone https://github.com/huggingface/transformers/ || true + cd transformers + pip install . + cd examples/pytorch/text-classification + pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 + + run: | + set -e # Exit if any command failed. + cd transformers/examples/pytorch/text-classification + python run_glue.py \ + --model_name_or_path bert-base-cased \ + --dataset_name imdb \ + --do_train \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate ${LR} \ + --max_steps ${MAX_STEPS} \ + --output_dir /tmp/imdb/ --overwrite_output_dir \ + --fp16 + +.. raw:: html + +
+ +We can now launch a job with different hyperparameters by specifying the envs. + +.. code-block:: bash + + sky launch -c dnn dnn-template.yaml \ + --env LR=1e-5 \ + --env MAX_STEPS=100 + +Or, you can store the envs in a dotenv file and launch the job with the file: ``configs/job1.env``. + +.. code-block:: bash + + # configs/job1.env + LR=1e-5 + MAX_STEPS=100 + +.. code-block:: bash + + sky launch -c dnn dnn-template.yaml \ + --env-file configs/job1.env + + + +2. Track Job Output +~~~~~~~~~~~~~~~~~~~ + +When running many jobs, it is useful to track live outputs of each job. We recommend using WandB to track the outputs of all jobs. + +.. raw:: html + +
+ SkyPilot YAML with WandB: dnn-template.yaml + +.. code-block:: yaml + :emphasize-lines: 7-7,33-33 + + # dnn-template.yaml + name: huggingface + + envs: + LR: 2e-5 + MAX_STEPS: 50 + WANDB_API_KEY: # Empty field means this field is required when launching the job. + + resources: + accelerators: V100:4 + + setup: | + set -e # Exit if any command failed. + git clone https://github.com/huggingface/transformers/ || true + cd transformers + pip install . + cd examples/pytorch/text-classification + pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 + + run: | + set -e # Exit if any command failed. + cd transformers/examples/pytorch/text-classification + python run_glue.py \ + --model_name_or_path bert-base-cased \ + --dataset_name imdb \ + --do_train \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate ${LR} \ + --max_steps ${MAX_STEPS} \ + --output_dir /tmp/imdb/ --overwrite_output_dir \ + --fp16 \ + --report_to wandb + +.. raw:: html + +
+ +We can now launch the job with the following command (``WANDB_API_KEY`` should existing in your local environment variables). + +.. code-block:: bash + + sky launch -c dnn dnn-template.yaml \ + --env-file configs/job1.env \ + --env WANDB_API_KEY + + + +Scale up the Job +----------------- + +With the above setup, we can now scale up a job to many in-parallel jobs by submitting them with SkyPilot managed jobs. + +.. code-block:: bash + + for config_file in configs/*.env; do + sky jobs launch -n dnn-$(basename $config_file) dnn-template.yaml \ + --env-file $config_file \ + --env WANDB_API_KEY + done + +We can check the status of all jobs with the following command. + +.. code-block:: bash + + sky jobs queue From 4254044c614af759717f5e7148bfaa8123c25792 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Tue, 20 Aug 2024 05:16:22 +0000 Subject: [PATCH 02/47] Fix language --- docs/source/running-jobs/many-jobs.rst | 56 ++++++++++++++++++++------ 1 file changed, 44 insertions(+), 12 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index b84a28d3209..59cafe8ce9a 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -5,8 +5,8 @@ Guide: Run Many Jobs ==================== Hyperparameter tuning, data processing, etc, are common cases where many jobs are run in parallel. -However, managing all those jobs can be a pain. SkyPilot's allows us to run many jobs in parallel and manage them in -a single pane of glass. +However, managing all those jobs can be a pain. SkyPilot's allows us to **run many jobs in parallel** and **manage them in +a single pane of glass**. We show a typical workflow for running many jobs with SkyPilot. @@ -84,7 +84,7 @@ Sometimes, we may find it more efficient to interactively log into the cluster a -After confirming the job is working correctly, we now start adding additional fields to make the job more scalable. 
+After confirming the job is working correctly, we now start **adding additional fields** to make the job YAML more configurable. 1. Add Hyperparameters ~~~~~~~~~~~~~~~~~~~~~~ @@ -170,7 +170,7 @@ When running many jobs, it is useful to track live outputs of each job. We recom SkyPilot YAML with WandB: dnn-template.yaml .. code-block:: yaml - :emphasize-lines: 7-7,33-33 + :emphasize-lines: 7-7,19-19,34-34 # dnn-template.yaml name: huggingface @@ -190,6 +190,7 @@ When running many jobs, it is useful to track live outputs of each job. We recom pip install . cd examples/pytorch/text-classification pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 + pip install wandb run: | set -e # Exit if any command failed. @@ -223,18 +224,49 @@ We can now launch the job with the following command (``WANDB_API_KEY`` should e Scale up the Job ----------------- -With the above setup, we can now scale up a job to many in-parallel jobs by submitting them with SkyPilot managed jobs. +With the above setup, we can now scale up a job to many in-parallel jobs by creating multiple config files and +submitting them with SkyPilot managed jobs. + +We create a config file for each job in the ``configs`` directory. .. code-block:: bash - for config_file in configs/*.env; do - sky jobs launch -n dnn-$(basename $config_file) dnn-template.yaml \ - --env-file $config_file \ - --env WANDB_API_KEY - done + # configs/job1.env + LR=1e-5 + MAX_STEPS=100 + + # configs/job2.env + LR=2e-5 + MAX_STEPS=200 + + ... -We can check the status of all jobs with the following command. +We can then submit all jobs by iterating over the config files. .. code-block:: bash - sky jobs queue + for config_file in configs/*.env; do + job_name=$(basename ${config_file%.env}) + # -y means yes to all prompts. + # -d means detach from the job's logging, so the next job can be submitted + # without waiting for the previous job to finish. 
+ sky jobs launch -n dnn-$job_name -y -d dnn-template.yaml \ + --env-file $config_file \ + --env WANDB_API_KEY + done + + +All job statuses can be checked with the following command. + +.. code-block:: console + + $ sky jobs queue + + Fetching managed job statuses... + Managed jobs + In progress tasks: 2 RUNNING, 1 STARTING + ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 4 - dnn-job3 1x[V100:4] 3 mins ago 3m 50s - 0 STARTING + 3 - dnn-job2 1x[V100:4] 4 mins ago 4m 55s 58s 0 RUNNING + 2 - dnn-job1 1x[V100:4] 6 mins ago 6m 2m 9s 0 RUNNING + From 19f294f25e72548c4956d5fc3f26a12193f187c4 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Tue, 20 Aug 2024 06:12:25 +0000 Subject: [PATCH 03/47] Fix job queue --- docs/source/running-jobs/many-jobs.rst | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 59cafe8ce9a..5779f721041 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -264,9 +264,10 @@ All job statuses can be checked with the following command. Fetching managed job statuses... Managed jobs - In progress tasks: 2 RUNNING, 1 STARTING - ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS - 4 - dnn-job3 1x[V100:4] 3 mins ago 3m 50s - 0 STARTING - 3 - dnn-job2 1x[V100:4] 4 mins ago 4m 55s 58s 0 RUNNING - 2 - dnn-job1 1x[V100:4] 6 mins ago 6m 2m 9s 0 RUNNING + In progress tasks: 3 RUNNING + ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 10 - dnn-job10 1x[V100:4] 5 mins ago 5m 5s 1m 12s 0 RUNNING + 9 - dnn-job9 1x[V100:4] 6 mins ago 6m 11s 2m 23s 0 RUNNING + 8 - dnn-job8 1x[V100:4] 7 mins ago 7m 15s 3m 31s 0 RUNNING + ... 
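Note on the submission loop added in PATCH 02/47: the same iteration can be sketched in Python, which makes it easy to preview the commands before submitting anything. This is a minimal sketch, not part of the patch. The helper name `build_launch_commands` is hypothetical; it uses only the flags that already appear in the shell loop (`-n`, `-y`, `-d`, `--env-file`, `--env`) and assumes the `configs/*.env` naming used at this point in the series.

```python
import shlex
from pathlib import Path


def build_launch_commands(config_dir, template="dnn-template.yaml"):
    """Return one `sky jobs launch` argv list per dotenv file in config_dir.

    Hypothetical helper for illustration; it only builds the commands and
    never runs them.
    """
    commands = []
    # Sort so that the submission order is deterministic.
    for config_file in sorted(Path(config_dir).glob("*.env")):
        # configs/job1.env -> dnn-job1, as in the shell loop.
        job_name = f"dnn-{config_file.stem}"
        commands.append([
            "sky", "jobs", "launch",
            "-n", job_name,
            "-y",  # yes to all prompts
            "-d",  # detach from the job's logging
            template,
            "--env-file", str(config_file),
            "--env", "WANDB_API_KEY",
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for argv in build_launch_commands("configs"):
        print(shlex.join(argv))
```

Printing the commands first is a cheap way to check job names and config paths; each argv list could then be passed to `subprocess.run` once the output looks right.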
From 64446e34b271fdb8b34e2a97aca71a1308cf0764 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Tue, 20 Aug 2024 06:45:19 +0000 Subject: [PATCH 04/47] Add link to managed jobs --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 5779f721041..12bca9934a0 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -225,7 +225,7 @@ Scale up the Job ----------------- With the above setup, we can now scale up a job to many in-parallel jobs by creating multiple config files and -submitting them with SkyPilot managed jobs. +submitting them with :ref:`SkyPilot managed jobs `. We create a config file for each job in the ``configs`` directory. From 3e80dabe4da128267247c3b0455ae9d84f4b9856 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:51:16 -0700 Subject: [PATCH 05/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 12bca9934a0..1d74d1f159b 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -4,9 +4,7 @@ Guide: Run Many Jobs ==================== -Hyperparameter tuning, data processing, etc, are common cases where many jobs are run in parallel. -However, managing all those jobs can be a pain. SkyPilot's allows us to **run many jobs in parallel** and **manage them in -a single pane of glass**. +SkyPilot allows you to easily **run many jobs in parallel** and manage them in a single system. This is useful for hyperparameter tuning sweeps, data processing, and other batch jobs. We show a typical workflow for running many jobs with SkyPilot. 
From c21084a3187a5f3add9abbbe911aa61e477e90d8 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:51:32 -0700 Subject: [PATCH 06/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 1d74d1f159b..55781fc1486 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -6,7 +6,7 @@ Guide: Run Many Jobs SkyPilot allows you to easily **run many jobs in parallel** and manage them in a single system. This is useful for hyperparameter tuning sweeps, data processing, and other batch jobs. -We show a typical workflow for running many jobs with SkyPilot. +This guide shows a typical workflow for running many jobs with SkyPilot. Develop a YAML for One Job ----------------------------------- From 2eeb7d34ab3275f7ecbf342d96efca1a25c80a3c Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:51:41 -0700 Subject: [PATCH 07/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 55781fc1486..b73d6fdb717 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -11,7 +11,7 @@ This guide shows a typical workflow for running many jobs with SkyPilot. Develop a YAML for One Job ----------------------------------- -Before scaling up to many jobs, develeoping and debugging a SkyPilot YAML for a single job can save a lot of time by avoiding debugging many jobs at once. +Before scaling up to many jobs, develop a SkyPilot YAML for a single job first and ensure it runs correctly. This can save time by avoiding debugging many jobs at once. 
We use the same example of AI model training as in :ref:`Tutorial: DNN Training `. From 772dbc38feb33f485faa103bd126f3d64a73ee64 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:51:47 -0700 Subject: [PATCH 08/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index b73d6fdb717..d02b4f86a5e 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -13,7 +13,7 @@ Develop a YAML for One Job Before scaling up to many jobs, develop a SkyPilot YAML for a single job first and ensure it runs correctly. This can save time by avoiding debugging many jobs at once. -We use the same example of AI model training as in :ref:`Tutorial: DNN Training `. +Here is the same example YAML as in :ref:`Tutorial: DNN Training `: .. raw:: html From 7dace8cdfd1951956a420c8e581f2aad4d678109 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:51:52 -0700 Subject: [PATCH 09/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index d02b4f86a5e..163b1f245c2 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -56,7 +56,7 @@ Here is the same example YAML as in :ref:`Tutorial: DNN Training ` -We can launch the job with the following command to check the correctness of the YAML. +First, launch the job to check it successfully launches and runs correctly: .. 
code-block:: bash From 8059c833fd0d91e5c82f49c3c67ed05a82dab5d1 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:00 -0700 Subject: [PATCH 10/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 163b1f245c2..16c9346a459 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -63,7 +63,7 @@ First, launch the job to check it successfully launches and runs correctly: sky launch -c dnn dnn.yaml -If there is any error, we can fix the YAML and launch the job again on the same cluster. +If there is any error, you can fix the code and/or the YAML, and launch the job again on the same cluster: .. code-block:: bash From d58011944590b586b3ab54bf0a60e10c210e2139 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:06 -0700 Subject: [PATCH 11/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 16c9346a459..0e68ab6150f 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -225,7 +225,7 @@ Scale up the Job With the above setup, we can now scale up a job to many in-parallel jobs by creating multiple config files and submitting them with :ref:`SkyPilot managed jobs `. -We create a config file for each job in the ``configs`` directory. +First, create a config file for each job (for example, in a ``configs`` directory): .. 
code-block:: bash From 013d436a1de6e39cd5d1cbbc94790418781f96b4 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:10 -0700 Subject: [PATCH 12/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 0e68ab6150f..13ef84d7d5b 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -239,7 +239,7 @@ First, create a config file for each job (for example, in a ``configs`` director ... -We can then submit all jobs by iterating over the config files. +Then, submit all jobs by iterating over the config files and calling `sky jobs launch` on each: .. code-block:: bash From 5e6b644ad005153bba91a72c42712f7fdf9a262f Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:16 -0700 Subject: [PATCH 13/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 13ef84d7d5b..091ae52c4ed 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -245,9 +245,9 @@ Then, submit all jobs by iterating over the config files and calling `sky jobs l for config_file in configs/*.env; do job_name=$(basename ${config_file%.env}) - # -y means yes to all prompts. - # -d means detach from the job's logging, so the next job can be submitted - # without waiting for the previous job to finish. + # -y: yes to all prompts. + # -d: detach from the job's logging, so the next job can be submitted + # without waiting for the previous job to finish. 
sky jobs launch -n dnn-$job_name -y -d dnn-template.yaml \ --env-file $config_file \ --env WANDB_API_KEY From 0144b9658923131e95df82286e99402fc0a03af1 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:23 -0700 Subject: [PATCH 14/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 091ae52c4ed..2859b989354 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -254,7 +254,7 @@ Then, submit all jobs by iterating over the config files and calling `sky jobs l done -All job statuses can be checked with the following command. +Job statuses can be checked via `sky jobs queue`: .. code-block:: console From 23c4674f3ab335843eaa9f5b54ab59a71bdaec7a Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:33 -0700 Subject: [PATCH 15/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 2859b989354..92b71afa0c6 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -73,7 +73,7 @@ If there is any error, you can fix the code and/or the YAML, and launch the job sky launch -c dnn dnn.yaml -Sometimes, we may find it more efficient to interactively log into the cluster and debug the job. We can do so by directly :ref:`ssh into the cluster or use VSCode's remote ssh `. +Sometimes, it may be more efficient to log into the cluster and interactively debug the job. You can do so by directly :ref:`ssh'ing into the cluster or using VSCode's remote ssh `. .. 
code-block:: bash From 3c5b8acb32bca742bf6ad267630ab9961d63035b Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:43 -0700 Subject: [PATCH 16/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 92b71afa0c6..025be7febf8 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -82,7 +82,7 @@ Sometimes, it may be more efficient to log into the cluster and interactively de -After confirming the job is working correctly, we now start **adding additional fields** to make the job YAML more configurable. +Next, after confirming the job is working correctly, **add (hyper)parameters** to the job YAML so that all job variants can be specified. 1. Add Hyperparameters ~~~~~~~~~~~~~~~~~~~~~~ From 71ab53fc3cb75624d8276cb7196f6268d004b73b Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:52:51 -0700 Subject: [PATCH 17/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 025be7febf8..a314c270912 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -87,8 +87,7 @@ Next, after confirming the job is working correctly, **add (hyper)parameters** t 1. Add Hyperparameters ~~~~~~~~~~~~~~~~~~~~~~ -To launch many jobs with different hyperparameters, we turn the SkyPilot YAML into a template, by -adding :ref:`environment variables ` as arguments for the job. 
+To launch jobs with different hyperparameters, add them as :ref:`environment variables ` to the SkyPilot YAML, and make your main program read these environment variables: .. raw:: html From d01d5ad3b02f6ec75a27df6fa6d2484f105921bf Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:54:18 -0700 Subject: [PATCH 18/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index a314c270912..7f6eda5caf7 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -133,7 +133,7 @@ To launch jobs with different hyperparameters, add them as :ref:`environment var -We can now launch a job with different hyperparameters by specifying the envs. +You can now use `--env` to launch a job with different hyperparameters: .. code-block:: bash From ab5110e7ea6d57d6d246f089ab7c716534693334 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:54:25 -0700 Subject: [PATCH 19/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 7f6eda5caf7..eabac868553 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -141,7 +141,7 @@ You can now use `--env` to launch a job with different hyperparameters: --env LR=1e-5 \ --env MAX_STEPS=100 -Or, you can store the envs in a dotenv file and launch the job with the file: ``configs/job1.env``. +Alternative, store the environment variable values in a dotenv file and use `--env-file` to launch: .. 
code-block:: bash From 2de0066b680c8e13ccefd4625f8539503685ebf4 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:54:35 -0700 Subject: [PATCH 20/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index eabac868553..2f83a96b350 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -159,7 +159,7 @@ Alternative, store the environment variable values in a dotenv file and use `--e 2. Track Job Output ~~~~~~~~~~~~~~~~~~~ -When running many jobs, it is useful to track live outputs of each job. We recommend using WandB to track the outputs of all jobs. +When running many jobs, it is useful to log the outputs of all jobs. You can use tools like W&B (TODO link) for this purpose: .. raw:: html From ee7218e4163c6563889dba687871252b0f4a1d85 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:54:52 -0700 Subject: [PATCH 21/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 2f83a96b350..15521eeda77 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -208,7 +208,7 @@ When running many jobs, it is useful to log the outputs of all jobs. You can use -We can now launch the job with the following command (``WANDB_API_KEY`` should existing in your local environment variables). +You can now launch the job with the following command (``WANDB_API_KEY`` should existing in your local environment variables). .. 
code-block:: bash From b096d90651a5ea9d0f551ff16d6587c9c87c6ea0 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 11:55:08 -0700 Subject: [PATCH 22/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 15521eeda77..cd3d817d039 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -221,7 +221,7 @@ You can now launch the job with the following command (``WANDB_API_KEY`` should Scale up the Job ----------------- -With the above setup, we can now scale up a job to many in-parallel jobs by creating multiple config files and +With the above setup, you can now scale out to run many jobs in parallel by (1) creating multiple config files and (2) submitting them with :ref:`SkyPilot managed jobs `. First, create a config file for each job (for example, in a ``configs`` directory): From 4a7053446ee41b3b0dd137544dbf98732d0b0b51 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 19:19:50 +0000 Subject: [PATCH 23/47] Add script for generating config files --- docs/source/running-jobs/many-jobs.rst | 56 +++++++++++++++++++------- 1 file changed, 42 insertions(+), 14 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index cd3d817d039..9e61898afc7 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -145,26 +145,26 @@ Alternative, store the environment variable values in a dotenv file and use `--e .. code-block:: bash - # configs/job1.env + # configs/job1 LR=1e-5 MAX_STEPS=100 .. code-block:: bash sky launch -c dnn dnn-template.yaml \ - --env-file configs/job1.env + --env-file configs/job1 -2. Track Job Output -~~~~~~~~~~~~~~~~~~~ +2. 
Logging Job Outputs +~~~~~~~~~~~~~~~~~~~~~~~ -When running many jobs, it is useful to log the outputs of all jobs. You can use tools like W&B (TODO link) for this purpose: +When running many jobs, it is useful to log the outputs of all jobs. You can use tools like [W&B](https://wandb.ai/) for this purpose: .. raw:: html
- SkyPilot YAML with WandB: dnn-template.yaml + SkyPilot YAML with W&B: dnn-template.yaml .. code-block:: yaml :emphasize-lines: 7-7,19-19,34-34 @@ -213,13 +213,13 @@ You can now launch the job with the following command (``WANDB_API_KEY`` should .. code-block:: bash sky launch -c dnn dnn-template.yaml \ - --env-file configs/job1.env \ + --env-file configs/job1 \ --env WANDB_API_KEY -Scale up the Job ------------------ +Scale Out to Many Jobs +--------------------- With the above setup, you can now scale out to run many jobs in parallel by (1) creating multiple config files and (2) submitting them with :ref:`SkyPilot managed jobs `. @@ -228,22 +228,50 @@ First, create a config file for each job (for example, in a ``configs`` director .. code-block:: bash - # configs/job1.env + # configs/job1 LR=1e-5 MAX_STEPS=100 - # configs/job2.env + # configs/job2 LR=2e-5 MAX_STEPS=200 ... +.. code-block:: html + +
+ An example Python script to generate config files + +.. code-block:: python + + import os + + CONFIG_PATH = 'configs' + LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0] + MAX_STEPS_CANDIDATES = [100, 300, 1000] + + os.makedirs(CONFIG_PATH, exist_ok=True) + + job_idx = 1 + for lr in LR_CANDIDATES: + for max_steps in MAX_STEPS_CANDIDATES: + config_file = f"{CONFIG_PATH}/job{job_idx}" + with open(config_file, "w") as f: + print(f'LR={lr}', file=f) + print(f'MAX_STEPS={max_steps}', file=f) + job_idx += 1 + +.. raw:: html + +
+ Then, submit all jobs by iterating over the config files and calling `sky jobs launch` on each: .. code-block:: bash - for config_file in configs/*.env; do - job_name=$(basename ${config_file%.env}) + for config_file in configs/*; do + job_name=$(basename $config_file) # -y: yes to all prompts. # -d: detach from the job's logging, so the next job can be submitted # without waiting for the previous job to finish. @@ -261,7 +289,7 @@ Job statuses can be checked via `sky jobs queue`: Fetching managed job statuses... Managed jobs - In progress tasks: 3 RUNNING + In progress tasks: 10 RUNNING ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS 10 - dnn-job10 1x[V100:4] 5 mins ago 5m 5s 1m 12s 0 RUNNING 9 - dnn-job9 1x[V100:4] 6 mins ago 6m 11s 2m 23s 0 RUNNING From 4f90f2485d3698843b4bbaed40ad52b016fc5485 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 21:39:06 +0000 Subject: [PATCH 24/47] Fix comments --- docs/source/getting-started/quickstart.rst | 2 +- docs/source/getting-started/tutorial.rst | 4 +- docs/source/reference/job-queue.rst | 2 +- docs/source/running-jobs/many-jobs.rst | 73 ++++++++++++++++------ 4 files changed, 57 insertions(+), 24 deletions(-) diff --git a/docs/source/getting-started/quickstart.rst b/docs/source/getting-started/quickstart.rst index bfc6fd17e05..cdef2335dd7 100644 --- a/docs/source/getting-started/quickstart.rst +++ b/docs/source/getting-started/quickstart.rst @@ -219,7 +219,7 @@ Congratulations! In this quickstart, you have launched a cluster, run a task, a Next steps: -- Adapt :ref:`Tutorial: DNN Training <dnn-training>` to start running your own project on SkyPilot! +- Adapt :ref:`Tutorial: AI Training <ai-training>` to start running your own project on SkyPilot! 
- See the :ref:`Task YAML reference `, :ref:`CLI reference `, and `more examples `_ - To learn more, try out `SkyPilot Tutorials `_ in Jupyter notebooks diff --git a/docs/source/getting-started/tutorial.rst b/docs/source/getting-started/tutorial.rst index 92ef6f68b2f..175f1391a6d 100644 --- a/docs/source/getting-started/tutorial.rst +++ b/docs/source/getting-started/tutorial.rst @@ -1,6 +1,6 @@ -.. _dnn-training: +.. _ai-training: -Tutorial: DNN Training +Tutorial: AI Training ====================== This example uses SkyPilot to train a Transformer-based language model from HuggingFace. diff --git a/docs/source/reference/job-queue.rst b/docs/source/reference/job-queue.rst index 6397c7bbbb6..c0016c4d6da 100644 --- a/docs/source/reference/job-queue.rst +++ b/docs/source/reference/job-queue.rst @@ -160,7 +160,7 @@ SkyPilot's scheduler serves two goals: 2. **Minimizing resource idleness**: If a resource is idle, SkyPilot will schedule a queued job that can utilize that resource. -We illustrate the scheduling behavior by revisiting :ref:`Tutorial: DNN Training <dnn-training>`. +We illustrate the scheduling behavior by revisiting :ref:`Tutorial: AI Training <ai-training>`. In that tutorial, we have a task YAML that specifies these resource requirements: .. code-block:: yaml diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 9e61898afc7..1c198e9f8d5 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -13,16 +13,16 @@ Develop a YAML for One Job Before scaling up to many jobs, develop a SkyPilot YAML for a single job first and ensure it runs correctly. This can save time by avoiding debugging many jobs at once. -Here is the same example YAML as in :ref:`Tutorial: DNN Training <dnn-training>`: +Here is the same example YAML as in :ref:`Tutorial: AI Training <ai-training>`: .. 
raw:: html 
- Click to expand: dnn.yaml + Click to expand: train.yaml .. code-block:: yaml - # dnn.yaml + # train.yaml name: huggingface resources: @@ -60,7 +60,7 @@ First, launch the job to check it successfully launches and runs correctly: .. code-block:: bash - sky launch -c dnn dnn.yaml + sky launch -c train train.yaml If there is any error, you can fix the code and/or the YAML, and launch the job again on the same cluster: @@ -68,9 +68,9 @@ If there is any error, you can fix the code and/or the YAML, and launch the job .. code-block:: bash # Cancel the latest job. - sky cancel dnn -y + sky cancel train -y # Run the job again on the same cluster. - sky launch -c dnn dnn.yaml + sky launch -c train train.yaml Sometimes, it may be more efficient to log into the cluster and interactively debug the job. You can do so by directly :ref:`ssh'ing into the cluster or using VSCode's remote ssh `. @@ -78,7 +78,7 @@ Sometimes, it may be more efficient to log into the cluster and interactively de .. code-block:: bash # Log into the cluster. - ssh dnn + ssh train @@ -92,12 +92,12 @@ To launch jobs with different hyperparameters, add them as :ref:`environment var .. raw:: html
- Updated SkyPilot YAML: dnn-template.yaml + Updated SkyPilot YAML: train-template.yaml .. code-block:: yaml :emphasize-lines: 4-6,28-29 - # dnn-template.yaml + # train-template.yaml name: huggingface envs: @@ -137,7 +137,7 @@ You can now use `--env` to launch a job with different hyperparameters: .. code-block:: bash - sky launch -c dnn dnn-template.yaml \ + sky launch -c train train-template.yaml \ --env LR=1e-5 \ --env MAX_STEPS=100 @@ -151,7 +151,7 @@ Alternative, store the environment variable values in a dotenv file and use `--e .. code-block:: bash - sky launch -c dnn dnn-template.yaml \ + sky launch -c train train-template.yaml \ --env-file configs/job1 @@ -164,12 +164,12 @@ When running many jobs, it is useful to log the outputs of all jobs. You can use .. raw:: html
- SkyPilot YAML with W&B: dnn-template.yaml + SkyPilot YAML with W&B: train-template.yaml .. code-block:: yaml :emphasize-lines: 7-7,19-19,34-34 - # dnn-template.yaml + # train-template.yaml name: huggingface envs: @@ -212,7 +212,7 @@ You can now launch the job with the following command (``WANDB_API_KEY`` should .. code-block:: bash - sky launch -c dnn dnn-template.yaml \ + sky launch -c train train-template.yaml \ --env-file configs/job1 \ --env WANDB_API_KEY @@ -221,7 +221,13 @@ You can now launch the job with the following command (``WANDB_API_KEY`` should Scale Out to Many Jobs --------------------- -With the above setup, you can now scale out to run many jobs in parallel by (1) creating multiple config files and (2) +With the above setup, you can now scale out to run many jobs in parallel. You +can either use SkyPilot CLI with many config files or use SkyPilot Python API. + +1. With Config Files +~~~~~~~~~~~~~~~~~~~~ + +You can run many jobs in parallel by (1) creating multiple config files and (2) submitting them with :ref:`SkyPilot managed jobs `. First, create a config file for each job (for example, in a ``configs`` directory): @@ -275,7 +281,7 @@ Then, submit all jobs by iterating over the config files and calling `sky jobs l # -y: yes to all prompts. # -d: detach from the job's logging, so the next job can be submitted # without waiting for the previous job to finish. - sky jobs launch -n dnn-$job_name -y -d dnn-template.yaml \ + sky jobs launch -n train-$job_name -y -d train-template.yaml \ --env-file $config_file \ --env WANDB_API_KEY done @@ -290,9 +296,36 @@ Job statuses can be checked via `sky jobs queue`: Fetching managed job statuses... Managed jobs In progress tasks: 10 RUNNING - ID TASK NAME RESOURCES SUBMITTED TOT. 
DURATION JOB DURATION #RECOVERIES STATUS - 10 - dnn-job10 1x[V100:4] 5 mins ago 5m 5s 1m 12s 0 RUNNING - 9 - dnn-job9 1x[V100:4] 6 mins ago 6m 11s 2m 23s 0 RUNNING - 8 - dnn-job8 1x[V100:4] 7 mins ago 7m 15s 3m 31s 0 RUNNING + ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 10 - train-job10 1x[V100:4] 5 mins ago 5m 5s 1m 12s 0 RUNNING + 9 - train-job9 1x[V100:4] 6 mins ago 6m 11s 2m 23s 0 RUNNING + 8 - train-job8 1x[V100:4] 7 mins ago 7m 15s 3m 31s 0 RUNNING ... + +With Python API +~~~~~~~~~~~~~~~ + +To have more customized control over the jobs, you can also use SkyPilot Python +API to launch the jobs. + +.. code-block:: python + + import os + import sky + + LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0] + MAX_STEPS_CANDIDATES = [100, 300, 1000] + task = sky.Task.from_yaml('train-template.yaml') + + job_idx = 1 + for lr in LR_CANDIDATES: + for max_steps in MAX_STEPS_CANDIDATES: + task.update_envs({'LR': lr, 'MAX_STEPS': max_steps}) + sky.jobs.launch( + task, + name=f'train-job{job_idx}', + detach_run=True, + retry_until_up=True, + ) + job_idx += 1 From 1af226df7390b2632b6a4829a45925bcb671519d Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 23:08:20 +0000 Subject: [PATCH 25/47] fix title --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 1c198e9f8d5..26b752a1bf1 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -219,7 +219,7 @@ You can now launch the job with the following command (``WANDB_API_KEY`` should Scale Out to Many Jobs ---------------------- +----------------------- With the above setup, you can now scale out to run many jobs in parallel. You can either use SkyPilot CLI with many config files or use SkyPilot Python API. 
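Note on the config generator and the Python API loop above: both walk the ``LR`` x ``MAX_STEPS`` grid with nested ``for`` loops. A minimal sketch of the same enumeration using ``itertools.product`` follows. It assumes the ``configs/job<N>`` naming used at this point in the series; the ``write_configs`` helper and its ``prefix`` parameter are hypothetical names for illustration and are not part of the patch or of SkyPilot.

```python
import itertools
import os
import tempfile


def write_configs(config_dir, grid, prefix='job'):
    """Write one dotenv-style file per combination of values in `grid`.

    `grid` maps an environment variable name to a list of candidate values.
    Returns the list of file paths written, in job order.
    (Hypothetical helper, not part of SkyPilot.)
    """
    os.makedirs(config_dir, exist_ok=True)
    names = list(grid)
    paths = []
    # product() varies the last variable fastest, which matches the
    # nested-loop order of the original script (LR outer, MAX_STEPS inner).
    combos = itertools.product(*(grid[name] for name in names))
    for job_idx, values in enumerate(combos, start=1):
        path = os.path.join(config_dir, f'{prefix}{job_idx}')
        with open(path, 'w') as f:
            for name, value in zip(names, values):
                f.write(f'{name}={value}\n')
        paths.append(path)
    return paths


if __name__ == '__main__':
    grid = {
        'LR': [0.01, 0.03, 0.1, 0.3, 1.0],
        'MAX_STEPS': [100, 300, 1000],
    }
    # A temporary directory keeps the sketch free of side effects;
    # the patch itself writes to a `configs` directory.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_configs(tmp, grid)
        print(len(written))  # 5 x 3 = 15 files
```

Each written file can then be passed to ``--env-file`` in the same way as in the bash loop of the patch. Adding a third hyperparameter only needs a new key in ``grid``, not another loop level.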
From 2837cad99c1ba1f08ab3c3dd839818f3fc1de323 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 00:58:45 +0000 Subject: [PATCH 26/47] fix title --- docs/source/running-jobs/many-jobs.rst | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 26b752a1bf1..dffaa33b88c 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -8,6 +8,12 @@ SkyPilot allows you to easily **run many jobs in parallel** and manage them in a This guide shows a typical workflow for running many jobs with SkyPilot. + +.. image:: https://i.imgur.com/nIoOApY.png + +.. TODO: Show the components in a GIF. + + Develop a YAML for One Job ----------------------------------- @@ -224,7 +230,7 @@ Scale Out to Many Jobs With the above setup, you can now scale out to run many jobs in parallel. You can either use SkyPilot CLI with many config files or use SkyPilot Python API. -1. With Config Files +With Config Files ~~~~~~~~~~~~~~~~~~~~ You can run many jobs in parallel by (1) creating multiple config files and (2) From e31d814243763247957da07b568c75e55e08bcc2 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 01:00:11 +0000 Subject: [PATCH 27/47] fix --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index dffaa33b88c..de2ef70fd46 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -250,7 +250,7 @@ First, create a config file for each job (for example, in a ``configs`` director ... -.. code-block:: html +.. raw:: html
An example Python script to generate config files From 86f72928f0187afe419235023d5420ff0772ce30 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 01:02:58 +0000 Subject: [PATCH 28/47] reduce image size --- docs/source/running-jobs/many-jobs.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index de2ef70fd46..15a8c29d8e0 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -9,8 +9,9 @@ SkyPilot allows you to easily **run many jobs in parallel** and manage them in a This guide shows a typical workflow for running many jobs with SkyPilot. -.. image:: https://i.imgur.com/nIoOApY.png - +.. image:: https://i.imgur.com/VX2YKwR.png + :width: 80% + :align: center .. TODO: Show the components in a GIF. From acb5690c524dbbe07f4cbba57f869e8df8df852e Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 01:10:24 +0000 Subject: [PATCH 29/47] restructure --- docs/source/_static/custom.js | 1 + docs/source/docs/index.rst | 6 ++++++ docs/source/running-jobs/distributed-jobs.rst | 2 +- docs/source/running-jobs/many-jobs.rst | 4 ++-- 4 files changed, 10 insertions(+), 3 deletions(-) diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js index e0de1b50d51..91f61d72e26 100644 --- a/docs/source/_static/custom.js +++ b/docs/source/_static/custom.js @@ -29,6 +29,7 @@ document.addEventListener('DOMContentLoaded', () => { { selector: '.toctree-l1 > a', text: 'Managed Jobs' }, { selector: '.toctree-l1 > a', text: 'Running on Kubernetes' }, { selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' }, + { selector: '.caption-text', text: 'Guides' }, ]; newItems.forEach(({ selector, text }) => { document.querySelectorAll(selector).forEach((el) => { diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index 59d8f54cc77..de0f4613b9e 100644 --- a/docs/source/docs/index.rst +++ 
b/docs/source/docs/index.rst @@ -141,6 +141,12 @@ Read the research: ../examples/managed-jobs ../reference/job-queue ../examples/auto-failover + +.. toctree:: + :hidden: + :maxdepth: 1 + :caption: Guides + ../reference/kubernetes/index ../running-jobs/distributed-jobs ../running-jobs/many-jobs diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 9249e9faca0..fcf6b11b680 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -1,6 +1,6 @@ .. _dist-jobs: -Guide: Distributed Jobs +Distributed Jobs on Multiple Nodes ================================================ SkyPilot supports multi-node cluster diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 15a8c29d8e0..2df86cd12b2 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -1,8 +1,8 @@ .. _many-jobs: -Guide: Run Many Jobs -==================== +Run Many Parallel Jobs +====================== SkyPilot allows you to easily **run many jobs in parallel** and manage them in a single system. This is useful for hyperparameter tuning sweeps, data processing, and other batch jobs. From a720ab2708b83bb1217e76dea0721ef6c27e6296 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 01:11:01 +0000 Subject: [PATCH 30/47] rename --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 2df86cd12b2..297d646fab4 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -1,7 +1,7 @@ .. _many-jobs: -Run Many Parallel Jobs +Many Parallel Jobs ====================== SkyPilot allows you to easily **run many jobs in parallel** and manage them in a single system. 
This is useful for hyperparameter tuning sweeps, data processing, and other batch jobs. From 30b558f631d30cb1bf242cae5218e52ea73396a5 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 02:36:34 +0000 Subject: [PATCH 31/47] adopt comments --- docs/source/_static/custom.js | 2 +- docs/source/docs/index.rst | 6 ------ docs/source/running-jobs/distributed-jobs.rst | 2 +- docs/source/running-jobs/many-jobs.rst | 6 +++--- 4 files changed, 5 insertions(+), 11 deletions(-) diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js index 91f61d72e26..a15f0cce666 100644 --- a/docs/source/_static/custom.js +++ b/docs/source/_static/custom.js @@ -29,7 +29,7 @@ document.addEventListener('DOMContentLoaded', () => { { selector: '.toctree-l1 > a', text: 'Managed Jobs' }, { selector: '.toctree-l1 > a', text: 'Running on Kubernetes' }, { selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' }, - { selector: '.caption-text', text: 'Guides' }, + { selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' }, ]; newItems.forEach(({ selector, text }) => { document.querySelectorAll(selector).forEach((el) => { diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index de0f4613b9e..59d8f54cc77 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -141,12 +141,6 @@ Read the research: ../examples/managed-jobs ../reference/job-queue ../examples/auto-failover - -.. toctree:: - :hidden: - :maxdepth: 1 - :caption: Guides - ../reference/kubernetes/index ../running-jobs/distributed-jobs ../running-jobs/many-jobs diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index fcf6b11b680..9eb590c10bc 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -1,6 +1,6 @@ .. 
_dist-jobs: -Distributed Jobs on Multiple Nodes +Distributed Jobs on Many Nodes ================================================ SkyPilot supports multi-node cluster diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 297d646fab4..7c65cc9fa7e 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -9,8 +9,8 @@ SkyPilot allows you to easily **run many jobs in parallel** and manage them in a This guide shows a typical workflow for running many jobs with SkyPilot. -.. image:: https://i.imgur.com/VX2YKwR.png - :width: 80% +.. image:: https://i.imgur.com/x1aFtVn.png + :width: 90% :align: center .. TODO: Show the components in a GIF. @@ -166,7 +166,7 @@ Alternative, store the environment variable values in a dotenv file and use `--e 2. Logging Job Outputs ~~~~~~~~~~~~~~~~~~~~~~~ -When running many jobs, it is useful to log the outputs of all jobs. You can use tools like [W&B](https://wandb.ai/) for this purpose: +When running many jobs, it is useful to log the outputs of all jobs. You can use tools like `W&B <https://wandb.ai/>`__ for this purpose: .. raw:: html From 512627ca8d51a7f6148ef410a93d964ed581cc56 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 03:40:29 +0000 Subject: [PATCH 32/47] Add benefits --- docs/source/running-jobs/many-jobs.rst | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 7c65cc9fa7e..9d825d2607b 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -14,6 +14,13 @@ This guide shows a typical workflow for running many jobs with SkyPilot. :align: center .. TODO: Show the components in a GIF. +Reason for using SkyPilot to run many jobs: + +- **Seamless**: Utilize resources on any of your infrastructure (Kubernetes, cloud VMs, reservations, etc.) 
+- **Observable**: See and manage all jobs in a single pane of glass +- **Elastic**: Scale up and down based on demands +- **Cost-effective**: Only pay for the cheapest resources +- **Robust**: Automatically recover jobs from failures Develop a YAML for One Job ----------------------------------- From eb50f26448e485488c6b8ec43a90e7bd6ce2224d Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:03:01 -0700 Subject: [PATCH 33/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 9d825d2607b..ff34dcb252b 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -238,7 +238,7 @@ Scale Out to Many Jobs With the above setup, you can now scale out to run many jobs in parallel. You can either use SkyPilot CLI with many config files or use SkyPilot Python API. 
-With Config Files +With CLI and Config Files ~~~~~~~~~~~~~~~~~~~~ You can run many jobs in parallel by (1) creating multiple config files and (2) From f8f3eb7bc2e2b41ae19f9c3e661ac29d4c5d417c Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:03:11 -0700 Subject: [PATCH 34/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index ff34dcb252b..69cfa7bc744 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -22,10 +22,10 @@ Reason for using SkyPilot to run many jobs: - **Cost-effective**: Only pay for the cheapest resources - **Robust**: Automatically recover jobs from failures -Develop a YAML for One Job +Write a YAML for One Job ----------------------------------- -Before scaling up to many jobs, develop a SkyPilot YAML for a single job first and ensure it runs correctly. This can save time by avoiding debugging many jobs at once. +Before scaling up to many jobs, write a SkyPilot YAML for a single job first and ensure it runs correctly. This can save time by avoiding debugging many jobs at once. 
Here is the same example YAML as in :ref:`Tutorial: AI Training <ai-training>`: From 48d6f800d4fbef0ce65451f9ea90e9d783d66c15 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:04:08 -0700 Subject: [PATCH 35/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 69cfa7bc744..803d0eed1f4 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -14,13 +14,15 @@ This guide shows a typical workflow for running many jobs with SkyPilot. :align: center .. TODO: Show the components in a GIF. -Reason for using SkyPilot to run many jobs: -- **Seamless**: Utilize resources on any of your infrastructure (Kubernetes, cloud VMs, reservations, etc.) -- **Observable**: See and manage all jobs in a single pane of glass -- **Elastic**: Scale up and down based on demands -- **Cost-effective**: Only pay for the cheapest resources -- **Robust**: Automatically recover jobs from failures +Why Use SkyPilot to Run Many Jobs +------------------------------------- + +- **Unified**: Utilize resources on any of your infrastructure (Kubernetes, cloud VMs, reservations, etc.). +- **Elastic**: Scale up and down based on demands. +- **Cost-effective**: Only pay for the cheapest resources. +- **Robust**: Automatically recover jobs from failures. +- **Observable**: See and manage all jobs in a single pane of glass. 
Write a YAML for One Job ----------------------------------- From d16bab679cce72bae82f06d6b5ec914f5f4013f0 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:33:13 -0700 Subject: [PATCH 36/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 803d0eed1f4..46e34b7324a 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -244,7 +244,7 @@ With CLI and Config Files ~~~~~~~~~~~~~~~~~~~~ You can run many jobs in parallel by (1) creating multiple config files and (2) -submitting them with :ref:`SkyPilot managed jobs `. +submitting them as :ref:`SkyPilot managed jobs `. First, create a config file for each job (for example, in a ``configs`` directory): From e13f72e199f72f63eff485dbbda0d7f68656ca3d Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:33:44 -0700 Subject: [PATCH 37/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 46e34b7324a..aad22d0912a 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -288,7 +288,7 @@ First, create a config file for each job (for example, in a ``configs`` director
-Then, submit all jobs by iterating over the config files and calling `sky jobs launch` on each: +Then, submit all jobs by iterating over the config files and calling ``sky jobs launch`` on each: .. code-block:: bash From ce8730b3db3b498a8ae49305a2c409d1cd90b648 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:33:49 -0700 Subject: [PATCH 38/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index aad22d0912a..07fa55d6083 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -303,7 +303,7 @@ Then, submit all jobs by iterating over the config files and calling ``sky jobs done -Job statuses can be checked via `sky jobs queue`: +Job statuses can be checked via ``sky jobs queue``: .. code-block:: console From 8ba7a88b0b32543e552d3ccb13082631a416d7da Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:34:02 -0700 Subject: [PATCH 39/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 07fa55d6083..63053197d51 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -322,8 +322,7 @@ Job statuses can be checked via ``sky jobs queue``: With Python API ~~~~~~~~~~~~~~~ -To have more customized control over the jobs, you can also use SkyPilot Python -API to launch the jobs. +To have more customized control over generation of job variants, you can also use SkyPilot Python API to launch the jobs. .. 
code-block:: python From 5afc7523ee5f7f9b57a6ddfd51ab13b0d7f4dc06 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:34:48 -0700 Subject: [PATCH 40/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 63053197d51..f64f06c6977 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -149,7 +149,7 @@ To launch jobs with different hyperparameters, add them as :ref:`environment var
-You can now use `--env` to launch a job with different hyperparameters: +You can now use ``--env`` to launch a job with different hyperparameters: .. code-block:: bash From 516fdb127041d2549066ad700bf00a36123ccac5 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Wed, 28 Aug 2024 22:34:58 -0700 Subject: [PATCH 41/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index f64f06c6977..9586c1304cd 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -157,7 +157,7 @@ You can now use ``--env`` to launch a job with different hyperparameters: --env LR=1e-5 \ --env MAX_STEPS=100 -Alternative, store the environment variable values in a dotenv file and use `--env-file` to launch: +Alternatively, store the environment variable values in a dotenv file and use ``--env-file`` to launch: .. code-block:: bash From 39ca2390c9766abba20b7e7fd9dcb14826e29d08 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 05:41:26 +0000 Subject: [PATCH 42/47] update --- docs/source/running-jobs/many-jobs.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 9d825d2607b..8154f05f1ff 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -9,7 +9,7 @@ SkyPilot allows you to easily **run many jobs in parallel** and manage them in a This guide shows a typical workflow for running many jobs with SkyPilot. -.. image:: https://i.imgur.com/x1aFtVn.png +.. image:: https://i.imgur.com/tvxeNyR.png :width: 90% :align: center .. TODO: Show the components in a GIF. @@ -248,11 +248,11 @@ First, create a config file for each job (for example, in a ``configs`` director .. 
code-block:: bash - # configs/job1 + # configs/job-1 LR=1e-5 MAX_STEPS=100 - # configs/job2 + # configs/job-2 LR=2e-5 MAX_STEPS=200 @@ -276,8 +276,8 @@ First, create a config file for each job (for example, in a ``configs`` director job_idx = 1 for lr in LR_CANDIDATES: for max_steps in MAX_STEPS_CANDIDATES: - config_file = f"{CONFIG_PATH}/job{job_idx}" - with open(config_file, "w") as f: + config_file = f'{CONFIG_PATH}/job-{job_idx}' + with open(config_file, 'w') as f: print(f'LR={lr}', file=f) print(f'MAX_STEPS={max_steps}', file=f) job_idx += 1 From 2440d90d2488719e38cb1b4e3d4e72f9a05dc8ca Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 05:43:18 +0000 Subject: [PATCH 43/47] rename --- docs/source/running-jobs/distributed-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 9eb590c10bc..da3ddd8e94f 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -1,6 +1,6 @@ .. _dist-jobs: -Distributed Jobs on Many Nodes +Distributed Multi-Node Jobs ================================================ SkyPilot supports multi-node cluster From 9b36c6b5bd6f2490323c3f04ddfdb4c603b7d2a0 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 05:43:46 +0000 Subject: [PATCH 44/47] fix --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 09cd9f1be12..1d4280eb6ac 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -241,7 +241,7 @@ With the above setup, you can now scale out to run many jobs in parallel. You can either use SkyPilot CLI with many config files or use SkyPilot Python API. 
With CLI and Config Files -~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~ You can run many jobs in parallel by (1) creating multiple config files and (2) submitting them as :ref:`SkyPilot managed jobs `. From 8651facea762830bb4a8e6f88fbf62f686b36921 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 10:53:07 -0700 Subject: [PATCH 45/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 1d4280eb6ac..6051e188f11 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -18,7 +18,7 @@ This guide shows a typical workflow for running many jobs with SkyPilot. Why Use SkyPilot to Run Many Jobs ------------------------------------- -- **Unified**: Utilize resources on any of your infrastructure (Kubernetes, cloud VMs, reservations, etc.). +- **Unified**: Use any or multiple of your own infrastructure (Kubernetes, cloud VMs, reservations, etc.). - **Elastic**: Scale up and down based on demands. - **Cost-effective**: Only pay for the cheapest resources. - **Robust**: Automatically recover jobs from failures. From 0f02e4b72a87f18b4a2da749036131ed608b423e Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 29 Aug 2024 10:53:15 -0700 Subject: [PATCH 46/47] Update docs/source/running-jobs/many-jobs.rst Co-authored-by: Zongheng Yang --- docs/source/running-jobs/many-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/many-jobs.rst b/docs/source/running-jobs/many-jobs.rst index 6051e188f11..3d26d74e794 100644 --- a/docs/source/running-jobs/many-jobs.rst +++ b/docs/source/running-jobs/many-jobs.rst @@ -22,7 +22,7 @@ Why Use SkyPilot to Run Many Jobs - **Elastic**: Scale up and down based on demands. - **Cost-effective**: Only pay for the cheapest resources. 
 - **Robust**: Automatically recover jobs from failures.
-- **Observable**: See and manage all jobs in a single pane of glass.
+- **Observable**: Monitor and manage all jobs in a single pane of glass.
 
 Write a YAML for One Job
 -----------------------------------

From 83e2b5c8cb7393bf14402afe7a7100dbec04894e Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Thu, 29 Aug 2024 18:04:55 +0000
Subject: [PATCH 47/47] Minor fix for comments

---
 docs/source/_static/custom.js    | 1 -
 docs/source/developers/index.rst | 8 ++++++++
 docs/source/docs/index.rst       | 9 +--------
 3 files changed, 9 insertions(+), 9 deletions(-)
 create mode 100644 docs/source/developers/index.rst

diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js
index a15f0cce666..06b0cb2886a 100644
--- a/docs/source/_static/custom.js
+++ b/docs/source/_static/custom.js
@@ -27,7 +27,6 @@ document.addEventListener('DOMContentLoaded', () => {
     const newItems = [
         { selector: '.caption-text', text: 'SkyServe: Model Serving' },
         { selector: '.toctree-l1 > a', text: 'Managed Jobs' },
-        { selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
         { selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' },
         { selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
     ];
diff --git a/docs/source/developers/index.rst b/docs/source/developers/index.rst
new file mode 100644
index 00000000000..a6b76e8a53c
--- /dev/null
+++ b/docs/source/developers/index.rst
@@ -0,0 +1,8 @@
+Developer Guides
+=================
+
+.. toctree::
+   :maxdepth: 1
+
+   ../developers/CONTRIBUTING
+   Guide: Adding a New Cloud
diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst
index 59d8f54cc77..31eaf0c9106 100644
--- a/docs/source/docs/index.rst
+++ b/docs/source/docs/index.rst
@@ -185,14 +185,6 @@ Read the research:
    SkyPilot vs. Other Systems <../reference/comparison>
 
 
-.. toctree::
-   :hidden:
-   :maxdepth: 1
-   :caption: Developer Guides
-
-   ../developers/CONTRIBUTING
-   Guide: Adding a New Cloud
-
 .. toctree::
    :hidden:
    :maxdepth: 1
@@ -211,4 +203,5 @@ Read the research:
    ../reference/cli
    ../reference/api
    ../reference/config
+   ../developers/index