From 8d761f91f85a8c5698845b71ee879a9b95d2f5fa Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Thu, 16 May 2024 23:13:53 +0000
Subject: [PATCH 1/8] Clarify managed spot jobs against sky launch

---
 docs/source/examples/managed-jobs.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index ba449c1f087..cd7a5191781 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -23,6 +23,12 @@ Managed Spot Jobs
 SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
 Any spot preemptions are automatically handled by SkyPilot without user intervention.
 
+.. tip::
+
+  :code:`sky launch --use-spot` is a "serverful" command that launches a cluster for
+  running jobs, where recoveries of the cluster after preemptions is user's responsibility. In contrast, managed spot jobs, :code:`sky jobs launch --use-spot`, is a "serverless" command, where SkyPilot is in charge of the whole
+  lifecycle of each job, including provisioning clusters, monitoring job status, and recovering the job from preemptions.
+
 Here is an example of a BERT training job failing over different regions across AWS and GCP.
 
 .. image:: https://i.imgur.com/Vteg3fK.gif

From 9165b861fbd4e9613393db3444a1e6f5db6f6309 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sat, 18 May 2024 05:21:30 +0000
Subject: [PATCH 2/8] change to table

---
 docs/source/examples/managed-jobs.rst | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index cd7a5191781..4b2b45ba814 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -7,7 +7,7 @@ Managed Jobs
 
   This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines).
 
-SkyPilot supports **managed jobs**, which can automatically recover from any spot preemptions or hardware failures.
+SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
 It can be used in three modes:
 
 #. :ref:`Managed Spot Jobs `: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
@@ -20,14 +20,27 @@ It can be used in three modes:
 Managed Spot Jobs
 -----------------
 
-SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
+SkyPilot managed spot jobs automatically finds available spot resources across regions and clouds to maximize availability.
 Any spot preemptions are automatically handled by SkyPilot without user intervention.
 
-.. tip::
-
-  :code:`sky launch --use-spot` is a "serverful" command that launches a cluster for
-  running jobs, where recoveries of the cluster after preemptions is user's responsibility. In contrast, managed spot jobs, :code:`sky jobs launch --use-spot`, is a "serverless" command, where SkyPilot is in charge of the whole
-  lifecycle of each job, including provisioning clusters, monitoring job status, and recovering the job from preemptions.
+Difference between **managed spot jobs** and **unmanaged spot cluster** (:code:`sky jobs launch --use-spot` vs :code:`sky launch --use-spot`):
+
+.. list-table::
+   :widths: 30 15 12 37
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky launch --use-spot`
+     - No
+     - Yes
+     - Interactive dev on spot instances (e.g., :ref:`SSH, VSCode, Jupyter `).
+   * - :code:`sky jobs launch --use-spot`
+     - Yes (monitoring and recovery)
+     - No
+     - Scaling out long-running jobs (e.g., training, batch inference, data processing).
 
 Here is an example of a BERT training job failing over different regions across AWS and GCP.
 
From 268e3c7cb775a69f33e3685c80b3e9ab15142723 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sat, 18 May 2024 05:32:32 +0000
Subject: [PATCH 3/8] change to table

---
 docs/source/examples/managed-jobs.rst | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 4b2b45ba814..984f34a681e 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -20,13 +20,12 @@ It can be used in three modes:
 Managed Spot Jobs
 -----------------
 
-SkyPilot managed spot jobs automatically finds available spot resources across regions and clouds to maximize availability.
+*SkyPilot managed spot job* (:code:`sky jobs launch --use-spot`) automatically finds available spot resources across regions and clouds to maximize availability.
 Any spot preemptions are automatically handled by SkyPilot without user intervention.
 
-Difference between **managed spot jobs** and **unmanaged spot cluster** (:code:`sky jobs launch --use-spot` vs :code:`sky launch --use-spot`):
 
 .. list-table::
-   :widths: 30 15 12 37
+   :widths: 30 18 12 35
    :header-rows: 1
 
    * - Command
@@ -34,13 +33,13 @@ Difference between **managed spot jobs** and **unmanaged spot cluster** (:code:`
      - SSH-able?
      - Best for
    * - :code:`sky launch --use-spot`
-     - No
+     - Unmanaged spot cluster
      - Yes
      - Interactive dev on spot instances (e.g., :ref:`SSH, VSCode, Jupyter `).
    * - :code:`sky jobs launch --use-spot`
-     - Yes (monitoring and recovery)
+     - Managed spot job (auto-recovery)
      - No
-     - Scaling out long-running jobs (e.g., training, batch inference, data processing).
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference).
 
 Here is an example of a BERT training job failing over different regions across AWS and GCP.
 
From 2882cf321dca19059bbd45dcb63dec18f791ffe9 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sat, 18 May 2024 21:09:05 -0700
Subject: [PATCH 4/8] Update docs/source/examples/managed-jobs.rst

Co-authored-by: Zongheng Yang
---
 docs/source/examples/managed-jobs.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 984f34a681e..491fd054996 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -24,6 +24,7 @@ Managed Spot Jobs
 Any spot preemptions are automatically handled by SkyPilot without user intervention.
 
 
+Quick comparison between _unmanaged spot clusters_ vs. _managed spot jobs_:
 .. list-table::
    :widths: 30 18 12 35
    :header-rows: 1

From 2ebf6f5c028997d4fe8f3fca8c2e41cc863bea08 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sat, 18 May 2024 21:10:04 -0700
Subject: [PATCH 5/8] Update docs/source/examples/managed-jobs.rst

Co-authored-by: Zongheng Yang
---
 docs/source/examples/managed-jobs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 491fd054996..a354545a2c7 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -20,7 +20,7 @@ It can be used in three modes:
 Managed Spot Jobs
 -----------------
 
-*SkyPilot managed spot job* (:code:`sky jobs launch --use-spot`) automatically finds available spot resources across regions and clouds to maximize availability.
+In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
 Any spot preemptions are automatically handled by SkyPilot without user intervention.
 
 
From 40daaebfdf0a73ef4c72f672a5a9b342de6aa643 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sat, 18 May 2024 21:10:10 -0700
Subject: [PATCH 6/8] Update docs/source/examples/managed-jobs.rst

Co-authored-by: Zongheng Yang
---
 docs/source/examples/managed-jobs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index a354545a2c7..709f20815a7 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -40,7 +40,7 @@ Quick comparison between _unmanaged spot clusters_ vs. _managed spot jobs_:
    * - :code:`sky jobs launch --use-spot`
      - Managed spot job (auto-recovery)
      - No
-     - Scaling out long-running jobs (e.g., data processing, training, batch inference).
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
 
 Here is an example of a BERT training job failing over different regions across AWS and GCP.
 

From cff269e495749b043bb0a28f52832222771628b9 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sat, 18 May 2024 21:10:45 -0700
Subject: [PATCH 7/8] Update docs/source/examples/managed-jobs.rst

Co-authored-by: Zongheng Yang
---
 docs/source/examples/managed-jobs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 709f20815a7..1fd1e72be6b 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -36,7 +36,7 @@ Quick comparison between _unmanaged spot clusters_ vs. _managed spot jobs_:
    * - :code:`sky launch --use-spot`
      - Unmanaged spot cluster
      - Yes
-     - Interactive dev on spot instances (e.g., :ref:`SSH, VSCode, Jupyter `).
+     - Interactive dev on spot instances (best for hardware with low preemption rates)
    * - :code:`sky jobs launch --use-spot`
      - Managed spot job (auto-recovery)
      - No

From 2756ef47a230445e1c8dd8f7a0a47cd4865d6239 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sun, 19 May 2024 04:14:31 +0000
Subject: [PATCH 8/8] fix

---
 docs/source/examples/managed-jobs.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 1fd1e72be6b..a47b4345b9f 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -24,7 +24,8 @@ In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spo
 Any spot preemptions are automatically handled by SkyPilot without user intervention.
 
 
-Quick comparison between _unmanaged spot clusters_ vs. _managed spot jobs_:
+Quick comparison between *unmanaged spot clusters* vs. *managed spot jobs*:
+
 .. list-table::
    :widths: 30 18 12 35
    :header-rows: 1
@@ -36,7 +37,7 @@ Quick comparison between _unmanaged spot clusters_ vs. _managed spot jobs_:
    * - :code:`sky launch --use-spot`
      - Unmanaged spot cluster
      - Yes
-     - Interactive dev on spot instances (best for hardware with low preemption rates)
+     - Interactive dev on spot instances (especially for hardware with low preemption rates)
    * - :code:`sky jobs launch --use-spot`
      - Managed spot job (auto-recovery)
      - No