[Docs] Clarify managed spot jobs against sky launch #3561

Merged 8 commits on May 19, 2024
Changes from 3 commits
22 changes: 20 additions & 2 deletions docs/source/examples/managed-jobs.rst
@@ -7,7 +7,7 @@ Managed Jobs

This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines).

-SkyPilot supports **managed jobs**, which can automatically recover from any spot preemptions or hardware failures.
+SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
It can be used in three modes:

#. :ref:`Managed Spot Jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
@@ -20,9 +20,27 @@ It can be used in three modes:
Managed Spot Jobs
-----------------

-SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
+*SkyPilot managed spot job* (:code:`sky jobs launch --use-spot`) automatically finds available spot resources across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.


+.. list-table::
+   :widths: 30 18 12 35
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky launch --use-spot`
+     - Unmanaged spot cluster
+     - Yes
+     - Interactive dev on spot instances (e.g., :ref:`SSH, VSCode, Jupyter <dev-connect>`).
+   * - :code:`sky jobs launch --use-spot`
+     - Managed spot job (auto-recovery)
+     - No
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference).

Here is an example of a BERT training job failing over different regions across AWS and GCP.

.. image:: https://i.imgur.com/Vteg3fK.gif
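
The table added in this diff contrasts two commands, and the choice between them can be sketched as a small helper. This is only an illustration, not part of SkyPilot or of this PR: the function name ``spot_command``, the ``need_ssh`` flag, and the ``task.yaml`` path are hypothetical. Only the two command lines themselves come from the table. The helper builds the argument list and does not invoke ``sky``.

```python
import shlex
from typing import List


def spot_command(task_yaml: str, need_ssh: bool) -> List[str]:
    """Build the CLI arguments for running a task on spot instances.

    need_ssh=True  -> unmanaged spot cluster (`sky launch --use-spot`),
                      suited to interactive development.
    need_ssh=False -> managed spot job (`sky jobs launch --use-spot`),
                      which auto-recovers from preemptions.
    """
    # Hypothetical helper: it only assembles the command, it does not run it.
    base = ["sky", "launch"] if need_ssh else ["sky", "jobs", "launch"]
    return base + ["--use-spot", task_yaml]


if __name__ == "__main__":
    # Print both variants for a placeholder task file.
    for need_ssh in (True, False):
        print(shlex.join(spot_command("task.yaml", need_ssh)))
```

In this sketch a single yes/no question (is SSH access needed?) selects the command, mirroring the "SSH-able?" column. Real workloads may weigh other factors, such as job duration or tolerance for manual restarts.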