Docs: revamp. (#2693)
* Docs: revamp.

* Updates
concretevitamin authored Oct 17, 2023
1 parent 8227ce9 commit bf33602
Showing 12 changed files with 110 additions and 165 deletions.
14 changes: 7 additions & 7 deletions docs/source/examples/docker-containers.rst
@@ -42,9 +42,9 @@ For Docker images hosted on private registries, you can provide the registry aut
# SKYPILOT_DOCKER_PASSWORD: <password>
SKYPILOT_DOCKER_SERVER: <your-user-id>.dkr.ecr.us-east-1.amazonaws.com
We suggest to setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):
We suggest setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):

.. code-block:: bash
.. code-block:: console
$ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1)
$ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD
@@ -89,14 +89,14 @@ The :code:`echo_app` `example <https://github.com/skypilot-org/skypilot/tree/mas
run: |
docker run --rm \
--volume="/inputs:/inputs:ro" \
--volume="/outputs:/outputs:rw" \
echo:v0 \
/inputs/README.md /outputs/output.txt
--volume="/inputs:/inputs:ro" \
--volume="/outputs:/outputs:rw" \
echo:v0 \
/inputs/README.md /outputs/output.txt
In this example, the Dockerfile and build context are contained in :code:`./echo_app`.
The :code:`setup` phase of the task builds the image, and the :code:`run` phase runs the container.
The inputs to the app are copied to SkyPilot using :code:`file_mounts` and mounted into the container using docker volume mounts (:code:`--volume` flag).
The output of the app produced at the :code:`/outputs` path in the container is also volume mounted to :code:`/outputs` on the VM, which gets directly written to an S3 bucket through SkyPilot Storage mounting.
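
For reference, a minimal sketch of what such a task YAML can look like (the bucket and path names below are illustrative, not the exact ``echo_app`` example):

.. code-block:: yaml

   file_mounts:
     /echo_app: ./echo_app   # Dockerfile and build context
     /inputs: ./inputs       # assumed local inputs directory
     /outputs:               # assumed S3 bucket, mounted via SkyPilot Storage
       name: my-echo-app-outputs
       mode: MOUNT

   setup: |
     # Build the image from the mounted build context.
     docker build -t echo:v0 /echo_app

   run: |
     docker run --rm \
       --volume="/inputs:/inputs:ro" \
       --volume="/outputs:/outputs:rw" \
       echo:v0 \
       /inputs/README.md /outputs/output.txt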

Our GitHub repository has more examples, including running `Detectron2 in a Docker container <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_ via SkyPilot.
85 changes: 0 additions & 85 deletions docs/source/examples/iterative-dev-project.rst

This file was deleted.

12 changes: 6 additions & 6 deletions docs/source/examples/ports.rst
@@ -9,7 +9,7 @@ At times, it might be crucial to expose specific ports on your cluster to the pu
- **Creating Web Services**: Whether you're setting up a web server, database, or another service, they all communicate via specific ports that need to be accessible.
- **Collaborative Tools**: Some tools and platforms may require port openings to enable collaboration with teammates or to integrate with other services.

Opening Ports for SkyPilot cluster
Opening Ports on a Cluster
----------------------------------

To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resources` section of your task. For example, here is a YAML configuration to expose a Jupyter Lab server:
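
The YAML itself is collapsed in this diff; a minimal sketch of such a ``jupyter_lab.yaml`` (the resource choices and commands below are assumptions) is:

.. code-block:: yaml

   name: jupyter-lab

   resources:
     ports: 8888

   setup: |
     pip install jupyterlab

   run: |
     jupyter lab --port 8888 --no-browser --ip=0.0.0.0
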
@@ -26,22 +26,22 @@ To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resour
In this example, the :code:`run` command will start the Jupyter Lab server on port 8888. By specifying :code:`ports: 8888`, SkyPilot will expose port 8888 on the cluster, making the Jupyter server publicly accessible. To launch and access the server, run:

.. code-block:: bash
.. code-block:: console
$ sky launch -c jupyter jupyter_lab.yaml
and look in the logs for output like:

.. code-block:: bash
.. code-block:: console
Jupyter Server 2.7.0 is running at:
http://127.0.0.1:8888/lab?token=<token>
To get the public IP address of the head node of the cluster, run :code:`sky status --ip jupyter`:

.. code-block:: bash
.. code-block:: console
$ sky status --ip jupyter
35.223.97.21
@@ -59,7 +59,7 @@ If you want to expose multiple ports, you can specify a list of ports or port ra
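
For instance, a sketch of a multi-port spec (the port numbers are illustrative):

.. code-block:: yaml

   resources:
     ports:
       - 8888
       - 10020-10040   # a port range
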
SkyPilot also supports opening ports through the CLI:

.. code-block:: bash
.. code-block:: console
$ sky launch -c jupyter --ports 8888 jupyter_lab.yaml
8 changes: 8 additions & 0 deletions docs/source/examples/spot-jobs.rst
@@ -3,6 +3,10 @@
Managed Spot Jobs
================================================

.. tip::

This feature is great for scaling out: running a single job for long durations, or running many jobs.

SkyPilot supports managed spot jobs that can **automatically recover from preemptions**.
This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances practical for long-running jobs.
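
At its simplest, a managed spot job is launched with a single command (a sketch; the task file name is illustrative, and a full BERT fine-tuning example appears later on this page):

.. code-block:: console

   $ sky spot launch -n my-job my_task.yaml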

@@ -185,6 +189,10 @@ cost savings from spot instances without worrying about preemption or losing pro
$ sky spot launch -n bert-qa bert_qa.yaml
.. tip::

Try copy-pasting this example and adapting it to your own job.


Useful CLIs
-----------
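
A few examples (a sketch; the section body is collapsed in this diff, and the job ID below is illustrative):

.. code-block:: console

   $ sky spot queue        # list all managed spot jobs and their statuses
   $ sky spot logs 3       # stream the logs of a job by its ID
   $ sky spot cancel 3     # cancel a job by its ID
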
8 changes: 5 additions & 3 deletions docs/source/getting-started/quickstart.rst
@@ -6,9 +6,9 @@ Quickstart

This guide will walk you through:

- defining a task in a simple YAML format
- provisioning a cluster and running a task
- using the core SkyPilot CLI commands
- Defining a task in a simple YAML format
- Provisioning a cluster and running a task
- Using the core SkyPilot CLI commands

Be sure to complete the :ref:`installation instructions <installation>` first before continuing with this guide.

@@ -127,6 +127,8 @@ This may show multiple clusters, if you have created several:
mycluster 4 mins ago 1x AWS(p3.2xlarge) sky exec mycluster hello_sky.yaml UP
.. _ssh:

SSH into clusters
=================
Simply run :code:`ssh <cluster_name>` to log into a cluster:
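
For example, with the ``mycluster`` cluster from above:

.. code-block:: console

   $ ssh mycluster
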
8 changes: 4 additions & 4 deletions docs/source/index.rst
@@ -104,9 +104,10 @@ Documentation
:maxdepth: 1
:caption: Running Jobs

examples/spot-jobs
reference/job-queue
reference/tpu
examples/auto-failover
reference/kubernetes/index
running-jobs/index

.. toctree::
Expand All @@ -130,11 +131,10 @@ Documentation

examples/docker-containers
examples/ports
examples/iterative-dev-project
reference/tpu
reference/interactive-nodes
reference/faq
reference/logging
reference/kubernetes/index
reference/faq

.. toctree::
:maxdepth: 1
25 changes: 18 additions & 7 deletions docs/source/reference/interactive-nodes.rst
@@ -86,6 +86,16 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus
$ # Terminate entirely:
$ sky down sky-gpunode-<username>
.. note::

Stopping a cluster does not lose data on the attached disks (billing for the
instances will stop while the disks will still be charged). Those disks
will be reattached when restarting the cluster.

Terminating a cluster will delete all associated resources (all billing
stops), and any data on the attached disks will be lost. Terminated
clusters cannot be restarted.

.. note::

Since :code:`sky start` restarts a stopped cluster, :ref:`auto-failover
@@ -96,8 +106,13 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus
Getting multiple nodes
----------------------
By default, interactive clusters are a single node. If you require a cluster
with multiple nodes (e.g., for hyperparameter tuning or distributed training),
use :code:`num_nodes` in a YAML spec:
with multiple nodes, use ``sky launch`` directly:

.. code-block:: console
$ sky launch -c my-cluster --num-nodes 16 --gpus V100:8
The same can be achieved with a YAML spec:

.. code-block:: yaml
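
   # A sketch of multi_node.yaml, assuming the same resources as the CLI example
   # above (the original YAML body is collapsed in this diff):
   num_nodes: 16

   resources:
     accelerators: V100:8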
@@ -110,8 +125,4 @@ use :code:`num_nodes` in a YAML spec:
$ sky launch -c my-cluster multi_node.yaml
To log in to the head node:

.. code-block:: console
$ ssh my-cluster
You can then :ref:`SSH into any node <ssh>` of the cluster by name.
53 changes: 49 additions & 4 deletions docs/source/reference/job-queue.rst
@@ -4,10 +4,12 @@ Job Queue
=========

SkyPilot's **job queue** allows multiple jobs to be scheduled on a cluster.
This enables parallel experiments or :ref:`hyperparameter tuning <grid-search>`.

Getting started
--------------------------------

Each task submitted by :code:`sky exec` is automatically queued and scheduled
for execution:
for execution on an existing cluster:

.. code-block:: bash
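
   # A sketch of the collapsed example: each task submitted with `sky exec` to an
   # existing cluster is queued and scheduled (cluster and file names are illustrative).
   $ sky launch -c mycluster task.yaml
   $ sky exec mycluster task.yaml
   $ sky exec mycluster task.yaml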
@@ -90,10 +92,53 @@ For example, ``task.yaml`` above launches a 4-GPU task on each node that has 8
GPUs, so the task's ``run`` commands will be invoked with
``CUDA_VISIBLE_DEVICES`` populated with 4 device IDs.

If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all``
as the correct environment variable is made available to the container (only the
If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all``;
the correct environment variable would be set inside the container (only the
allocated device IDs will be set).
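
A sketch of what this can look like in a task's ``run`` section (the image name and command are illustrative):

.. code-block:: yaml

   run: |
     # Per the note above, --gpus=all suffices; the container only sees the
     # allocated device IDs via the environment.
     docker run --gpus=all --rm my-image:v0 python train.py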

Example: Grid Search
----------------------

To submit multiple trials with different hyperparameters to a cluster:

.. code-block:: bash
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
Options used:

- :code:`--gpus`: specify the resource requirement for each job.
- :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently.

If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the
other 4 run in parallel. Once a job finishes, the next job will begin executing
immediately.
See :ref:`below <scheduling-behavior>` for more details on SkyPilot's scheduling behavior.

.. tip::

You can also use :ref:`environment variables <env-vars>` to set different arguments for each trial.
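
For instance, a sketch (assuming the task YAML declares an ``LR`` environment variable that its ``run`` commands read):

.. code-block:: console

   $ sky exec mycluster --gpus V100:1 -d --env LR=1e-3 task.yaml
   $ sky exec mycluster --gpus V100:1 -d --env LR=3e-3 task.yaml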

Example: Fractional GPUs
-------------------------

To run multiple trials per GPU, use *fractional GPUs* in the resource requirement.
For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU:

.. code-block:: bash
$ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3
...
When sharing a GPU, ensure that the GPU's memory is not oversubscribed
(otherwise, out-of-memory errors could occur).

.. _scheduling-behavior:

Scheduling behavior
--------------------------------
2 changes: 1 addition & 1 deletion docs/source/reference/kubernetes/index.rst
@@ -1,6 +1,6 @@
.. _kubernetes-overview:

Running on Kubernetes (Alpha)
Running on Kubernetes
=============================

.. note::
19 changes: 12 additions & 7 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -9,7 +9,7 @@ provisioning and distributed execution on many VMs.
For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6
:emphasize-lines: 6-6,21-22,24-25
name: resnet-distributed-app
@@ -33,13 +33,18 @@ For example, here is a simple PyTorch Distributed training example:
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
--master_port=8008 resnet_ddp.py --num_epochs 20
python3 -m torch.distributed.launch \
--nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
--node_rank=${SKYPILOT_NODE_RANK} \
--nnodes=$num_nodes \
--master_addr=$master_addr \
--master_port=8008 \
resnet_ddp.py --num_epochs 20
In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
nodes, with each node having 4 V100s.
In the above,

- :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s;
- The highlighted lines in the ``run`` section show common environment variables that are useful for launching distributed training, explained below.

Environment variables
-----------------------------------------
@@ -120,4 +125,4 @@ This allows you directly to SSH into the worker nodes, if required.
# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2