Docs: revamp. (#2693)
* Docs: revamp.

* Updates
concretevitamin authored Oct 17, 2023
1 parent 8227ce9 commit bf33602
Showing 12 changed files with 110 additions and 165 deletions.
14 changes: 7 additions & 7 deletions docs/source/examples/docker-containers.rst
@@ -42,9 +42,9 @@ For Docker images hosted on private registries, you can provide the registry aut
# SKYPILOT_DOCKER_PASSWORD: <password>
SKYPILOT_DOCKER_SERVER: <your-user-id>.dkr.ecr.us-east-1.amazonaws.com
We suggest to setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):
We suggest setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):

.. code-block:: bash
.. code-block:: console
$ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1)
$ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD
@@ -89,14 +89,14 @@ The :code:`echo_app` `example <https://github.com/skypilot-org/skypilot/tree/mas
run: |
docker run --rm \
--volume="/inputs:/inputs:ro" \
--volume="/outputs:/outputs:rw" \
echo:v0 \
/inputs/README.md /outputs/output.txt
--volume="/inputs:/inputs:ro" \
--volume="/outputs:/outputs:rw" \
echo:v0 \
/inputs/README.md /outputs/output.txt
In this example, the Dockerfile and build context are contained in :code:`./echo_app`.
The :code:`setup` phase of the task builds the image, and the :code:`run` phase runs the container.
The inputs to the app are copied to SkyPilot using :code:`file_mounts` and mounted into the container using docker volume mounts (:code:`--volume` flag).
The output of the app produced at the :code:`/outputs` path in the container is also volume mounted to :code:`/outputs` on the VM, which gets directly written to an S3 bucket through SkyPilot Storage mounting.
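
For reference, a minimal sketch of what such a task YAML can look like (the bucket and path names below are illustrative, not the exact ``echo_app`` example):

.. code-block:: yaml

   file_mounts:
     /echo_app: ./echo_app   # Dockerfile and build context
     /inputs: ./inputs       # assumed local inputs directory
     /outputs:               # assumed S3 bucket, mounted via SkyPilot Storage
       name: my-echo-app-outputs
       mode: MOUNT

   setup: |
     # Build the image from the mounted build context.
     docker build -t echo:v0 /echo_app

   run: |
     docker run --rm \
       --volume="/inputs:/inputs:ro" \
       --volume="/outputs:/outputs:rw" \
       echo:v0 \
       /inputs/README.md /outputs/output.txt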

Our GitHub repository has more examples, including running `Detectron2 in a Docker container <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_ via SkyPilot.
85 changes: 0 additions & 85 deletions docs/source/examples/iterative-dev-project.rst

This file was deleted.

12 changes: 6 additions & 6 deletions docs/source/examples/ports.rst
@@ -9,7 +9,7 @@ At times, it might be crucial to expose specific ports on your cluster to the pu
- **Creating Web Services**: Whether you're setting up a web server, database, or another service, they all communicate via specific ports that need to be accessible.
- **Collaborative Tools**: Some tools and platforms may require port openings to enable collaboration with teammates or to integrate with other services.

Opening Ports for SkyPilot cluster
Opening Ports on a Cluster
----------------------------------

To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resources` section of your task. For example, here is a YAML configuration to expose a Jupyter Lab server:
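
The YAML itself is collapsed in this diff; a minimal sketch of such a ``jupyter_lab.yaml`` (the resource choices and commands below are assumptions) is:

.. code-block:: yaml

   name: jupyter-lab

   resources:
     ports: 8888

   setup: |
     pip install jupyterlab

   run: |
     jupyter lab --port 8888 --no-browser --ip=0.0.0.0
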
@@ -26,22 +26,22 @@ To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resour
In this example, the :code:`run` command will start the Jupyter Lab server on port 8888. By specifying :code:`ports: 8888`, SkyPilot will expose port 8888 on the cluster, making the Jupyter server publicly accessible. To launch and access the server, run:

.. code-block:: bash
.. code-block:: console
$ sky launch -c jupyter jupyter_lab.yaml
and look in the logs for output like:

.. code-block:: bash
.. code-block:: console
Jupyter Server 2.7.0 is running at:
http://127.0.0.1:8888/lab?token=<token>
To get the public IP address of the head node of the cluster, run :code:`sky status --ip jupyter`:

.. code-block:: bash
.. code-block:: console
$ sky status --ip jupyter
35.223.97.21
@@ -59,7 +59,7 @@ If you want to expose multiple ports, you can specify a list of ports or port ra
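
For instance, a sketch of a multi-port spec (the port numbers are illustrative):

.. code-block:: yaml

   resources:
     ports:
       - 8888
       - 10020-10040   # a port range
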
SkyPilot also supports opening ports through the CLI:

.. code-block:: bash
.. code-block:: console
$ sky launch -c jupyter --ports 8888 jupyter_lab.yaml
8 changes: 8 additions & 0 deletions docs/source/examples/spot-jobs.rst
@@ -3,6 +3,10 @@
Managed Spot Jobs
================================================

.. tip::

This feature is great for scaling out: running a single job for long durations, or running many jobs.

SkyPilot supports managed spot jobs that can **automatically recover from preemptions**.
This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances practical for long-running jobs.
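
At its simplest, a managed spot job is launched with a single command (a sketch; the task file name is illustrative, and a full BERT fine-tuning example appears later on this page):

.. code-block:: console

   $ sky spot launch -n my-job my_task.yaml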

@@ -185,6 +189,10 @@ cost savings from spot instances without worrying about preemption or losing pro
$ sky spot launch -n bert-qa bert_qa.yaml
.. tip::

Try copy-pasting this example and adapting it to your own job.


Useful CLIs
-----------
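
A few examples (a sketch; the section body is collapsed in this diff, and the job ID below is illustrative):

.. code-block:: console

   $ sky spot queue        # list all managed spot jobs and their statuses
   $ sky spot logs 3       # stream the logs of a job by its ID
   $ sky spot cancel 3     # cancel a job by its ID
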
8 changes: 5 additions & 3 deletions docs/source/getting-started/quickstart.rst
@@ -6,9 +6,9 @@ Quickstart

This guide will walk you through:

- defining a task in a simple YAML format
- provisioning a cluster and running a task
- using the core SkyPilot CLI commands
- Defining a task in a simple YAML format
- Provisioning a cluster and running a task
- Using the core SkyPilot CLI commands

Be sure to complete the :ref:`installation instructions <installation>` first before continuing with this guide.

@@ -127,6 +127,8 @@ This may show multiple clusters, if you have created several:
mycluster 4 mins ago 1x AWS(p3.2xlarge) sky exec mycluster hello_sky.yaml UP
.. _ssh:

SSH into clusters
=================
Simply run :code:`ssh <cluster_name>` to log into a cluster:
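
For example, with the ``mycluster`` cluster from above:

.. code-block:: console

   $ ssh mycluster
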
8 changes: 4 additions & 4 deletions docs/source/index.rst
@@ -104,9 +104,10 @@ Documentation
:maxdepth: 1
:caption: Running Jobs

examples/spot-jobs
reference/job-queue
reference/tpu
examples/auto-failover
reference/kubernetes/index
running-jobs/index

.. toctree::
Expand All @@ -130,11 +131,10 @@ Documentation

examples/docker-containers
examples/ports
examples/iterative-dev-project
reference/tpu
reference/interactive-nodes
reference/faq
reference/logging
reference/kubernetes/index
reference/faq

.. toctree::
:maxdepth: 1
25 changes: 18 additions & 7 deletions docs/source/reference/interactive-nodes.rst
@@ -86,6 +86,16 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus
$ # Terminate entirely:
$ sky down sky-gpunode-<username>
.. note::

Stopping a cluster does not lose data on the attached disks (billing for the
instances will stop while the disks will still be charged). Those disks
will be reattached when restarting the cluster.

Terminating a cluster will delete all associated resources (all billing
stops), and any data on the attached disks will be lost. Terminated
clusters cannot be restarted.

.. note::

Since :code:`sky start` restarts a stopped cluster, :ref:`auto-failover
@@ -96,8 +106,13 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus
Getting multiple nodes
----------------------
By default, interactive clusters are a single node. If you require a cluster
with multiple nodes (e.g., for hyperparameter tuning or distributed training),
use :code:`num_nodes` in a YAML spec:
with multiple nodes, use ``sky launch`` directly:

.. code-block:: console
$ sky launch -c my-cluster --num-nodes 16 --gpus V100:8
The same can be achieved with a YAML spec:

.. code-block:: yaml
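
   # A sketch of multi_node.yaml, assuming the same resources as the CLI example
   # above (the original YAML body is collapsed in this diff):
   num_nodes: 16

   resources:
     accelerators: V100:8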
@@ -110,8 +125,4 @@ use :code:`num_nodes` in a YAML spec:
$ sky launch -c my-cluster multi_node.yaml
To log in to the head node:

.. code-block:: console
$ ssh my-cluster
You can then :ref:`SSH into any node <ssh>` of the cluster by name.
53 changes: 49 additions & 4 deletions docs/source/reference/job-queue.rst
@@ -4,10 +4,12 @@ Job Queue
=========

SkyPilot's **job queue** allows multiple jobs to be scheduled on a cluster.
This enables parallel experiments or :ref:`hyperparameter tuning <grid-search>`.

Getting started
--------------------------------

Each task submitted by :code:`sky exec` is automatically queued and scheduled
for execution:
for execution on an existing cluster:

.. code-block:: bash
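
   # A sketch of the collapsed example: each task submitted with `sky exec` to an
   # existing cluster is queued and scheduled (cluster and file names are illustrative).
   $ sky launch -c mycluster task.yaml
   $ sky exec mycluster task.yaml
   $ sky exec mycluster task.yaml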
@@ -90,10 +92,53 @@ For example, ``task.yaml`` above launches a 4-GPU task on each node that has 8
GPUs, so the task's ``run`` commands will be invoked with
``CUDA_VISIBLE_DEVICES`` populated with 4 device IDs.

If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all``
as the correct environment variable is made available to the container (only the
If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all``;
the correct environment variable would be set inside the container (only the
allocated device IDs will be set).
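
A sketch of what this can look like in a task's ``run`` section (the image name and command are illustrative):

.. code-block:: yaml

   run: |
     # Per the note above, --gpus=all suffices; the container only sees the
     # allocated device IDs via the environment.
     docker run --gpus=all --rm my-image:v0 python train.py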

Example: Grid Search
----------------------

To submit multiple trials with different hyperparameters to a cluster:

.. code-block:: bash
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
$ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
Options used:

- :code:`--gpus`: specify the resource requirement for each job.
- :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently.

If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the
other 4 run in parallel. Once a job finishes, the next job will begin executing
immediately.
See :ref:`below <scheduling-behavior>` for more details on SkyPilot's scheduling behavior.

.. tip::

You can also use :ref:`environment variables <env-vars>` to set different arguments for each trial.
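
For instance, a sketch (assuming the task YAML declares an ``LR`` environment variable that its ``run`` commands read):

.. code-block:: console

   $ sky exec mycluster --gpus V100:1 -d --env LR=1e-3 task.yaml
   $ sky exec mycluster --gpus V100:1 -d --env LR=3e-3 task.yaml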

Example: Fractional GPUs
-------------------------

To run multiple trials per GPU, use *fractional GPUs* in the resource requirement.
For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU:

.. code-block:: bash
$ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3
...
When sharing a GPU, ensure that the GPU's memory is not oversubscribed
(otherwise, out-of-memory errors could occur).

.. _scheduling-behavior:

Scheduling behavior
--------------------------------
2 changes: 1 addition & 1 deletion docs/source/reference/kubernetes/index.rst
@@ -1,6 +1,6 @@
.. _kubernetes-overview:

Running on Kubernetes (Alpha)
Running on Kubernetes
=============================

.. note::
19 changes: 12 additions & 7 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -9,7 +9,7 @@ provisioning and distributed execution on many VMs.
For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6
:emphasize-lines: 6-6,21-22,24-25
name: resnet-distributed-app
@@ -33,13 +33,18 @@ For example, here is a simple PyTorch Distributed training example:
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
--master_port=8008 resnet_ddp.py --num_epochs 20
python3 -m torch.distributed.launch \
--nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
--node_rank=${SKYPILOT_NODE_RANK} \
--nnodes=$num_nodes \
--master_addr=$master_addr \
--master_port=8008 \
resnet_ddp.py --num_epochs 20
In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
nodes, with each node having 4 V100s.
In the above,

- :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s;
- The highlighted lines in the ``run`` section show common environment variables that are useful for launching distributed training, explained below.

Environment variables
-----------------------------------------
@@ -120,4 +125,4 @@ This allows you directly to SSH into the worker nodes, if required.
# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2