From bf33602836372dd70c2442c39d92e0cc27fa0e5b Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Mon, 16 Oct 2023 22:29:04 -0700 Subject: [PATCH] Docs: revamp. (#2693) * Docs: revamp. * Updates --- docs/source/examples/docker-containers.rst | 14 +-- .../source/examples/iterative-dev-project.rst | 85 ------------------- docs/source/examples/ports.rst | 12 +-- docs/source/examples/spot-jobs.rst | 8 ++ docs/source/getting-started/quickstart.rst | 8 +- docs/source/index.rst | 8 +- docs/source/reference/interactive-nodes.rst | 25 ++++-- docs/source/reference/job-queue.rst | 53 +++++++++++- docs/source/reference/kubernetes/index.rst | 2 +- docs/source/running-jobs/distributed-jobs.rst | 19 +++-- docs/source/running-jobs/grid-search.rst | 40 --------- docs/source/running-jobs/index.rst | 1 - 12 files changed, 110 insertions(+), 165 deletions(-) delete mode 100644 docs/source/examples/iterative-dev-project.rst delete mode 100644 docs/source/running-jobs/grid-search.rst diff --git a/docs/source/examples/docker-containers.rst b/docs/source/examples/docker-containers.rst index 99a967fd972..27506b1d81c 100644 --- a/docs/source/examples/docker-containers.rst +++ b/docs/source/examples/docker-containers.rst @@ -42,9 +42,9 @@ For Docker images hosted on private registries, you can provide the registry aut # SKYPILOT_DOCKER_PASSWORD: SKYPILOT_DOCKER_SERVER: .dkr.ecr.us-east-1.amazonaws.com -We suggest to setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets `): +We suggest setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets `): -.. code-block:: bash +.. code-block:: console $ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1) $ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD @@ -89,14 +89,14 @@ The :code:`echo_app` `example `_ via SkyPilot. \ No newline at end of file +Our GitHub repository has more examples, including running `Detectron2 in a Docker container `_ via SkyPilot. diff --git a/docs/source/examples/iterative-dev-project.rst b/docs/source/examples/iterative-dev-project.rst deleted file mode 100644 index 39f81f64f62..00000000000 --- a/docs/source/examples/iterative-dev-project.rst +++ /dev/null @@ -1,85 +0,0 @@ -.. _iter-dev: - -Iteratively Developing a Project -==================================== - -This page shows a typical workflow for iteratively developing and running a -project on SkyPilot. - -Getting an interactive node ----------------------------- -:ref:`Interactive nodes ` are easy-to-spin-up VMs that enable **fast development and interactive debugging**. - -To provision a GPU interactive node named :code:`dev`, run - -.. code-block:: console - - $ # Provisions/reuses an interactive node with a single K80 GPU. - $ sky gpunode -c dev --gpus K80 - -See the :ref:`CLI reference ` for all flags such as changing the GPU type and count. - -Running code --------------------- -To run a command or a script on the cluster, use :code:`sky exec`: - -.. code-block:: console - - $ # If the user has written a task.yaml, this directly - $ # executes the `run` section in the task YAML: - $ sky exec dev task.yaml - - $ # Run a script inside the workdir. - $ # Workdir contents are synced to the cluster (~/sky_workdir/). - $ sky exec dev -- python train_cpu.py - $ sky exec dev --gpus=V100:1 -- python train_gpu.py - -Alternatively, the user can directly :code:`ssh` into the cluster's nodes and run commands: - -.. 
code-block:: console - - $ # SSH into head node - $ ssh dev - - $ # SSH into worker nodes - $ ssh dev-worker1 - $ ssh dev-worker2 - -SkyPilot provides easy password-less SSH access by automatically creating entries for each cluster in :code:`~/.ssh/config`. -Referring to clusters by names also allows for seamless integration with common tools -such as :code:`scp`, :code:`rsync`, and `Visual Studio Code Remote -`_. - -.. note:: - - Refer to :ref:`Syncing Code and Artifacts ` for more details - on how to upload code and download outputs from the cluster. - -Ending a development session ------------------------------ -To end a development session: - -.. code-block:: console - - $ # Stop at the end of the work day: - $ sky stop dev - - $ # Or, to terminate: - $ sky down dev - -To restart a stopped cluster: - -.. code-block:: console - - $ # Restart it the next morning: - $ sky start dev - -.. note:: - - Stopping a cluster does not lose data on the attached disks (billing for the - instances will stop while the disks will still be charged). Those disks - will be reattached when restarting the cluster. - - Terminating a cluster will delete all associated resources (all billing - stops), and any data on the attached disks will be lost. Terminated - clusters cannot be restarted. diff --git a/docs/source/examples/ports.rst b/docs/source/examples/ports.rst index db94c8632f3..8452efe6967 100644 --- a/docs/source/examples/ports.rst +++ b/docs/source/examples/ports.rst @@ -9,7 +9,7 @@ At times, it might be crucial to expose specific ports on your cluster to the pu - **Creating Web Services**: Whether you're setting up a web server, database, or another service, they all communicate via specific ports that need to be accessible. - **Collaborative Tools**: Some tools and platforms may require port openings to enable collaboration with teammates or to integrate with other services. -Opening Ports for SkyPilot cluster +Opening Ports on a Cluster ---------------------------------- To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resources` section of your task. For example, here is a YAML configuration to expose a Jupyter Lab server: @@ -26,13 +26,13 @@ To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resour In this example, the :code:`run` command will start the Jupyter Lab server on port 8888. By specifying :code:`ports: 8888`, SkyPilot will expose port 8888 on the cluster, making the jupyter server publicly accessible. To launch and access the server, run: -.. code-block:: bash +.. code-block:: console $ sky launch -c jupyter jupyter_lab.yaml and look in for the logs for some output like: -.. code-block:: bash +.. code-block:: console Jupyter Server 2.7.0 is running at: http://127.0.0.1:8888/lab?token= @@ -40,8 +40,8 @@ and look in for the logs for some output like: To get the public IP address of the head node of the cluster, run :code:`sky status --ip jupyter`: -.. code-block:: bash - +.. code-block:: console + $ sky status --ip jupyter 35.223.97.21 @@ -59,7 +59,7 @@ If you want to expose multiple ports, you can specify a list of ports or port ra SkyPilot also support opening ports through the CLI: -.. code-block:: bash +.. 
code-block:: console

    $ sky launch -c jupyter --ports 8888 jupyter_lab.yaml

diff --git a/docs/source/examples/spot-jobs.rst b/docs/source/examples/spot-jobs.rst
index 43cfb9eaefa..c62246f97dc 100644
--- a/docs/source/examples/spot-jobs.rst
+++ b/docs/source/examples/spot-jobs.rst
@@ -3,6 +3,10 @@
 Managed Spot Jobs
 ================================================

+.. tip::
+
+  This feature is great for scaling out: running a single job for long durations, or running many jobs.
+
 SkyPilot supports managed spot jobs that can **automatically recover from preemptions**.
 This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances practical for long-running jobs.

@@ -185,6 +189,10 @@ cost savings from spot instances without worrying about preemption or losing pro

    $ sky spot launch -n bert-qa bert_qa.yaml

+.. tip::
+
+  Try copy-pasting this example and adapting it to your own job.
+
 Useful CLIs
 -----------

diff --git a/docs/source/getting-started/quickstart.rst b/docs/source/getting-started/quickstart.rst
index e12022c4eb9..38f1b1ab129 100644
--- a/docs/source/getting-started/quickstart.rst
+++ b/docs/source/getting-started/quickstart.rst
@@ -6,9 +6,9 @@ Quickstart

 This guide will walk you through:

-- defining a task in a simple YAML format
-- provisioning a cluster and running a task
-- using the core SkyPilot CLI commands
+- Defining a task in a simple YAML format
+- Provisioning a cluster and running a task
+- Using the core SkyPilot CLI commands

 Be sure to complete the :ref:`installation instructions ` first before continuing with this guide.

@@ -127,6 +127,8 @@ This may show multiple clusters, if you have created several:

   mycluster 4 mins ago 1x AWS(p3.2xlarge) sky exec mycluster hello_sky.yaml UP

+.. _ssh:
+
 SSH into clusters
 =================
 Simply run :code:`ssh ` to log into a cluster:
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 9c11a67b8a1..e6e48e8a0eb 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -104,9 +104,10 @@ Documentation
    :maxdepth: 1
    :caption: Running Jobs

+   examples/spot-jobs
    reference/job-queue
-   reference/tpu
    examples/auto-failover
+   reference/kubernetes/index
    running-jobs/index

 .. toctree::
@@ -130,11 +131,10 @@ Documentation

    examples/docker-containers
    examples/ports
-   examples/iterative-dev-project
+   reference/tpu
    reference/interactive-nodes
-   reference/faq
    reference/logging
-   reference/kubernetes/index
+   reference/faq

 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/reference/interactive-nodes.rst b/docs/source/reference/interactive-nodes.rst
index 1ba83100d9f..f92347abba5 100644
--- a/docs/source/reference/interactive-nodes.rst
+++ b/docs/source/reference/interactive-nodes.rst
@@ -86,6 +86,16 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus

    $ # Terminate entirely:
    $ sky down sky-gpunode-

+.. note::
+
+    Stopping a cluster does not lose data on the attached disks (billing for the
+    instances will stop while the disks will still be charged). Those disks
+    will be reattached when restarting the cluster.
+
+    Terminating a cluster will delete all associated resources (all billing
+    stops), and any data on the attached disks will be lost. Terminated
+    clusters cannot be restarted.
+
 .. note::

     Since :code:`sky start` restarts a stopped cluster, :ref:`auto-failover
@@ -96,8 +106,13 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus
 Getting multiple nodes
 ----------------------
 By default, interactive clusters are a single node. 
If you require a cluster -with multiple nodes (e.g., for hyperparameter tuning or distributed training), -use :code:`num_nodes` in a YAML spec: +with multiple nodes, use ``sky launch`` directly: + +.. code-block:: console + + $ sky launch -c my-cluster --num-nodes 16 --gpus V100:8 + +The same can be achieved with a YAML spec: .. code-block:: yaml @@ -110,8 +125,4 @@ use :code:`num_nodes` in a YAML spec: $ sky launch -c my-cluster multi_node.yaml -To log in to the head node: - -.. code-block:: console - - $ ssh my-cluster +You can then :ref:`SSH into any node ` of the cluster by name. diff --git a/docs/source/reference/job-queue.rst b/docs/source/reference/job-queue.rst index b969fddc87f..a42248239ad 100644 --- a/docs/source/reference/job-queue.rst +++ b/docs/source/reference/job-queue.rst @@ -4,10 +4,12 @@ Job Queue ========= SkyPilot's **job queue** allows multiple jobs to be scheduled on a cluster. -This enables parallel experiments or :ref:`hyperparameter tuning `. + +Getting started +-------------------------------- Each task submitted by :code:`sky exec` is automatically queued and scheduled -for execution: +for execution on an existing cluster: .. code-block:: bash @@ -90,10 +92,53 @@ For example, ``task.yaml`` above launches a 4-GPU task on each node that has 8 GPUs, so the task's ``run`` commands will be invoked with ``CUDA_VISIBLE_DEVICES`` populated with 4 device IDs. -If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all`` -as the correct environment variable is made available to the container (only the +If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all``; +the correct environment variable would be set inside the container (only the allocated device IDs will be set). +Example: Grid Search +---------------------- + +To submit multiple trials with different hyperparameters to a cluster: + +.. code-block:: bash + + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6 + +Options used: + +- :code:`--gpus`: specify the resource requirement for each job. +- :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently. + +If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the +other 4 run in parallel. Once a job finishes, the next job will begin executing +immediately. +See :ref:`below ` for more details on SkyPilot's scheduling behavior. + +.. tip:: + + You can also use :ref:`environment variables ` to set different arguments for each trial. + +Example: Fractional GPUs +------------------------- + +To run multiple trials per GPU, use *fractional GPUs* in the resource requirement. +For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU: + +.. code-block:: bash + + $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3 + $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3 + ... + +When sharing a GPU, ensure that the GPU's memory is not oversubscribed +(otherwise, out-of-memory errors could occur). + +.. 
_scheduling-behavior:
+
 Scheduling behavior
 --------------------------------

diff --git a/docs/source/reference/kubernetes/index.rst b/docs/source/reference/kubernetes/index.rst
index 7517c70d211..f3f950961aa 100644
--- a/docs/source/reference/kubernetes/index.rst
+++ b/docs/source/reference/kubernetes/index.rst
@@ -1,6 +1,6 @@
 .. _kubernetes-overview:

-Running on Kubernetes (Alpha)
+Running on Kubernetes
 =============================

 .. note::
diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst
index e356cca37a4..0f9ce453561 100644
--- a/docs/source/running-jobs/distributed-jobs.rst
+++ b/docs/source/running-jobs/distributed-jobs.rst
@@ -9,7 +9,7 @@ provisioning and distributed execution on many VMs.
 For example, here is a simple PyTorch Distributed training example:

 .. code-block:: yaml
-  :emphasize-lines: 6-6
+  :emphasize-lines: 6-6,21-22,24-25

   name: resnet-distributed-app

@@ -33,13 +33,19 @@ For example, here is a simple PyTorch Distributed training example:
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`

-    python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
-      --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
-      --master_port=8008 resnet_ddp.py --num_epochs 20
+    python3 -m torch.distributed.launch \
+    --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    --nnodes=$num_nodes \
+    --master_addr=$master_addr \
+    --master_port=8008 \
+    resnet_ddp.py --num_epochs 20

-In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
-nodes, with each node having 4 V100s.
+In the above:
+
+- :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s.
+- The highlighted lines in the ``run`` section show common environment variables that are useful for launching distributed training, explained below.

 Environment variables
 -----------------------------------------
@@ -120,4 +126,4 @@ This allows you directly to SSH into the worker nodes, if required.

    # Worker nodes.
    $ ssh mycluster-worker1
-   $ ssh mycluster-worker2
\ No newline at end of file
+   $ ssh mycluster-worker2
diff --git a/docs/source/running-jobs/grid-search.rst b/docs/source/running-jobs/grid-search.rst
deleted file mode 100644
index 4956bf6840a..00000000000
--- a/docs/source/running-jobs/grid-search.rst
+++ /dev/null
@@ -1,40 +0,0 @@
-.. _grid-search:
-
-Grid Search
-===========
-
-To submit multiple trials with different hyperparameters to a cluster:
-
-.. code-block:: bash
-
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
-
-Options used:
-
-- :code:`--gpus`: specify the resource requirement for each job.
-- :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently.
-
-If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the
-other 4 run in parallel. Once a job finishes, the next job will begin executing
-immediately.
-Refer to :ref:`Job Queue ` for more details on SkyPilot's scheduling behavior. 
- - -Multiple trials per GPU ------------------------ - -To run multiple trials per GPU, use *fractional GPUs* in the resource requirement. -For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU: - -.. code-block:: bash - - $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3 - $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3 - ... - -When sharing a GPU, ensure that the GPU's memory is not oversubscribed -(otherwise, out-of-memory errors could occur). diff --git a/docs/source/running-jobs/index.rst b/docs/source/running-jobs/index.rst index 0f7541b35d9..04c921d1022 100644 --- a/docs/source/running-jobs/index.rst +++ b/docs/source/running-jobs/index.rst @@ -4,5 +4,4 @@ More User Guides .. toctree:: distributed-jobs - grid-search environment-variables