From bf33602836372dd70c2442c39d92e0cc27fa0e5b Mon Sep 17 00:00:00 2001 From: Zongheng Yang Date: Mon, 16 Oct 2023 22:29:04 -0700 Subject: [PATCH] Docs: revamp. (#2693) * Docs: revamp. * Updates --- docs/source/examples/docker-containers.rst | 14 +-- .../source/examples/iterative-dev-project.rst | 85 ------------------- docs/source/examples/ports.rst | 12 +-- docs/source/examples/spot-jobs.rst | 8 ++ docs/source/getting-started/quickstart.rst | 8 +- docs/source/index.rst | 8 +- docs/source/reference/interactive-nodes.rst | 25 ++++-- docs/source/reference/job-queue.rst | 53 +++++++++++- docs/source/reference/kubernetes/index.rst | 2 +- docs/source/running-jobs/distributed-jobs.rst | 19 +++-- docs/source/running-jobs/grid-search.rst | 40 --------- docs/source/running-jobs/index.rst | 1 - 12 files changed, 110 insertions(+), 165 deletions(-) delete mode 100644 docs/source/examples/iterative-dev-project.rst delete mode 100644 docs/source/running-jobs/grid-search.rst diff --git a/docs/source/examples/docker-containers.rst b/docs/source/examples/docker-containers.rst index 99a967fd972..27506b1d81c 100644 --- a/docs/source/examples/docker-containers.rst +++ b/docs/source/examples/docker-containers.rst @@ -42,9 +42,9 @@ For Docker images hosted on private registries, you can provide the registry aut # SKYPILOT_DOCKER_PASSWORD: SKYPILOT_DOCKER_SERVER: .dkr.ecr.us-east-1.amazonaws.com -We suggest to setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets `): +We suggest setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets `): -.. code-block:: bash +.. code-block:: console $ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1) $ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD @@ -89,14 +89,14 @@ The :code:`echo_app` `example `_ via SkyPilot. \ No newline at end of file +Our GitHub repository has more examples, including running `Detectron2 in a Docker container `_ via SkyPilot. diff --git a/docs/source/examples/iterative-dev-project.rst b/docs/source/examples/iterative-dev-project.rst deleted file mode 100644 index 39f81f64f62..00000000000 --- a/docs/source/examples/iterative-dev-project.rst +++ /dev/null @@ -1,85 +0,0 @@ -.. _iter-dev: - -Iteratively Developing a Project -==================================== - -This page shows a typical workflow for iteratively developing and running a -project on SkyPilot. - -Getting an interactive node ----------------------------- -:ref:`Interactive nodes ` are easy-to-spin-up VMs that enable **fast development and interactive debugging**. - -To provision a GPU interactive node named :code:`dev`, run - -.. code-block:: console - - $ # Provisions/reuses an interactive node with a single K80 GPU. - $ sky gpunode -c dev --gpus K80 - -See the :ref:`CLI reference ` for all flags such as changing the GPU type and count. - -Running code --------------------- -To run a command or a script on the cluster, use :code:`sky exec`: - -.. code-block:: console - - $ # If the user has written a task.yaml, this directly - $ # executes the `run` section in the task YAML: - $ sky exec dev task.yaml - - $ # Run a script inside the workdir. - $ # Workdir contents are synced to the cluster (~/sky_workdir/). - $ sky exec dev -- python train_cpu.py - $ sky exec dev --gpus=V100:1 -- python train_gpu.py - -Alternatively, the user can directly :code:`ssh` into the cluster's nodes and run commands: - -.. 
code-block:: console - - $ # SSH into head node - $ ssh dev - - $ # SSH into worker nodes - $ ssh dev-worker1 - $ ssh dev-worker2 - -SkyPilot provides easy password-less SSH access by automatically creating entries for each cluster in :code:`~/.ssh/config`. -Referring to clusters by names also allows for seamless integration with common tools -such as :code:`scp`, :code:`rsync`, and `Visual Studio Code Remote -`_. - -.. note:: - - Refer to :ref:`Syncing Code and Artifacts ` for more details - on how to upload code and download outputs from the cluster. - -Ending a development session ------------------------------ -To end a development session: - -.. code-block:: console - - $ # Stop at the end of the work day: - $ sky stop dev - - $ # Or, to terminate: - $ sky down dev - -To restart a stopped cluster: - -.. code-block:: console - - $ # Restart it the next morning: - $ sky start dev - -.. note:: - - Stopping a cluster does not lose data on the attached disks (billing for the - instances will stop while the disks will still be charged). Those disks - will be reattached when restarting the cluster. - - Terminating a cluster will delete all associated resources (all billing - stops), and any data on the attached disks will be lost. Terminated - clusters cannot be restarted. diff --git a/docs/source/examples/ports.rst b/docs/source/examples/ports.rst index db94c8632f3..8452efe6967 100644 --- a/docs/source/examples/ports.rst +++ b/docs/source/examples/ports.rst @@ -9,7 +9,7 @@ At times, it might be crucial to expose specific ports on your cluster to the pu - **Creating Web Services**: Whether you're setting up a web server, database, or another service, they all communicate via specific ports that need to be accessible. - **Collaborative Tools**: Some tools and platforms may require port openings to enable collaboration with teammates or to integrate with other services. -Opening Ports for SkyPilot cluster +Opening Ports on a Cluster ---------------------------------- To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resources` section of your task. For example, here is a YAML configuration to expose a Jupyter Lab server: @@ -26,13 +26,13 @@ To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resour In this example, the :code:`run` command will start the Jupyter Lab server on port 8888. By specifying :code:`ports: 8888`, SkyPilot will expose port 8888 on the cluster, making the jupyter server publicly accessible. To launch and access the server, run: -.. code-block:: bash +.. code-block:: console $ sky launch -c jupyter jupyter_lab.yaml and look in for the logs for some output like: -.. code-block:: bash +.. code-block:: console Jupyter Server 2.7.0 is running at: http://127.0.0.1:8888/lab?token= @@ -40,8 +40,8 @@ and look in for the logs for some output like: To get the public IP address of the head node of the cluster, run :code:`sky status --ip jupyter`: -.. code-block:: bash - +.. code-block:: console + $ sky status --ip jupyter 35.223.97.21 @@ -59,7 +59,7 @@ If you want to expose multiple ports, you can specify a list of ports or port ra SkyPilot also support opening ports through the CLI: -.. code-block:: bash +.. 
code-block:: console

    $ sky launch -c jupyter --ports 8888 jupyter_lab.yaml

diff --git a/docs/source/examples/spot-jobs.rst b/docs/source/examples/spot-jobs.rst
index 43cfb9eaefa..c62246f97dc 100644
--- a/docs/source/examples/spot-jobs.rst
+++ b/docs/source/examples/spot-jobs.rst
@@ -3,6 +3,10 @@
 Managed Spot Jobs
 ================================================

+.. tip::
+
+  This feature is great for scaling out: running a single job for long durations, or running many jobs.
+
 SkyPilot supports managed spot jobs that can **automatically recover from preemptions**.
 This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances practical for long-running jobs.

@@ -185,6 +189,10 @@ cost savings from spot instances without worrying about preemption or losing pro

    $ sky spot launch -n bert-qa bert_qa.yaml

+.. tip::
+
+  Try copy-pasting this example and adapting it to your own job.
+
 Useful CLIs
 -----------

diff --git a/docs/source/getting-started/quickstart.rst b/docs/source/getting-started/quickstart.rst
index e12022c4eb9..38f1b1ab129 100644
--- a/docs/source/getting-started/quickstart.rst
+++ b/docs/source/getting-started/quickstart.rst
@@ -6,9 +6,9 @@ Quickstart

 This guide will walk you through:

-- defining a task in a simple YAML format
-- provisioning a cluster and running a task
-- using the core SkyPilot CLI commands
+- Defining a task in a simple YAML format
+- Provisioning a cluster and running a task
+- Using the core SkyPilot CLI commands

 Be sure to complete the :ref:`installation instructions ` first before continuing with this guide.

@@ -127,6 +127,8 @@ This may show multiple clusters, if you have created several:

   mycluster 4 mins ago 1x AWS(p3.2xlarge) sky exec mycluster hello_sky.yaml UP

+.. _ssh:
+
 SSH into clusters
 =================
 Simply run :code:`ssh ` to log into a cluster:
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 9c11a67b8a1..e6e48e8a0eb 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -104,9 +104,10 @@ Documentation
    :maxdepth: 1
    :caption: Running Jobs

+   examples/spot-jobs
    reference/job-queue
-   reference/tpu
    examples/auto-failover
+   reference/kubernetes/index
    running-jobs/index

 .. toctree::
@@ -130,11 +131,10 @@ Documentation

    examples/docker-containers
    examples/ports
-   examples/iterative-dev-project
+   reference/tpu
    reference/interactive-nodes
-   reference/faq
    reference/logging
-   reference/kubernetes/index
+   reference/faq

 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/reference/interactive-nodes.rst b/docs/source/reference/interactive-nodes.rst
index 1ba83100d9f..f92347abba5 100644
--- a/docs/source/reference/interactive-nodes.rst
+++ b/docs/source/reference/interactive-nodes.rst
@@ -86,6 +86,16 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus

    $ # Terminate entirely:
    $ sky down sky-gpunode-

+.. note::
+
+    Stopping a cluster does not lose data on the attached disks (billing for the
+    instances will stop while the disks will still be charged). Those disks
+    will be reattached when restarting the cluster.
+
+    Terminating a cluster will delete all associated resources (all billing
+    stops), and any data on the attached disks will be lost. Terminated
+    clusters cannot be restarted.
+
 .. note::

     Since :code:`sky start` restarts a stopped cluster, :ref:`auto-failover
@@ -96,8 +106,13 @@ Interactive nodes can be stopped, restarted, and terminated, like any other clus
 Getting multiple nodes
 ----------------------
 By default, interactive clusters are a single node. 
If you require a cluster -with multiple nodes (e.g., for hyperparameter tuning or distributed training), -use :code:`num_nodes` in a YAML spec: +with multiple nodes, use ``sky launch`` directly: + +.. code-block:: console + + $ sky launch -c my-cluster --num-nodes 16 --gpus V100:8 + +The same can be achieved with a YAML spec: .. code-block:: yaml @@ -110,8 +125,4 @@ use :code:`num_nodes` in a YAML spec: $ sky launch -c my-cluster multi_node.yaml -To log in to the head node: - -.. code-block:: console - - $ ssh my-cluster +You can then :ref:`SSH into any node ` of the cluster by name. diff --git a/docs/source/reference/job-queue.rst b/docs/source/reference/job-queue.rst index b969fddc87f..a42248239ad 100644 --- a/docs/source/reference/job-queue.rst +++ b/docs/source/reference/job-queue.rst @@ -4,10 +4,12 @@ Job Queue ========= SkyPilot's **job queue** allows multiple jobs to be scheduled on a cluster. -This enables parallel experiments or :ref:`hyperparameter tuning `. + +Getting started +-------------------------------- Each task submitted by :code:`sky exec` is automatically queued and scheduled -for execution: +for execution on an existing cluster: .. code-block:: bash @@ -90,10 +92,53 @@ For example, ``task.yaml`` above launches a 4-GPU task on each node that has 8 GPUs, so the task's ``run`` commands will be invoked with ``CUDA_VISIBLE_DEVICES`` populated with 4 device IDs. -If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all`` -as the correct environment variable is made available to the container (only the +If your ``run`` commands use Docker/``docker run``, simply pass ``--gpus=all``; +the correct environment variable would be set inside the container (only the allocated device IDs will be set). +Example: Grid Search +---------------------- + +To submit multiple trials with different hyperparameters to a cluster: + +.. code-block:: bash + + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2 + $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6 + +Options used: + +- :code:`--gpus`: specify the resource requirement for each job. +- :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently. + +If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the +other 4 run in parallel. Once a job finishes, the next job will begin executing +immediately. +See :ref:`below ` for more details on SkyPilot's scheduling behavior. + +.. tip:: + + You can also use :ref:`environment variables ` to set different arguments for each trial. + +Example: Fractional GPUs +------------------------- + +To run multiple trials per GPU, use *fractional GPUs* in the resource requirement. +For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU: + +.. code-block:: bash + + $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3 + $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3 + ... + +When sharing a GPU, ensure that the GPU's memory is not oversubscribed +(otherwise, out-of-memory errors could occur). + +.. 
_scheduling-behavior:
+
 Scheduling behavior
 --------------------------------

diff --git a/docs/source/reference/kubernetes/index.rst b/docs/source/reference/kubernetes/index.rst
index 7517c70d211..f3f950961aa 100644
--- a/docs/source/reference/kubernetes/index.rst
+++ b/docs/source/reference/kubernetes/index.rst
@@ -1,6 +1,6 @@
 .. _kubernetes-overview:

-Running on Kubernetes (Alpha)
+Running on Kubernetes
 =============================

 .. note::
diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst
index e356cca37a4..0f9ce453561 100644
--- a/docs/source/running-jobs/distributed-jobs.rst
+++ b/docs/source/running-jobs/distributed-jobs.rst
@@ -9,7 +9,7 @@ provisioning and distributed execution on many VMs.
 For example, here is a simple PyTorch Distributed training example:

 .. code-block:: yaml
-  :emphasize-lines: 6-6
+  :emphasize-lines: 6-6,21-22,24-25

   name: resnet-distributed-app

@@ -33,13 +33,19 @@ For example, here is a simple PyTorch Distributed training example:
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`

-    python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
-      --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
-      --master_port=8008 resnet_ddp.py --num_epochs 20
+    python3 -m torch.distributed.launch \
+    --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    --nnodes=$num_nodes \
+    --master_addr=$master_addr \
+    --master_port=8008 \
+    resnet_ddp.py --num_epochs 20

-In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
-nodes, with each node having 4 V100s.
+In the above:
+
+- :code:`num_nodes: 2` specifies that this task is to be run on 2 nodes, with each node having 4 V100s.
+- The highlighted lines in the ``run`` section show common environment variables that are useful for launching distributed training, explained below.

 Environment variables
 -----------------------------------------
@@ -120,4 +126,4 @@ This allows you directly to SSH into the worker nodes, if required.

    # Worker nodes.
    $ ssh mycluster-worker1
-   $ ssh mycluster-worker2
\ No newline at end of file
+   $ ssh mycluster-worker2
diff --git a/docs/source/running-jobs/grid-search.rst b/docs/source/running-jobs/grid-search.rst
deleted file mode 100644
index 4956bf6840a..00000000000
--- a/docs/source/running-jobs/grid-search.rst
+++ /dev/null
@@ -1,40 +0,0 @@
-.. _grid-search:
-
-Grid Search
-===========
-
-To submit multiple trials with different hyperparameters to a cluster:
-
-.. code-block:: bash
-
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
-  $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
-
-Options used:
-
-- :code:`--gpus`: specify the resource requirement for each job.
-- :code:`-d` / :code:`--detach`: detach the run and logging from the terminal, allowing multiple trials to run concurrently.
-
-If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the
-other 4 run in parallel. Once a job finishes, the next job will begin executing
-immediately.
-Refer to :ref:`Job Queue ` for more details on SkyPilot's scheduling behavior. 
- - -Multiple trials per GPU ------------------------ - -To run multiple trials per GPU, use *fractional GPUs* in the resource requirement. -For example, use :code:`--gpus V100:0.5` to make 2 trials share 1 GPU: - -.. code-block:: bash - - $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3 - $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3 - ... - -When sharing a GPU, ensure that the GPU's memory is not oversubscribed -(otherwise, out-of-memory errors could occur). diff --git a/docs/source/running-jobs/index.rst b/docs/source/running-jobs/index.rst index 0f7541b35d9..04c921d1022 100644 --- a/docs/source/running-jobs/index.rst +++ b/docs/source/running-jobs/index.rst @@ -4,5 +4,4 @@ More User Guides .. toctree:: distributed-jobs - grid-search environment-variables