From 8674520859c856d83d5796010b28cef2a3ede46e Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Thu, 4 Jul 2024 10:58:06 -0700 Subject: [PATCH] [Docs] Revamp docs for secrets and distributed jobs (#3715) * Add docs for secrets and env var * add * revamp docs * partially * Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zongheng Yang * revert multi-node instruction * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * Update docs/source/getting-started/quickstart.rst Co-authored-by: Zongheng Yang * address * Rephrase * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang * fix * move * Update docs/source/running-jobs/environment-variables.rst Co-authored-by: Zongheng Yang --------- Co-authored-by: Zongheng Yang --- docs/source/docs/index.rst | 4 +- docs/source/getting-started/quickstart.rst | 4 +- docs/source/running-jobs/distributed-jobs.rst | 44 +++--- .../running-jobs/environment-variables.rst | 129 ++++++++++++------ docs/source/running-jobs/index.rst | 7 - 5 files changed, 109 insertions(+), 79 deletions(-) delete mode 100644 docs/source/running-jobs/index.rst diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index 47c98d7bef7..5a648dbcda4 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -126,7 +126,7 @@ 
Contents ../reference/job-queue ../examples/auto-failover ../reference/kubernetes/index - ../running-jobs/index + ../running-jobs/distributed-jobs .. toctree:: :maxdepth: 1 @@ -155,12 +155,14 @@ Contents :maxdepth: 1 :caption: User Guides + ../running-jobs/environment-variables ../examples/docker-containers ../examples/ports ../reference/tpu ../reference/logging ../reference/faq + .. toctree:: :maxdepth: 1 :caption: Developer Guides diff --git a/docs/source/getting-started/quickstart.rst b/docs/source/getting-started/quickstart.rst index bb281087736..bfc6fd17e05 100644 --- a/docs/source/getting-started/quickstart.rst +++ b/docs/source/getting-started/quickstart.rst @@ -72,6 +72,8 @@ To launch a cluster and run a task, use :code:`sky launch`: You can use the ``-c`` flag to give the cluster an easy-to-remember name. If not specified, a name is autogenerated. + If the cluster name is an existing cluster shown in ``sky status``, the cluster will be reused. + The ``sky launch`` command performs much heavy-lifting: - selects an appropriate cloud and VM based on the specified resource constraints; @@ -208,7 +210,7 @@ Managed spot jobs run on much cheaper spot instances, with automatic preemption .. code-block:: console - $ sky spot launch hello_sky.yaml + $ sky jobs launch --use-spot hello_sky.yaml Next steps ----------- diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index fb20b7ca988..9eb590c10bc 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -1,15 +1,15 @@ .. _dist-jobs: -Distributed Jobs on Many VMs +Distributed Jobs on Many Nodes ================================================ SkyPilot supports multi-node cluster -provisioning and distributed execution on many VMs. +provisioning and distributed execution on many nodes. For example, here is a simple PyTorch Distributed training example: .. 
code-block:: yaml - :emphasize-lines: 6-6,21-22,24-25 + :emphasize-lines: 6-6,21-21,23-26 name: resnet-distributed-app @@ -31,14 +31,13 @@ For example, here is a simple PyTorch Distributed training example: run: | cd pytorch-distributed-resnet - num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l` - master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1` - python3 -m torch.distributed.launch \ - --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \ - --node_rank=${SKYPILOT_NODE_RANK} \ - --nnodes=$num_nodes \ - --master_addr=$master_addr \ - --master_port=8008 \ + MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1` + torchrun \ + --nnodes=$SKYPILOT_NUM_NODES \ + --master_addr=$MASTER_ADDR \ + --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \ + --node_rank=$SKYPILOT_NODE_RANK \ + --master_port=12375 \ resnet_ddp.py --num_epochs 20 In the above, @@ -66,16 +65,11 @@ SkyPilot exposes these environment variables that can be accessed in a task's `` the node executing the task. - :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute the task, where each line contains one IP address. - - - You can retrieve the number of nodes by :code:`echo "$SKYPILOT_NODE_IPS" | wc -l` - and the IP address of the third node by :code:`echo "$SKYPILOT_NODE_IPS" | sed -n - 3p`. - - - To manipulate these IP addresses, you can also store them to a file in the - :code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`. +- :code:`SKYPILOT_NUM_NODES`: number of nodes reserved for the task, which can be specified by ``num_nodes: ``. Same value as :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`. - :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the task; the same as the count in ``accelerators: :`` (rounded up if a fraction). +See :ref:`sky-env-vars` for more details. Launching a multi-node task (new cluster) ------------------------------------------------- @@ -106,7 +100,7 @@ The following happens in sequence: and step 4). 
Executing a task on the head node only ------------------------------------------ +-------------------------------------- To execute a task on the head node only (a common scenario for tools like ``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows: @@ -141,7 +135,7 @@ This allows you directly to SSH into the worker nodes, if required. Executing a Distributed Ray Program ------------------------------------ -To execute a distributed Ray program on many VMs, you can download the `training script `_ and launch the `task yaml `_: +To execute a distributed Ray program on many nodes, you can download the `training script `_ and launch the `task yaml `_: .. code-block:: console @@ -171,19 +165,17 @@ To execute a distributed Ray program on many VMs, you can download the `training run: | sudo chmod 777 -R /var/tmp - head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1` - num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l` + HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1` if [ "$SKYPILOT_NODE_RANK" == "0" ]; then ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379 sleep 5 - python train.py --num-workers $num_nodes + python train.py --num-workers $SKYPILOT_NUM_NODES else sleep 5 - ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats + ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats fi .. warning:: - **Avoid Installing Ray in Base Environment**: Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the *base* environment. Installing a different version of Ray in the base environment can lead to abnormal cluster status. - It is highly recommended to **create a dedicated virtual environment** (as above) for Ray and its dependencies, and avoid calling `ray stop` as that will also cause issue with the cluster. 
+ When using Ray, avoid calling ``ray stop`` as that will also cause the SkyPilot runtime to be stopped. diff --git a/docs/source/running-jobs/environment-variables.rst b/docs/source/running-jobs/environment-variables.rst index 7f91720f9b5..f7138af95fa 100644 --- a/docs/source/running-jobs/environment-variables.rst +++ b/docs/source/running-jobs/environment-variables.rst @@ -1,23 +1,38 @@ .. _env-vars: -Using Environment Variables +Secrets and Environment Variables ================================================ +Environment variables are a powerful way to pass configuration and secrets to your tasks. There are two types of environment variables in SkyPilot: + +- :ref:`User-specified environment variables `: Passed by users to tasks, useful for secrets and configurations. +- :ref:`SkyPilot environment variables `: Predefined by SkyPilot with information about the current cluster and task. + +.. _user-specified-env-vars: + User-specified environment variables ------------------------------------------------------------------ +User-specified environment variables are useful for passing secrets and any arguments or configurations needed for your tasks. They are made available in ``file_mounts``, ``setup``, and ``run``. + You can specify environment variables to be made available to a task in two ways: -- The ``envs`` field (dict) in a :ref:`task YAML ` -- The ``--env`` flag in the ``sky launch/exec`` :ref:`CLI ` (takes precedence over the above) +- ``envs`` field (dict) in a :ref:`task YAML `: + + .. code-block:: yaml + + envs: + MYVAR: val + +- ``--env`` flag in ``sky launch/exec`` :ref:`CLI ` (takes precedence over the above) .. tip:: - If an environment variable is required to be specified with `--env` during - ``sky launch/exec``, you can set it to ``null`` in task YAML to raise an - error when it is forgotten to be specified. 
For example, the ``WANDB_API_KEY`` - and ``HF_TOKEN`` in the following task YAML: + To mark an environment variable as required and make SkyPilot forcefully check + its existence (errors out if not specified), set it to an empty string or + ``null`` in the task YAML. For example, ``WANDB_API_KEY`` and ``HF_TOKEN`` in + the following task YAML are marked as required: .. code-block:: yaml @@ -28,6 +43,26 @@ You can specify environment variables to be made available to a task in two ways The ``file_mounts``, ``setup``, and ``run`` sections of a task YAML can access the variables via the ``${MYVAR}`` syntax. +.. _passing-secrets: + +Passing secrets +~~~~~~~~~~~~~~~ + +We recommend passing secrets to any node(s) executing your task by first making +it available in your current shell, then using ``--env SECRET`` to pass it to SkyPilot: + +.. code-block:: console + + $ sky launch -c mycluster --env HF_TOKEN --env WANDB_API_KEY task.yaml + $ sky exec mycluster --env WANDB_API_KEY task.yaml + +.. tip:: + + You do not need to pass the value directly such as ``--env + WANDB_API_KEY=1234``. When the value is not specified (e.g., ``--env WANDB_API_KEY``), + SkyPilot reads it from local environment variables. + + Using in ``file_mounts`` ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -77,40 +112,29 @@ For example, this is useful for passing secrets (see below) or passing configura See complete examples at `llm/vllm/serve.yaml `_ and `llm/vicuna/train.yaml `_. -.. _passing-secrets: - -Passing secrets -~~~~~~~~~~~~~~~~~~~~~~~~ - -We recommend passing secrets to any node(s) executing your task by first making -it available in your current shell, then using ``--env`` to pass it to SkyPilot: - -.. code-block:: console - - $ sky launch -c mycluster --env WANDB_API_KEY task.yaml - $ sky exec mycluster --env WANDB_API_KEY task.yaml - -.. tip:: - - In other words, you do not need to pass the value directly such as ``--env - WANDB_API_KEY=1234``. +.. 
_sky-env-vars: +SkyPilot environment variables +------------------------------------------------------------------ +SkyPilot exports several predefined environment variables made available during a task's execution. These variables contain information about the current cluster or task, which can be useful for distributed frameworks such as +torch.distributed, OpenMPI, etc. See examples in :ref:`dist-jobs` and :ref:`managed-jobs`. +The values of these variables are filled in by SkyPilot at task execution time. +You can access these variables in the following ways: -SkyPilot environment variables ------------------------------------------------------------------- +* In the task YAML's ``setup``/``run`` commands (a Bash script), access them using the ``${MYVAR}`` syntax; +* In the program(s) launched in ``setup``/``run``, access them using the language's standard method (e.g., ``os.environ`` for Python). -SkyPilot exports these environment variables for a task's execution. ``setup`` -and ``run`` stages have different environment variables available. +The ``setup`` and ``run`` stages can access different sets of SkyPilot environment variables: Environment variables for ``setup`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 20 60 10 + :widths: 20 40 10 :header-rows: 1 * - Name @@ -120,9 +144,15 @@ Environment variables for ``setup`` - Rank (an integer ID from 0 to :code:`num_nodes-1`) of the node being set up. - 0 * - ``SKYPILOT_SETUP_NODE_IPS`` - - A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address. Note that this is not necessarily the same as the nodes in ``run`` stage, as the ``setup`` stage runs on all nodes of the cluster, while the ``run`` stage can run on a subset of nodes. - - 1.2.3.4 - 3.4.5.6 + - A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address. 
+ + Note that this is not necessarily the same as the nodes in ``run`` stage: the ``setup`` stage runs on all nodes of the cluster, while the ``run`` stage can run on a subset of nodes. + - + .. code-block:: text + + 1.2.3.4 + 3.4.5.6 + * - ``SKYPILOT_NUM_NODES`` - Number of nodes in the cluster. Same value as ``$(echo "$SKYPILOT_NODE_IPS" | wc -l)``. - 2 @@ -137,7 +167,15 @@ Environment variables for ``setup`` For managed spot jobs: sky-managed-2023-07-06-21-18-31-563597_my-job-name_1-0 * - ``SKYPILOT_CLUSTER_INFO`` - - A JSON string containing information about the cluster. To access the information, you could parse the JSON string in bash ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python ``json.loads(os.environ['SKYPILOT_CLUSTER_INFO'])['cloud']``. + - A JSON string containing information about the cluster. To access the information, you could parse the JSON string in bash ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python: + + .. code-block:: python + + import json, os + json.loads( + os.environ['SKYPILOT_CLUSTER_INFO'] + )['cloud'] + - {"cluster_name": "my-cluster-name", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"} * - ``SKYPILOT_SERVE_REPLICA_ID`` - The ID of a replica within the service (starting from 1). Available only for a :ref:`service `'s replica task. @@ -151,7 +189,7 @@ Environment variables for ``run`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: - :widths: 20 60 10 + :widths: 20 40 10 :header-rows: 1 * - Name @@ -162,7 +200,11 @@ Environment variables for ``run`` - 0 * - ``SKYPILOT_NODE_IPS`` - A string of IP addresses of the nodes reserved to execute the task, where each line contains one IP address. Read more :ref:`here `. - - 1.2.3.4 + - + .. code-block:: text + + 1.2.3.4 + * - ``SKYPILOT_NUM_NODES`` - Number of nodes assigned to execute the current task. Same value as ``$(echo "$SKYPILOT_NODE_IPS" | wc -l)``. Read more :ref:`here `. 
- 1 @@ -182,16 +224,15 @@ Environment variables for ``run`` For managed spot jobs: sky-managed-2023-07-06-21-18-31-563597_my-job-name_1-0 * - ``SKYPILOT_CLUSTER_INFO`` - - A JSON string containing information about the cluster. To access the information, you could parse the JSON string in bash ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python ``json.loads(os.environ['SKYPILOT_CLUSTER_INFO'])['cloud']``. + - A JSON string containing information about the cluster. To access the information, you could parse the JSON string in bash ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python: + + .. code-block:: python + + import json, os + json.loads( + os.environ['SKYPILOT_CLUSTER_INFO'] + )['cloud'] - {"cluster_name": "my-cluster-name", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"} * - ``SKYPILOT_SERVE_REPLICA_ID`` - The ID of a replica within the service (starting from 1). Available only for a :ref:`service `'s replica task. - 1 - -The values of these variables are filled in by SkyPilot at task execution time. - -You can access these variables in the following ways: - -* In the task YAML's ``setup``/``run`` commands (a Bash script), access them using the ``${MYVAR}`` syntax; -* In the program(s) launched in ``setup``/``run``, access them using the - language's standard method (e.g., ``os.environ`` for Python). diff --git a/docs/source/running-jobs/index.rst b/docs/source/running-jobs/index.rst deleted file mode 100644 index 04c921d1022..00000000000 --- a/docs/source/running-jobs/index.rst +++ /dev/null @@ -1,7 +0,0 @@ -More User Guides -================================================ - -.. toctree:: - - distributed-jobs - environment-variables
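
Note (not part of the patch): the docs changed above say that programs launched in ``run`` can read the SkyPilot variables through the language's standard mechanism, such as ``os.environ`` in Python. The following is a minimal sketch of that idea, not code from the SkyPilot repository. The helper name ``read_skypilot_env`` and the keys of the returned dict are hypothetical; only the environment variable names and their formats (one IP per line, a JSON string with a ``cloud`` field) are taken from the documentation in this patch.

```python
import json
import os


def read_skypilot_env(environ=os.environ):
    """Hypothetical helper: gather the SkyPilot variables documented above."""
    # SKYPILOT_NODE_IPS holds one IP address per line, ordered by node rank.
    node_ips = environ.get("SKYPILOT_NODE_IPS", "").split()
    # SKYPILOT_NUM_NODES is documented as equal to the number of lines in
    # SKYPILOT_NODE_IPS; fall back to counting the IPs if it is unset.
    num_nodes = int(environ.get("SKYPILOT_NUM_NODES", len(node_ips)))
    node_rank = int(environ.get("SKYPILOT_NODE_RANK", 0))
    # SKYPILOT_CLUSTER_INFO is a JSON string, e.g. {"cloud": "GCP", ...}.
    cluster_info = json.loads(environ.get("SKYPILOT_CLUSTER_INFO", "{}"))
    return {
        "node_ips": node_ips,
        "num_nodes": num_nodes,
        "node_rank": node_rank,
        # The first IP belongs to the head node (rank 0).
        "head_ip": node_ips[0] if node_ips else None,
        "is_head": node_rank == 0,
        "cloud": cluster_info.get("cloud"),
    }


if __name__ == "__main__":
    print(read_skypilot_env())
```

Passing a plain dict instead of ``os.environ`` makes the helper easy to try outside a cluster; outside a SkyPilot task the variables are unset and the defaults above apply.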