
Commit

revamp docs
Michaelvll committed Jul 2, 2024
1 parent 907aaee commit 7fba40b
Showing 4 changed files with 102 additions and 74 deletions.
5 changes: 4 additions & 1 deletion docs/source/getting-started/quickstart.rst
Expand Up @@ -72,6 +72,9 @@ To launch a cluster and run a task, use :code:`sky launch`:

You can use the ``-c`` flag to give the cluster an easy-to-remember name. If not specified, a name is autogenerated.

If the cluster name matches an existing cluster shown in ``sky status``, that cluster will be reused
without provisioning a new one.

The ``sky launch`` command performs much of the heavy lifting:

- selects an appropriate cloud and VM based on the specified resource constraints;
Expand Down Expand Up @@ -208,7 +211,7 @@ Managed spot jobs run on much cheaper spot instances, with automatic preemption

.. code-block:: console
$ sky spot launch hello_sky.yaml
$ sky jobs launch --use-spot hello_sky.yaml
Next steps
-----------
Expand Down
Empty file removed docs/source/reference/index.rst
Empty file.
60 changes: 21 additions & 39 deletions docs/source/running-jobs/distributed-jobs.rst
Expand Up @@ -9,7 +9,7 @@ provisioning and distributed execution on many nodes.
For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6,21-22,24-25
:emphasize-lines: 6-6,21-21,23-26
name: resnet-distributed-app
Expand All @@ -31,14 +31,13 @@ For example, here is a simple PyTorch Distributed training example:
run: |
cd pytorch-distributed-resnet
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
python3 -m torch.distributed.launch \
--nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
--node_rank=${SKYPILOT_NODE_RANK} \
--nnodes=$num_nodes \
--master_addr=$master_addr \
--master_port=8008 \
MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank=$SKYPILOT_NODE_RANK \
--master_port=12375 \
resnet_ddp.py --num_epochs 20
In the above,
Expand Down Expand Up @@ -66,47 +65,32 @@ SkyPilot exposes these environment variables that can be accessed in a task's ``
the node executing the task.
- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
the task, where each line contains one IP address.

- You can retrieve the number of nodes by :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`
and the IP address of the third node by :code:`echo "$SKYPILOT_NODE_IPS" | sed -n
3p`.

- To manipulate these IP addresses, you can also store them to a file in the
:code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
- :code:`SKYPILOT_NUM_NODES`: number of nodes reserved for the task, as specified by ``num_nodes: <n>``;
equivalent to :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`.
- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).

See :ref:`env-vars` for more details.
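As a quick local sanity check, the derivations above can be exercised by simulating the variable (the ``10.0.0.x`` addresses below are fabricated; on a real cluster, SkyPilot sets these variables itself):

```shell
# Simulate the newline-separated IP list that SkyPilot exports (fake addresses).
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2
10.0.0.3"

# Number of nodes -- same value SkyPilot exposes as SKYPILOT_NUM_NODES.
num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

# Head node (rank 0) address, and the third node's address.
master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
third_ip=$(echo "$SKYPILOT_NODE_IPS" | sed -n 3p)

echo "$num_nodes nodes, head $master_addr, third $third_ip"
```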


Launching a multi-node task (new cluster)
-------------------------------------------------
Launching a multi-node task
---------------------------

When using ``sky launch`` to launch a multi-node task on **a new cluster**, the following happens in sequence:
When using ``sky launch`` to launch a multi-node task, the following happens in sequence:

1. Nodes are provisioned. (barrier)
1. The multi-node cluster is provisioned (if it does not exist) or verified (if it exists). (gang-scheduled)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on all nodes. (barrier)
4. ``run`` commands are executed on all nodes.

Launching a multi-node task (existing cluster)
-------------------------------------------------

When using ``sky launch`` to launch a multi-node task on **an existing cluster**, the cluster may have more nodes than the current task's ``num_nodes`` requirement.

The following happens in sequence:

1. SkyPilot checks that the runtime on all nodes is up-to-date. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on **all nodes** of the cluster. (barrier)
4. ``run`` commands are executed on **the subset of nodes** scheduled to execute the task, which may be fewer than the cluster size.

.. tip::

To skip rerunning the setup commands, use either ``sky launch --no-setup ...``
(performs steps 1, 2, 4 above) or ``sky exec`` (performs step 2 (workdir only)
and step 4).

Executing a task on the head node only
-----------------------------------------
--------------------------------------
To execute a task on the head node only (a common scenario for tools like
``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:

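A minimal sketch of this pattern (the guarded command is a placeholder; ``SKYPILOT_NODE_RANK`` is defaulted to ``0`` below only so the snippet can run outside a cluster):

```shell
# Default the rank for local testing; on a real cluster SkyPilot sets this.
: "${SKYPILOT_NODE_RANK:=0}"

if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    # Only the head node runs the workload (e.g., an mpirun command).
    result="head: running job"
else
    # Worker nodes simply exit.
    result="worker: idle"
fi
echo "$result"
```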
Expand Down Expand Up @@ -171,19 +155,17 @@ To execute a distributed Ray program on many nodes, you can download the `traini
run: |
sudo chmod 777 -R /var/tmp
head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1`
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
sleep 5
python train.py --num-workers $num_nodes
python train.py --num-workers $SKYPILOT_NUM_NODES
else
sleep 5
ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats
ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
fi
.. warning::
**Avoid Installing Ray in Base Environment**: Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the *base* environment. Installing a different version of Ray in the base environment can lead to abnormal cluster status.

It is highly recommended to **create a dedicated virtual environment** (as above) for Ray and its dependencies, and avoid calling `ray stop` as that will also cause issue with the cluster.
When using Ray, avoid calling ``ray stop``, as that will stop the SkyPilot runtime as well.

111 changes: 77 additions & 34 deletions docs/source/running-jobs/environment-variables.rst
Expand Up @@ -4,30 +4,64 @@
Secrets and Environment Variables
================================================

Environment variables are a powerful way to pass configuration and secrets to your tasks. There are two types of environment variables in SkyPilot:

- :ref:`User-specified environment variables <user-specified-env-vars>`: Passed by users to tasks, for secrets and arguments.
- :ref:`SkyPilot environment variables <sky-env-vars>`: Predefined by SkyPilot with information about the cluster and task.

.. _user-specified-env-vars:

User-specified environment variables
------------------------------------------------------------------

Passing user-specified environment variables is useful for passing secrets and arguments for ``file_mounts``, ``setup``, and ``run``.

You can specify environment variables to be made available to a task in two ways:

- The ``envs`` field (dict) in a :ref:`task YAML <yaml-spec>`
- The ``--env`` flag in the ``sky launch/exec`` :ref:`CLI <cli>` (takes precedence over the above)
- ``envs`` field (dict) in a :ref:`task YAML <yaml-spec>`

.. code-block:: yaml
envs:
MYVAR: val
- ``--env`` flag in ``sky launch/exec`` :ref:`CLI <cli>` (takes precedence over the above)

.. tip::

If an environment variable is required to be specified with `--env` during
``sky launch/exec``, you can set it to ``null`` in task YAML to raise an
``sky launch/exec``, you can set it to empty or ``null`` in the task YAML to raise an
error when it is not specified. For example, the ``WANDB_API_KEY``
and ``HF_TOKEN`` in the following task YAML:

.. code-block:: yaml
envs:
WANDB_API_KEY:
HF_TOKEN: null
WANDB_API_KEY:
MYVAR: val
The ``file_mounts``, ``setup``, and ``run`` sections of a task YAML can access the variables via the ``${MYVAR}`` syntax.
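For instance, a task YAML (with hypothetical values) might thread ``MYVAR`` through both stages:

```yaml
envs:
  MYVAR: val

setup: |
  echo "setting up with MYVAR=${MYVAR}"

run: |
  echo "running with MYVAR=${MYVAR}"
```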

.. _passing-secrets:

Passing secrets
~~~~~~~~~~~~~~~

We recommend passing secrets to any node(s) executing your task by first making
them available in your current shell, then using ``--env`` to pass them to SkyPilot:

.. code-block:: console
$ sky launch -c mycluster --env HF_TOKEN --env WANDB_API_KEY task.yaml
$ sky exec mycluster --env WANDB_API_KEY task.yaml
.. tip::

When a value is not specified for ``--env HF_TOKEN``, SkyPilot reads it from
the local environment variable of the same name.


Using in ``file_mounts``
~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down Expand Up @@ -77,40 +111,24 @@ For example, this is useful for passing secrets (see below) or passing configura
See complete examples at `llm/vllm/serve.yaml <https://github.com/skypilot-org/skypilot/blob/596c1415b5039adec042594f45b342374e5e6a00/llm/vllm/serve.yaml#L4-L5>`_ and `llm/vicuna/train.yaml <https://github.com/skypilot-org/skypilot/blob/596c1415b5039adec042594f45b342374e5e6a00/llm/vicuna/train.yaml#L111-L116>`_.


.. _passing-secrets:

Passing secrets
~~~~~~~~~~~~~~~~~~~~~~~~

We recommend passing secrets to any node(s) executing your task by first making
them available in your current shell, then using ``--env`` to pass them to SkyPilot:

.. code-block:: console
$ sky launch -c mycluster --env WANDB_API_KEY task.yaml
$ sky exec mycluster --env WANDB_API_KEY task.yaml
.. tip::

In other words, you do not need to pass the value directly such as ``--env
WANDB_API_KEY=1234``.




.. _sky-env-vars:

SkyPilot environment variables
------------------------------------------------------------------

SkyPilot exports these environment variables for a task's execution. ``setup``
and ``run`` stages have different environment variables available.
SkyPilot exports several predefined environment variables for a task's execution. These
are useful for frameworks that require information about the cluster or task, such as
``torch.distributed`` and OpenMPI. See examples in :ref:`dist-jobs` and :ref:`managed-jobs`.

``setup`` and ``run`` stages have different environment variables available:

Environment variables for ``setup``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. list-table::
:widths: 20 60 10
:widths: 20 40 10
:header-rows: 1

* - Name
Expand All @@ -120,9 +138,15 @@ Environment variables for ``setup``
- Rank (an integer ID from 0 to :code:`num_nodes-1`) of the node being set up.
- 0
* - ``SKYPILOT_SETUP_NODE_IPS``
- A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address. Note that this is not necessarily the same as the nodes in ``run`` stage, as the ``setup`` stage runs on all nodes of the cluster, while the ``run`` stage can run on a subset of nodes.
- 1.2.3.4
3.4.5.6
- A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address.

Note that this is not necessarily the same as the nodes in ``run`` stage, as the ``setup`` stage runs on all nodes of the cluster, while the ``run`` stage can run on a subset of nodes.
-
.. code-block:: text
1.2.3.4
3.4.5.6
* - ``SKYPILOT_NUM_NODES``
- Number of nodes in the cluster. Same value as ``$(echo "$SKYPILOT_NODE_IPS" | wc -l)``.
- 2
Expand All @@ -137,7 +161,15 @@ Environment variables for ``setup``

For managed spot jobs: sky-managed-2023-07-06-21-18-31-563597_my-job-name_1-0
* - ``SKYPILOT_CLUSTER_INFO``
- A JSON string containing information about the cluster. To access the information, you could parse the JSON string in bash ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python ``json.loads(os.environ['SKYPILOT_CLUSTER_INFO'])['cloud']``.
- A JSON string containing information about the cluster. To access the information, you can parse the JSON string in bash with ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python:

.. code-block:: python
import json
json.loads(
os.environ['SKYPILOT_CLUSTER_INFO']
)['cloud']
- {"cluster_name": "my-cluster-name", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"}
* - ``SKYPILOT_SERVE_REPLICA_ID``
- The ID of a replica within the service (starting from 1). Available only for a :ref:`service <sky-serve>`'s replica task.
Expand All @@ -151,7 +183,7 @@ Environment variables for ``run``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
:widths: 20 60 10
:widths: 20 40 10
:header-rows: 1

* - Name
Expand All @@ -162,7 +194,11 @@ Environment variables for ``run``
- 0
* - ``SKYPILOT_NODE_IPS``
- A string of IP addresses of the nodes reserved to execute the task, where each line contains one IP address. Read more :ref:`here <dist-jobs>`.
- 1.2.3.4
-
.. code-block:: text
1.2.3.4
* - ``SKYPILOT_NUM_NODES``
- Number of nodes assigned to execute the current task. Same value as ``$(echo "$SKYPILOT_NODE_IPS" | wc -l)``. Read more :ref:`here <dist-jobs>`.
- 1
Expand All @@ -182,7 +218,14 @@ Environment variables for ``run``

For managed spot jobs: sky-managed-2023-07-06-21-18-31-563597_my-job-name_1-0
* - ``SKYPILOT_CLUSTER_INFO``
- A JSON string containing information about the cluster. To access the information, you could parse the JSON string in bash ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python ``json.loads(os.environ['SKYPILOT_CLUSTER_INFO'])['cloud']``.
- A JSON string containing information about the cluster. To access the information, you can parse the JSON string in bash with ``echo $SKYPILOT_CLUSTER_INFO | jq .cloud`` or in Python:

.. code-block:: python
import json
json.loads(
os.environ['SKYPILOT_CLUSTER_INFO']
)['cloud']
- {"cluster_name": "my-cluster-name", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"}
* - ``SKYPILOT_SERVE_REPLICA_ID``
- The ID of a replica within the service (starting from 1). Available only for a :ref:`service <sky-serve>`'s replica task.
Expand Down
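The ``SKYPILOT_CLUSTER_INFO`` parsing shown above can be tried locally with a fabricated payload (on a real cluster, SkyPilot exports the variable itself):

```shell
# Fabricated payload mirroring the documented example value.
export SKYPILOT_CLUSTER_INFO='{"cluster_name": "my-cluster-name", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"}'

# Extract a field with python3 (jq works too, if installed).
cloud=$(python3 -c 'import json, os; print(json.loads(os.environ["SKYPILOT_CLUSTER_INFO"])["cloud"])')
echo "$cloud"
```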
