Docs: Add warning against installing Ray in base environment #3267

Merged: 11 commits, Mar 14, 2024
docs/source/running-jobs/distributed-jobs.rst (50 additions, 0 deletions)

@@ -137,3 +137,53 @@ This allows you to SSH directly into the worker nodes, if required.
   # Worker nodes.
   $ ssh mycluster-worker1
   $ ssh mycluster-worker2


Executing a Distributed Ray Program
------------------------------------

Member:
For discussion: we may want to add a dedicated guide on using Ray as part of the user job; that page could include a tip like this. A full YAML example (#3195) may help promote this recommendation more directly.

Collaborator:
Yes, let's add a YAML example of starting a distributed job with Ray, as mentioned above. We can move it to a separate page after this PR is merged.

Contributor (author):
To confirm: I should add the command to run the distributed_ray_train example (the one that uses FashionMNIST), plus the YAML file and its output, as an example in the documentation?

Collaborator:
Yes, we can add the YAML file inline, along with the command to run it :)

To execute a distributed Ray program on many VMs, you can download the `training script <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/train.py>`_ and launch the `task yaml <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed_ray_train/ray_train.yaml>`_:

.. code-block:: console

   $ wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/distributed_ray_train/train.py
   $ sky launch ray_train.yaml
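
If you prefer to download the task YAML rather than copy the inline version below, the raw URL follows the same pattern as the training script (inferred from the repository layout, not quoted from this page):

.. code-block:: console

   $ wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/distributed_ray_train/ray_train.yaml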

.. code-block:: yaml

   resources:
     accelerators: L4:2
     memory: 64+

   num_nodes: 2

   workdir: .

   setup: |
     # Reuse the 'ray' env if it exists; otherwise create it.
     # Never install Ray into the base environment (see the warning below).
     conda activate ray
     if [ $? -ne 0 ]; then
       conda create -n ray python=3.10 -y
       conda activate ray
     fi

     pip install "ray[train]"
     pip install tqdm
     pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

   run: |
     sudo chmod 777 -R /var/tmp
     head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
       # Head node: start the Ray head (if not already running), then launch training.
       ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
       sleep 5
       python train.py --num-workers $num_nodes
     else
       # Worker nodes: give the head a moment to come up, then join it.
       sleep 5
       ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats
     fi
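
Once the job is running, you can optionally verify that both nodes joined the Ray cluster started by the job. This is an illustrative check, not part of the example; it assumes the cluster was launched with ``sky launch -c mycluster ray_train.yaml``:

.. code-block:: console

   $ ssh mycluster
   $ conda activate ray
   $ ray status   # should list one head and one worker node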

.. warning::

    **Avoid installing Ray in the base environment**: Before running a distributed Ray program, make sure Ray is **not** installed in the *base* environment. SkyPilot itself relies on Ray internally, so installing a different version of Ray in the base environment can leave the cluster in an abnormal state.

    It is highly recommended to **create a dedicated virtual environment** for Ray and its dependencies (as in the ``setup`` section above), and to avoid calling ``ray stop``, which would likewise disrupt the cluster.
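
To confirm that the base environment is clean before launching (an illustrative check, not part of the original docs):

.. code-block:: console

   $ conda deactivate                                # return to the base environment
   $ python -c "import ray; print(ray.__version__)"  # should fail with ModuleNotFoundError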
