From c281c0dea3439ab6bf474c166008c7b596260c4f Mon Sep 17 00:00:00 2001 From: MysteryManav Date: Sun, 3 Mar 2024 16:30:17 +0530 Subject: [PATCH 01/11] Docs: Add warning against installing Ray in base environment --- docs/source/running-jobs/distributed-jobs.rst | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index b9a5a9f01eb..1fa966ebe89 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -137,3 +137,14 @@ This allows you directly to SSH into the worker nodes, if required. # Worker nodes. $ ssh mycluster-worker1 $ ssh mycluster-worker2 + + +Executing a Distributed Ray Program +------------------------------------ +.. warning:: + **Avoid Installing Ray in Base Environment** + + Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the base environment. Installing a different version of Ray in the base environment can lead to compatibility issues, conflicts, and unintended consequences. + + To maintain a clean and stable environment for your distributed Ray program, it is highly recommended to **create a dedicated virtual environment** for Ray and its dependencies. This helps isolate the Ray installation and prevents interference with other packages in your base environment. 
+ From bf601dbbcf62f137acd8f3bbd22e14d41aa996a4 Mon Sep 17 00:00:00 2001 From: MysteryManav Date: Sat, 9 Mar 2024 09:49:34 +0530 Subject: [PATCH 02/11] Docs: Add warning against installing Ray in base environment --- docs/source/running-jobs/distributed-jobs.rst | 34 +++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 1fa966ebe89..d772b805683 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -141,6 +141,40 @@ This allows you directly to SSH into the worker nodes, if required. Executing a Distributed Ray Program ------------------------------------ +To execute a distributed Ray program on many VMs, you can use the following example: + +.. code-block:: console + + $ sky launch ray_train.yaml + +.. code-block:: yaml + :emphasize-lines: 6-6,21-22,24-25 + + resources: + accelerators: L4:2 + memory: 64+ + + num_nodes: 2 + + setup: | + pip install "ray[train]" + pip install tqdm + pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 + + run: | + sudo chmod 777 -R /var/tmp + head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1` + num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l` + ray start --head --disable-usage-stats --port 6379 + if [ "$SKYPILOT_NODE_RANK" == "0" ]; then + ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379 + sleep 5 + python train.py --num-workers $num_nodes + else + sleep 5 + ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats + fi + .. 
warning:: **Avoid Installing Ray in Base Environment** From 8f6f7ae619463bb7caad21d2d61dd5b99e6c8037 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:31:03 +0530 Subject: [PATCH 03/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index d772b805683..1148272d4e9 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -148,7 +148,6 @@ To execute a distributed Ray program on many VMs, you can use the following exam $ sky launch ray_train.yaml .. code-block:: yaml - :emphasize-lines: 6-6,21-22,24-25 resources: accelerators: L4:2 From 1f3b63f459657d53d54612eb225a537d0011f91a Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:31:54 +0530 Subject: [PATCH 04/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 1148272d4e9..53d08affe92 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -141,7 +141,7 @@ This allows you directly to SSH into the worker nodes, if required. Executing a Distributed Ray Program ------------------------------------ -To execute a distributed Ray program on many VMs, you can use the following example: +To execute a distributed Ray program on many VMs, you can download the `training script `_ and launch the `task yaml `_: .. 
code-block:: console From a132beb7caca12d2cb0b2ae94a0c5bb87d858632 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:32:02 +0530 Subject: [PATCH 05/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 53d08affe92..0032f82918f 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -145,6 +145,7 @@ To execute a distributed Ray program on many VMs, you can download the `training .. code-block:: console + $ wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/distributed_ray_train/train.py $ sky launch ray_train.yaml .. code-block:: yaml From 0b77d56d63421a2cc25e4ee1127b0f060fd3c31b Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:32:10 +0530 Subject: [PATCH 06/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 0032f82918f..e1c364d8c67 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -155,7 +155,9 @@ To execute a distributed Ray program on many VMs, you can download the `training memory: 64+ num_nodes: 2 - + + workdir: . 
+ setup: | pip install "ray[train]" pip install tqdm From 306c7f36ba8672629afe5b61e5b638d5f64c9667 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:32:17 +0530 Subject: [PATCH 07/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index e1c364d8c67..c5d821a615f 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -159,6 +159,12 @@ To execute a distributed Ray program on many VMs, you can download the `training workdir: . setup: | + conda activate ray + if [ $? -ne 0 ]; then + conda create -n ray python=3.10 -y + conda activate ray + fi + pip install "ray[train]" pip install tqdm pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 From 4cf16b060bb43e0f1d57b40e746dcc9660b8e7d7 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:32:25 +0530 Subject: [PATCH 08/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index c5d821a615f..d2855554ca5 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -184,9 +184,7 @@ To execute a distributed Ray program on many VMs, you can download the `training fi .. 
warning:: - **Avoid Installing Ray in Base Environment** + **Avoid Installing Ray in Base Environment**: Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the *base* environment. Installing a different version of Ray in the base environment can lead to abnormal cluster status. - Before proceeding with the execution of a distributed Ray program, it is crucial to ensure that Ray is **not** installed in the base environment. Installing a different version of Ray in the base environment can lead to compatibility issues, conflicts, and unintended consequences. - - To maintain a clean and stable environment for your distributed Ray program, it is highly recommended to **create a dedicated virtual environment** for Ray and its dependencies. This helps isolate the Ray installation and prevents interference with other packages in your base environment. + It is highly recommended to **create a dedicated virtual environment** (as above) for Ray and its dependencies, and avoid calling ``ray stop`` as that will also cause issues with the cluster. From 86596e4293f3b70f995fe40649bac4bf15706697 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:32:43 +0530 Subject: [PATCH 09/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index d2855554ca5..960c73f2214 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -165,6 +165,12 @@ To execute a distributed Ray program on many VMs, you can download the `training conda activate ray fi + conda activate ray if [ $?
-ne 0 ]; then + conda create -n ray python=3.10 -y + conda activate ray + fi + pip install "ray[train]" pip install tqdm pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 From 7d6973545db444ac7fd338513f1219949c179b65 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Sun, 10 Mar 2024 15:43:43 +0530 Subject: [PATCH 10/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 960c73f2214..2d6aa938ce9 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -179,7 +179,6 @@ To execute a distributed Ray program on many VMs, you can download the `training sudo chmod 777 -R /var/tmp head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1` num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l` - ray start --head --disable-usage-stats --port 6379 if [ "$SKYPILOT_NODE_RANK" == "0" ]; then ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379 sleep 5 From 683f10319ea2d490729ff82fc4f359070ae0d023 Mon Sep 17 00:00:00 2001 From: Madhur Prajapati <96643023+MysteryManav@users.noreply.github.com> Date: Thu, 14 Mar 2024 12:24:50 +0530 Subject: [PATCH 11/11] Update docs/source/running-jobs/distributed-jobs.rst Co-authored-by: Zhanghao Wu --- docs/source/running-jobs/distributed-jobs.rst | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst index 2d6aa938ce9..fb20b7ca988 100644 --- a/docs/source/running-jobs/distributed-jobs.rst +++ b/docs/source/running-jobs/distributed-jobs.rst @@ -165,12 +165,6 @@ To execute a distributed Ray program on many VMs, you can download the `training conda activate 
ray fi - conda activate ray - if [ $? -ne 0 ]; then - conda create -n ray python=3.10 -y - conda activate ray - fi - pip install "ray[train]" pip install tqdm pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
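
Reviewer note: the head/worker rendezvous that the final ``run`` section relies on can be sanity-checked locally with a plain shell sketch. The IP list and rank below are illustrative stand-ins for the values SkyPilot injects, and ``ray`` itself is not invoked:

```shell
#!/usr/bin/env bash
# Illustrative stand-ins for the environment variables SkyPilot injects
# at runtime; real values come from the launched cluster.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"
SKYPILOT_NODE_RANK=0

# Same derivation as the run section: the first IP in the newline-separated
# list is the head node, and the line count is the number of workers.
head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)

if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
  echo "rank 0 (head $head_ip): start the Ray head on :6379, then run train.py --num-workers $num_nodes"
else
  echo "worker: join the cluster at $head_ip:6379"
fi
```

The real task additionally guards each ``ray start`` with a ``ps aux | grep`` check, so a restarted job does not attempt to start a second Ray instance on the same port.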