
[Core] Install SkyPilot runtime in separate env #3575

Merged
41 commits merged on May 23, 2024

Commits
5b82fd7
Quote the command correctly when source_bashrc is not set
Michaelvll May 21, 2024
9822e44
Remove unnecessary source bashrc
Michaelvll May 21, 2024
4b045ea
format
Michaelvll May 21, 2024
1e2576e
Fix setup script for conda
Michaelvll May 21, 2024
f25f623
Add comment
Michaelvll May 21, 2024
68f7ebd
format
Michaelvll May 21, 2024
f2dd3d8
Separate env for skypilot
Michaelvll May 21, 2024
01216cd
add test smoke
Michaelvll May 21, 2024
a6f6996
add system site-packages
Michaelvll May 21, 2024
bc89396
add test for default to non-base conda env
Michaelvll May 21, 2024
713fed7
Fix controllers and ray node providers
Michaelvll May 22, 2024
75f7833
move activate to maybe_skylet
Michaelvll May 22, 2024
cd85d42
Make axolotl example work for kubernetes
Michaelvll May 22, 2024
d98dbd4
fix axolotl
Michaelvll May 22, 2024
daf7461
add test for 3.12
Michaelvll May 22, 2024
833d735
format
Michaelvll May 22, 2024
44bcb6b
Fix docker PATH
Michaelvll May 22, 2024
abb0f8d
format
Michaelvll May 22, 2024
b996cf9
add axolotl image in test
Michaelvll May 22, 2024
6bf68fc
address comments
Michaelvll May 22, 2024
4794318
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
Michaelvll May 22, 2024
9866874
Merge branch 'skypilot-runtime-env' of https://github.com/skypilot-or…
Michaelvll May 22, 2024
7ad8c34
revert grpcio version as it is only installed in our runtime env
Michaelvll May 22, 2024
536c7ad
refactor command for env set up
Michaelvll May 22, 2024
7133cbc
switch to curl as CentOS may not have wget installed but have curl
Michaelvll May 22, 2024
006670c
add l4 in command
Michaelvll May 22, 2024
7673777
fix dependency for test
Michaelvll May 22, 2024
32db638
fix python path for ray executable
Michaelvll May 23, 2024
4358afb
Fix azure launch
Michaelvll May 23, 2024
16e63b5
add comments
Michaelvll May 23, 2024
1a105b2
fix test
Michaelvll May 23, 2024
ce2e1e5
fix smoke
Michaelvll May 23, 2024
349a3ca
fix name
Michaelvll May 23, 2024
0d0638f
fix
Michaelvll May 23, 2024
b19c8f5
fix usage
Michaelvll May 23, 2024
1cbafeb
fix usage for accelerators
Michaelvll May 23, 2024
994c4fc
fix event
Michaelvll May 23, 2024
5ec9c1f
fix name
Michaelvll May 23, 2024
647c59e
fix
Michaelvll May 23, 2024
886dd54
Merge branch 'skypilot-runtime-env' of github.com:skypilot-org/skypil…
Michaelvll May 23, 2024
71ba100
address comments
Michaelvll May 23, 2024
35 changes: 35 additions & 0 deletions llm/axolotl/axolotl-docker.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Usage:
# HF_TOKEN=abc sky launch -c axolotl axolotl-docker.yaml --env HF_TOKEN -y -i30 --down

name: axolotl

resources:
accelerators: L4:1
cloud: gcp # optional

workdir: mistral

setup: |
docker pull winglian/axolotl:main-py3.10-cu118-2.0.1

run: |
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
huggingface-cli login --token ${HF_TOKEN}

docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml

envs:
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.






18 changes: 4 additions & 14 deletions llm/axolotl/axolotl-spot.yaml
@@ -12,6 +12,7 @@ resources:
accelerators: A100:1
cloud: gcp # optional
use_spot: True
image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1

workdir: mistral

@@ -20,22 +21,10 @@ file_mounts:
name: ${BUCKET}
mode: MOUNT

setup: |
docker pull winglian/axolotl:main-py3.10-cu118-2.0.1

run: |
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
huggingface-cli login --token ${HF_TOKEN}
huggingface-cli login --token ${HF_TOKEN}

docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
-v /sky-notebook:/sky-notebook \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
accelerate launch -m axolotl.cli.train /sky_workdir/qlora-checkpoint.yaml
accelerate launch -m axolotl.cli.train qlora-checkpoint.yaml

envs:
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
@@ -46,3 +35,4 @@ envs:



19 changes: 4 additions & 15 deletions llm/axolotl/axolotl.yaml
Member: We should update Axolotl's GitHub readme with this, maybe after 0.6.

Collaborator (author): Yes, will update it in a separate PR.

@@ -5,25 +5,14 @@ name: axolotl

resources:
accelerators: L4:1
cloud: gcp # optional
image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1

workdir: mistral

setup: |
docker pull winglian/axolotl:main-py3.10-cu118-2.0.1

run: |
docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
huggingface-cli login --token ${HF_TOKEN}

docker run --gpus all \
-v ~/sky_workdir:/sky_workdir \
-v /root/.cache:/root/.cache \
winglian/axolotl:main-py3.10-cu118-2.0.1 \
accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml
huggingface-cli login --token ${HF_TOKEN}

accelerate launch -m axolotl.cli.train qlora.yaml

envs:
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
3 changes: 2 additions & 1 deletion llm/axolotl/mistral/qlora-checkpoint.yaml
@@ -71,6 +71,7 @@ warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
save_steps: 2 ## increase based on your dataset
save_strategy: steps
debug:
@@ -81,4 +82,4 @@ fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
unk_token: "<unk>"
3 changes: 2 additions & 1 deletion llm/axolotl/mistral/qlora.yaml
@@ -69,6 +69,7 @@ warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
save_steps:
debug:
deepspeed:
@@ -78,4 +79,4 @@ fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
unk_token: "<unk>"
9 changes: 8 additions & 1 deletion sky/backends/backend_utils.py
@@ -925,7 +925,14 @@ def write_cluster_config(
'dump_port_command': dump_port_command,
# Sky-internal constants.
'sky_ray_cmd': constants.SKY_RAY_CMD,
'sky_pip_cmd': constants.SKY_PIP_CMD,
# pip install needs to have python env activated to make sure
# installed packages are within the env path.
'sky_pip_cmd': f'{constants.SKY_PIP_CMD}',
# Activate the SkyPilot runtime environment when starting the ray
# cluster, so that the ray autoscaler can access cloud SDKs and CLIs
# on the remote machine.
'sky_activate_python_env':
constants.ACTIVATE_SKY_REMOTE_PYTHON_ENV,
'ray_version': constants.SKY_REMOTE_RAY_VERSION,
# Command for waiting ray cluster to be ready on head.
'ray_head_wait_initialized_command':
1 change: 0 additions & 1 deletion sky/jobs/core.py
@@ -98,7 +98,6 @@ def launch(
'dag_name': dag.name,
'retry_until_up': retry_until_up,
'remote_user_config_path': remote_user_config_path,
'sky_python_cmd': skylet_constants.SKY_PYTHON_CMD,
'modified_catalogs':
service_catalog_common.get_modified_catalog_file_mounts(),
**controller_utils.shared_controller_vars_to_fill(
16 changes: 15 additions & 1 deletion sky/provision/docker_utils.py
@@ -15,6 +15,17 @@
DOCKER_PERMISSION_DENIED_STR = ('permission denied while trying to connect to '
'the Docker daemon socket')

# Configure environment variables. A docker image can have environment variables
# set in the Dockerfile with `ENV`. We need to export these variables to the
# shell environment, so that our ssh session can access them.
SETUP_ENV_VARS_CMD = (
'prefix_cmd() '
'{ if [ $(id -u) -ne 0 ]; then echo "sudo"; else echo ""; fi; } && '
'printenv | while IFS=\'=\' read -r key value; do echo "export $key=\\\"$value\\\""; done > ' # pylint: disable=line-too-long
'~/container_env_var.sh && '
'$(prefix_cmd) mv ~/container_env_var.sh /etc/profile.d/container_env_var.sh'
)
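The command builds a `source`-able snapshot of the container's environment. A minimal sketch of the same pipeline, writing to a temp file instead of `/etc/profile.d/` (which needs root); `MY_IMAGE_VAR` is a made-up stand-in for a variable baked in with Dockerfile `ENV`:

```shell
# Snapshot the current environment as a series of `export` lines, as
# SETUP_ENV_VARS_CMD does inside the container.
export MY_IMAGE_VAR="from-dockerfile-env"
printenv | while IFS='=' read -r key value; do
  # `read` splits on the first '='; the remainder (including any
  # further '=' characters) stays in $value.
  echo "export $key=\"$value\""
done > /tmp/container_env_var.sh
# A later login shell sourcing this file sees the Dockerfile ENV vars:
grep "^export MY_IMAGE_VAR=" /tmp/container_env_var.sh
```

One caveat of this approach: values containing newlines are split across iterations and would produce broken `export` lines, which is acceptable here since Dockerfile `ENV` values are typically simple strings.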


@dataclasses.dataclass
class DockerLoginConfig:
@@ -244,6 +255,8 @@ def initialize(self) -> str:
self._run(start_command)

# SkyPilot: Setup Commands.
# TODO(zhwu): the following setups should be aligned with the kubernetes
# pod setup, like provision.kubernetes.instance::_set_env_vars_in_pods
# TODO(tian): These setup commands assumed that the container is
# debian-based. We should make it more general.
# Most of docker images are using root as default user, so we set an
@@ -296,7 +309,8 @@ def initialize(self) -> str:
'mkdir -p ~/.ssh;'
'cat /tmp/host_ssh_authorized_keys >> ~/.ssh/authorized_keys;'
'sudo service ssh start;'
'sudo sed -i "s/mesg n/tty -s \&\& mesg n/" ~/.profile;',
'sudo sed -i "s/mesg n/tty -s \&\& mesg n/" ~/.profile;'
f'{SETUP_ENV_VARS_CMD}',
run_env='docker')

# SkyPilot: End of Setup Commands.
5 changes: 4 additions & 1 deletion sky/provision/instance_setup.py
Original file line number Diff line number Diff line change
@@ -61,7 +61,10 @@
'done;')

# Restart skylet when the version does not match to keep the skylet up-to-date.
MAYBE_SKYLET_RESTART_CMD = (f'{constants.SKY_PYTHON_CMD} -m '
# We need to activate the python environment to make sure autostop in skylet
# can find the cloud SDK/CLI in PATH.
MAYBE_SKYLET_RESTART_CMD = (f'{constants.ACTIVATE_SKY_REMOTE_PYTHON_ENV}; '
f'{constants.SKY_PYTHON_CMD} -m '
'sky.skylet.attempt_skylet;')


13 changes: 5 additions & 8 deletions sky/provision/kubernetes/instance.py
@@ -10,6 +10,7 @@
from sky import status_lib
from sky.adaptors import kubernetes
from sky.provision import common
from sky.provision import docker_utils
from sky.provision.kubernetes import config as config_lib
from sky.provision.kubernetes import utils as kubernetes_utils
from sky.utils import common_utils
@@ -241,7 +242,7 @@ def _wait_for_pods_to_run(namespace, new_nodes):
'the node. Error details: '
f'{container_status.state.waiting.message}.')
# Reaching this point means that one of the pods had an issue,
# so break out of the loop
# so break out of the loop, and wait until next second.
break

if all_pods_running:
@@ -301,13 +302,7 @@ def _set_env_vars_in_pods(namespace: str, new_pods: List):
set_k8s_env_var_cmd = [
'/bin/sh',
'-c',
(
'prefix_cmd() '
'{ if [ $(id -u) -ne 0 ]; then echo "sudo"; else echo ""; fi; } && '
'printenv | while IFS=\'=\' read -r key value; do echo "export $key=\\\"$value\\\""; done > ' # pylint: disable=line-too-long
'~/k8s_env_var.sh && '
'mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh || '
'$(prefix_cmd) mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh')
docker_utils.SETUP_ENV_VARS_CMD,
]

for new_pod in new_pods:
@@ -540,6 +535,8 @@ def _create_pods(region: str, cluster_name_on_cloud: str,
_wait_for_pods_to_schedule(namespace, wait_pods, provision_timeout)
# Wait until the pods and their containers are up and running, and
# fail early if there is an error
logger.debug(f'run_instances: waiting for pods to be running (pulling '
f'images): {list(wait_pods_dict.keys())}')
_wait_for_pods_to_run(namespace, wait_pods)
logger.debug(f'run_instances: all pods are scheduled and running: '
f'{list(wait_pods_dict.keys())}')
2 changes: 2 additions & 0 deletions sky/skylet/attempt_skylet.py
@@ -21,6 +21,8 @@ def restart_skylet():
shell=True,
check=False)
subprocess.run(
# Activate python environment first to make sure skylet can find the
# cloud SDK for autostopping.
f'nohup {constants.SKY_PYTHON_CMD} -m sky.skylet.skylet'
' >> ~/.sky/skylet.log 2>&1 &',
shell=True,
39 changes: 30 additions & 9 deletions sky/skylet/constants.py
@@ -37,8 +37,18 @@
SKY_PYTHON_CMD = f'$({SKY_GET_PYTHON_PATH_CMD})'
SKY_PIP_CMD = f'{SKY_PYTHON_CMD} -m pip'
# Ray executable, e.g., /opt/conda/bin/ray
SKY_RAY_CMD = (f'$([ -s {SKY_RAY_PATH_FILE} ] && '
# We need to add SKY_PYTHON_CMD before the ray executable because:
# The ray executable is a python script with a header like:
# #!/opt/conda/bin/python3
# When we create the skypilot-runtime venv, the previously installed ray
# executable will be reused (due to --system-site-packages), and that will cause
# running ray CLI commands to use the wrong python executable.
SKY_RAY_CMD = (f'{SKY_PYTHON_CMD} $([ -s {SKY_RAY_PATH_FILE} ] && '
f'cat {SKY_RAY_PATH_FILE} 2> /dev/null || which ray)')
# Separate env for SkyPilot runtime dependencies.
SKY_REMOTE_PYTHON_ENV_NAME = 'skypilot-runtime'
SKY_REMOTE_PYTHON_ENV = f'~/{SKY_REMOTE_PYTHON_ENV_NAME}'
ACTIVATE_SKY_REMOTE_PYTHON_ENV = f'source {SKY_REMOTE_PYTHON_ENV}/bin/activate'

# The name for the environment variable that stores the unique ID of the
# current task. This will stay the same across multiple recoveries of the
@@ -91,20 +101,29 @@
# AWS's Deep Learning AMI's default conda environment.
CONDA_INSTALLATION_COMMANDS = (
'which conda > /dev/null 2>&1 || '
'{ wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && ' # pylint: disable=line-too-long
'{ curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh -o Miniconda3-Linux-x86_64.sh && ' # pylint: disable=line-too-long
'bash Miniconda3-Linux-x86_64.sh -b && '
'eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && '
'conda config --set auto_activate_base true && '
# Use $(echo ~) instead of ~ to avoid the error "no such file or directory".
# Also, not using $HOME to avoid the error HOME variable not set.
f'echo "$(echo ~)/miniconda3/bin/python" > {SKY_PYTHON_PATH_FILE}; }}; '
f'conda activate base; }}; '
'grep "# >>> conda initialize >>>" ~/.bashrc || '
'{ conda init && source ~/.bashrc; };'
'(type -a python | grep -q python3) || '
'echo \'alias python=python3\' >> ~/.bashrc;'
'(type -a pip | grep -q pip3) || echo \'alias pip=pip3\' >> ~/.bashrc;'
# Writes Python path to file if it does not exist or the file is empty.
f'[ -s {SKY_PYTHON_PATH_FILE} ] || which python3 > {SKY_PYTHON_PATH_FILE};')
# If the Python version is greater than or equal to 3.12, create a new
# conda env with Python 3.10.
# We don't use a separate conda env for SkyPilot dependencies because it is
# costly to create a new conda env, and venv should be a lightweight and
# faster alternative when the python version satisfies the requirement.
'[[ $(python3 --version | cut -d " " -f 2 | cut -d "." -f 2) -ge 12 ]] && '
f'echo "Creating conda env with Python 3.10" && '
f'conda create -y -n {SKY_REMOTE_PYTHON_ENV_NAME} python=3.10 && '
f'conda activate {SKY_REMOTE_PYTHON_ENV_NAME};'
# Create a separate conda environment for SkyPilot dependencies.
f'[ -d {SKY_REMOTE_PYTHON_ENV} ] || '
f'{{ {SKY_PYTHON_CMD} -m venv {SKY_REMOTE_PYTHON_ENV} --system-site-packages && '
f'echo "$(echo {SKY_REMOTE_PYTHON_ENV})/bin/python" > {SKY_PYTHON_PATH_FILE}; }};'
)
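The minor-version probe in the command above can be exercised standalone; `version_str` here is a made-up stand-in for the output of `python3 --version`:

```shell
# Decide between a fresh conda env (Python >= 3.12) and a lightweight
# venv, using the same cut pipeline as CONDA_INSTALLATION_COMMANDS.
version_str="Python 3.12.1"
# "Python 3.12.1" -> "3.12.1" -> "12"
minor=$(echo "$version_str" | cut -d " " -f 2 | cut -d "." -f 2)
if [ "$minor" -ge 12 ]; then
  echo "create conda env with Python 3.10"
else
  echo "reuse current python via venv --system-site-packages"
fi
# -> create conda env with Python 3.10
```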

_sky_version = str(version.parse(sky.__version__))
RAY_STATUS = f'RAY_ADDRESS=127.0.0.1:{SKY_REMOTE_RAY_PORT} {SKY_RAY_CMD} status'
@@ -142,7 +161,9 @@
# mentioned above are resolved.
'export PATH=$PATH:$HOME/.local/bin; '
# Writes ray path to file if it does not exist or the file is empty.
f'[ -s {SKY_RAY_PATH_FILE} ] || which ray > {SKY_RAY_PATH_FILE}; '
f'[ -s {SKY_RAY_PATH_FILE} ] || '
f'{{ {ACTIVATE_SKY_REMOTE_PYTHON_ENV} && '
f'which ray > {SKY_RAY_PATH_FILE} || exit 1; }}; '
# END ray package check and installation
f'{{ {SKY_PIP_CMD} list | grep "skypilot " && '
'[ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "{sky_wheel_hash}" ]; } || ' # pylint: disable=line-too-long
15 changes: 13 additions & 2 deletions sky/skylet/events.py
@@ -3,7 +3,6 @@
import os
import re
import subprocess
import sys
import time
import traceback

@@ -193,7 +192,10 @@ def _stop_cluster(self, autostop_config):
# Passing env inherited from os.environ is technically not
# needed, because we call `python <script>` rather than `ray
# <cmd>`. We just need the {RAY_USAGE_STATS_ENABLED: 0} part.
subprocess.run([sys.executable, script], check=True, env=env)
subprocess.run(f'{constants.SKY_PYTHON_CMD} {script}',
check=True,
shell=True,
env=env)

logger.info('Running ray down.')
# Stop the workers first to avoid orphan workers.
@@ -206,6 +208,15 @@
# <cmd>`.
env=env)

# Stop the ray autoscaler to avoid scaling up during
# stopping/terminating of the cluster. We do not rely on `ray down`
# below for stopping the ray cluster, as it will not use the correct
# ray path.
logger.info('Stopping the ray cluster.')
subprocess.run(f'{constants.SKY_RAY_CMD} stop',
shell=True,
check=True)

logger.info('Running final ray down.')
subprocess.run(
f'{constants.SKY_RAY_CMD} down -y {config_path}',
4 changes: 2 additions & 2 deletions sky/templates/azure-ray.yml.j2
@@ -164,14 +164,14 @@
# current num items (num SSH connections): 2
head_start_ray_commands:
# NOTE: --disable-usage-stats in `ray start` saves 10 seconds of idle wait.
- {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --head --port={{ray_port}} --dashboard-port={{ray_dashboard_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--num-gpus=%s" % num_gpus if num_gpus}} {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
- {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --head --port={{ray_port}} --dashboard-port={{ray_dashboard_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--num-gpus=%s" % num_gpus if num_gpus}} {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{{dump_port_command}};
{{ray_head_wait_initialized_command}}

{%- if num_nodes > 1 %}
worker_start_ray_commands:
- {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --address=$RAY_HEAD_IP:{{ray_port}} --object-manager-port=8076 {{"--num-gpus=%s" % num_gpus if num_gpus}} {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
- {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --address=$RAY_HEAD_IP:{{ray_port}} --object-manager-port=8076 {{"--num-gpus=%s" % num_gpus if num_gpus}} {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{%- else %}
worker_start_ray_commands: []
4 changes: 2 additions & 2 deletions sky/templates/ibm-ray.yml.j2
@@ -118,13 +118,13 @@
# NOTE: --disable-usage-stats in `ray start` saves 10 seconds of idle wait.
# Line "which prlimit ..": increase the limit of the number of open files for the raylet process, as the `ulimit` may not take effect at this point, because it requires
# all the sessions to be reloaded. This is a workaround.
- {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --head --port={{ray_port}} --dashboard-port={{ray_dashboard_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
- {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --head --port={{ray_port}} --dashboard-port={{ray_dashboard_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{{dump_port_command}}; {{ray_head_wait_initialized_command}}

{%- if num_nodes > 1 %}
worker_start_ray_commands:
- {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --address=$RAY_HEAD_IP:{{ray_port}} --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
- {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --address=$RAY_HEAD_IP:{{ray_port}} --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{%- else %}
worker_start_ray_commands: []