[Core] Add SKYPILOT_NUM_NODES env var #3656

Merged
merged 10 commits into from
Jun 13, 2024
9 changes: 8 additions & 1 deletion docs/source/running-jobs/environment-variables.rst
@@ -120,8 +120,12 @@ Environment variables for ``setup``
- Rank (an integer ID from 0 to :code:`num_nodes-1`) of the node being set up.
- 0
* - ``SKYPILOT_SETUP_NODE_IPS``
- A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address.
- A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address. Note that these are not necessarily the same nodes as in the ``run`` stage: the ``setup`` stage runs on all nodes of the cluster, while the ``run`` stage can run on a subset of nodes.
- 1.2.3.4
3.4.5.6
* - ``SKYPILOT_NUM_NODES``
- Number of nodes in the cluster, which should be the same as ``$(echo "$SKYPILOT_NODE_IPS" | wc -l)``.
- 2
* - ``SKYPILOT_TASK_ID``
- A unique ID assigned to each task.

@@ -159,6 +163,9 @@ Environment variables for ``run``
* - ``SKYPILOT_NODE_IPS``
- A string of IP addresses of the nodes reserved to execute the task, where each line contains one IP address. Read more :ref:`here <dist-jobs>`.
- 1.2.3.4
* - ``SKYPILOT_NUM_NODES``
- Number of nodes used for the execution of the current task, which should be the same as ``$(echo "$SKYPILOT_NODE_IPS" | wc -l)``. Read more :ref:`here <dist-jobs>`.
- 1
* - ``SKYPILOT_NUM_GPUS_PER_NODE``
- Number of GPUs reserved on each node to execute the task; the same as the
count in ``accelerators: <name>:<count>`` (rounded up if a fraction). Read
Expand Down
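The documented relationship between ``SKYPILOT_NUM_NODES`` and ``SKYPILOT_NODE_IPS`` can be checked from inside a task. The following is a minimal sketch, not part of this PR: the helper names `node_count_from_ips` and `check_num_nodes` are hypothetical, and it assumes only what the docs above state, namely that ``SKYPILOT_NODE_IPS`` holds one IP address per line.

```python
import os
from typing import Mapping, Optional


def node_count_from_ips(node_ips: str) -> int:
    """Count the non-empty lines of a newline-separated IP list."""
    return len([line for line in node_ips.splitlines() if line.strip()])


def check_num_nodes(environ: Optional[Mapping[str, str]] = None) -> int:
    """Return SKYPILOT_NUM_NODES after checking it against SKYPILOT_NODE_IPS.

    Hypothetical helper: raises ValueError if the two variables disagree.
    """
    if environ is None:
        environ = os.environ
    node_ips = environ.get('SKYPILOT_NODE_IPS', '')
    num_nodes = int(environ.get('SKYPILOT_NUM_NODES', '0'))
    expected = node_count_from_ips(node_ips)
    if num_nodes != expected:
        raise ValueError(
            f'SKYPILOT_NUM_NODES={num_nodes}, but SKYPILOT_NODE_IPS '
            f'lists {expected} address(es).')
    return num_nodes
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise outside a cluster.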
18 changes: 12 additions & 6 deletions sky/backends/cloud_vm_ray_backend.py
@@ -409,6 +409,11 @@ def add_gang_scheduling_placement_group_and_setup(

job_id = self.job_id
if setup_cmd is not None:
if env_vars is None:
setup_envs = None
else:
setup_envs = env_vars.copy()
setup_envs[constants.SKYPILOT_NUM_NODES] = str(num_nodes)
self._code += [
textwrap.dedent(f"""\
setup_cmd = {setup_cmd!r}
@@ -438,7 +443,7 @@ def add_gang_scheduling_placement_group_and_setup(
.remote(
setup_cmd,
os.path.expanduser({setup_log_path!r}),
env_vars={env_vars!r},
env_vars={setup_envs!r},
stream_logs=True,
with_ray=True,
) for i in range(total_num_nodes)]
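The `setup_envs` handling in the hunks above copies `env_vars` before adding the node count, so the caller's dictionary is left unchanged. A standalone sketch of that pattern follows; the function name `build_setup_envs` and the local constant `NUM_NODES_KEY` are hypothetical stand-ins for the inline code and for `constants.SKYPILOT_NUM_NODES`.

```python
from typing import Dict, Optional

# Stand-in for constants.SKYPILOT_NUM_NODES (assumption for this sketch).
NUM_NODES_KEY = 'SKYPILOT_NUM_NODES'


def build_setup_envs(env_vars: Optional[Dict[str, str]],
                     num_nodes: int) -> Optional[Dict[str, str]]:
    """Return a copy of env_vars with the node count added.

    Mirrors the diff as shown: when env_vars is None, None is returned
    and no node count is attached.
    """
    if env_vars is None:
        return None
    setup_envs = env_vars.copy()  # shallow copy; the caller's dict is untouched
    setup_envs[NUM_NODES_KEY] = str(num_nodes)  # env values are strings
    return setup_envs
```

A shallow copy is sufficient here because the values are plain strings.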
@@ -549,11 +554,12 @@ def add_ray_task(self,
f'placement_group_bundle_index={gang_scheduling_id})')

sky_env_vars_dict_str = [
textwrap.dedent("""\
sky_env_vars_dict = {}
sky_env_vars_dict['SKYPILOT_NODE_IPS'] = job_ip_list_str
textwrap.dedent(f"""\
sky_env_vars_dict = {{}}
sky_env_vars_dict['{constants.SKYPILOT_NODE_IPS}'] = job_ip_list_str
# Environment starting with `SKY_` is deprecated.
sky_env_vars_dict['SKY_NODE_IPS'] = job_ip_list_str
sky_env_vars_dict['{constants.SKYPILOT_NUM_NODES}'] = len(job_ip_rank_list)
""")
]
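Switching the template from a plain string to an f-string is why `{}` becomes `{{}}` in the hunk above: doubled braces are emitted as literal braces, while single braces interpolate the constant names. The sketch below illustrates this in isolation. The module-level constants stand in for `sky/skylet/constants.py`, and the shape of `job_ip_rank_list` (a list of `(ip, rank)` pairs) is an assumption made only for this example.

```python
import textwrap

# Stand-ins for the constants added in sky/skylet/constants.py.
SKYPILOT_NODE_IPS = 'SKYPILOT_NODE_IPS'
SKYPILOT_NUM_NODES = 'SKYPILOT_NUM_NODES'


def generate_env_snippet() -> str:
    """Generate Python source that fills sky_env_vars_dict."""
    return textwrap.dedent(f"""\
        sky_env_vars_dict = {{}}
        sky_env_vars_dict['{SKYPILOT_NODE_IPS}'] = job_ip_list_str
        sky_env_vars_dict['{SKYPILOT_NUM_NODES}'] = len(job_ip_rank_list)
        """)


snippet = generate_env_snippet()

# Execute the generated source with illustrative inputs.
namespace = {
    'job_ip_list_str': '1.2.3.4\n3.4.5.6',
    'job_ip_rank_list': [('1.2.3.4', 0), ('3.4.5.6', 1)],  # assumed shape
}
exec(snippet, namespace)
```

After `exec`, `namespace['sky_env_vars_dict']` holds both keys, and the generated source contains a literal `{}` rather than an f-string placeholder.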

@@ -574,7 +580,7 @@ def add_ray_task(self,


if script is not None:
sky_env_vars_dict['SKYPILOT_NUM_GPUS_PER_NODE'] = {int(math.ceil(num_gpus))!r}
sky_env_vars_dict['{constants.SKYPILOT_NUM_GPUS_PER_NODE}'] = {int(math.ceil(num_gpus))!r}
# Environment starting with `SKY_` is deprecated.
sky_env_vars_dict['SKY_NUM_GPUS_PER_NODE'] = {int(math.ceil(num_gpus))!r}

@@ -592,7 +598,7 @@ def add_ray_task(self,
node_name = f'worker{{idx_in_cluster}}'
name_str = f'{{node_name}}, rank={{rank}},'
log_path = os.path.expanduser(os.path.join({log_dir!r}, f'{{rank}}-{{node_name}}.log'))
sky_env_vars_dict['SKYPILOT_NODE_RANK'] = rank
sky_env_vars_dict['{constants.SKYPILOT_NODE_RANK}'] = rank
# Environment starting with `SKY_` is deprecated.
sky_env_vars_dict['SKY_NODE_RANK'] = rank

6 changes: 6 additions & 0 deletions sky/skylet/constants.py
@@ -225,3 +225,9 @@
# Serve: A default controller with 4 vCPU and 16 GB memory can run up to 16
# services.
CONTROLLER_PROCESS_CPU_DEMAND = 0.25

# SkyPilot environment variables
SKYPILOT_NUM_NODES = 'SKYPILOT_NUM_NODES'
SKYPILOT_NODE_IPS = 'SKYPILOT_NODE_IPS'
SKYPILOT_NUM_GPUS_PER_NODE = 'SKYPILOT_NUM_GPUS_PER_NODE'
SKYPILOT_NODE_RANK = 'SKYPILOT_NODE_RANK'