[Core] Add labels field to resources #3464

romilbhardwaj · 2024-04-22T23:25:33Z

Adds a labels field to resources. Follows proposal 2 from this doc, which has more context.

resources:
  labels:
    mylabel: myvalue

Semantics:

If the cluster does not already exist, it is created with the labels/instance_tags if the cloud supports them. If the cloud does not support labels, they are skipped and the cluster is launched as usual.
If the cluster already exists, no action is taken. Labels are not compared to check whether the task can run. Labels are not updated.
Task labels override global labels set in ~/.sky/config.yaml.
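
The three rules above can be sketched as one small function. This is only an illustration of the stated semantics, not the code in this PR: the helper name `labels_to_apply` and its boolean parameters are hypothetical.

```python
from typing import Dict, Optional


def labels_to_apply(cluster_exists: bool,
                    cloud_supports_labels: bool,
                    global_labels: Optional[Dict[str, str]],
                    task_labels: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Returns the labels a launch would attach (hypothetical helper)."""
    if cluster_exists:
        # Existing clusters are left untouched: no comparison, no update.
        return {}
    if not cloud_supports_labels:
        # The launch proceeds as usual, just without labels.
        return {}
    merged = dict(global_labels or {})
    # Task labels win when the same key is set in both places.
    merged.update(task_labels or {})
    return merged


print(labels_to_apply(False, True, {'team': 'infra', 'env': 'dev'},
                      {'env': 'prod'}))
```

Here `env` takes the task-level value while `team` is inherited from the global labels.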

Passes manual tests on k8s, gcp and aws.

TODOs

Add --labels to cli
Add tests in test_smoke.py
Add docs

Tested:

New smoke tests - pytest tests/test_smoke.py::test_task_labels_aws --aws, pytest tests/test_smoke.py::test_task_labels_gcp --gcp, pytest tests/test_smoke.py::test_task_labels_kubernetes --kubernetes
All smoke tests (yet to run)

romilbhardwaj · 2024-04-25T19:20:25Z

Ready for review! Running smoke tests and backcompat tests now.

Michaelvll

Thanks for adding the support for labels @romilbhardwaj! Left two comments : )

Michaelvll · 2024-04-25T19:57:28Z

sky/resources.py

+        if self.cloud is None:
+            # Because each cloud has its own label format, we cannot validate
+            # the labels without knowing the cloud.
+            with ux_utils.print_exception_no_traceback():
+                raise ValueError(
+                    'Cloud must be specified when labels are provided.')


This will prevent failover from working. Should we just apply a global label format check instead? It might be fine to sacrifice some flexibility. Otherwise, a user has to specify the following to enable all of their clouds, which might be a bit unintuitive:

resources:
  labels:
    mykey1: myvalue1
  any_of:
    - cloud: aws
    - cloud: gcp

romilbhardwaj · 2024-04-27

Supporting failover is slightly tricky...

For example, labels in Kubernetes commonly use / to create a "namespace" for tags (e.g., skypilot.co/accelerators, app.kubernetes.io/component; see recommended labels). However, GCP does not support / in labels, so a failover to GCP would cause provisioning to fail. A stricter global format check would prevent users from creating these labels at all.

Another option would be to skip validation entirely and let these checks fail at provisioning time. We could do that, but validating up front seemed cleaner, so I went with it for now. Any other ideas? Happy to change if you think we should support failover with labels.
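
To make the format mismatch concrete, here is a minimal sketch of per-cloud label checks. The function names are hypothetical and are not SkyPilot's API. The GCP pattern paraphrases the error message quoted later in this thread and is applied to both key and value as a simplification. The Kubernetes pattern is a simplified version of the documented name rule (alphanumeric at both ends, at most 63 characters, with an optional prefix before a single /).

```python
import re
from typing import Optional

# Lowercase alphanumerics, dashes and underscores, at most 63 characters.
_GCP_PART = re.compile(r'[a-z0-9_-]{0,63}')
# Alphanumeric at both ends; dashes, underscores and dots allowed inside.
_K8S_NAME = re.compile(r'[A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?')


def check_gcp_label(key: str, value: str) -> Optional[str]:
    """Returns None if the label passes the simplified GCP rule."""
    for part in (key, value):
        if not _GCP_PART.fullmatch(part):
            return f'Invalid label {key}={value} for GCP.'
    return None


def check_kubernetes_label(key: str, value: str) -> Optional[str]:
    """Returns None if the label passes the simplified Kubernetes rule."""
    prefix, _, name = key.rpartition('/')
    if '/' in prefix or not _K8S_NAME.fullmatch(name):
        return f'Invalid label key {key} for Kubernetes.'
    # Empty values are allowed; non-empty values follow the name rule.
    if value and not _K8S_NAME.fullmatch(value):
        return f'Invalid label value {value} for Kubernetes.'
    return None


key, value = 'skypilot.co/accelerators', 'v100'
print(check_kubernetes_label(key, value))
print(check_gcp_label(key, value))
```

The same label is accepted by the Kubernetes check and rejected by the GCP check, which is why the cloud has to be known before validating.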

Michaelvll · 2024-04-30

I see! I think it should be fine to require the cloud to be specified when labels are provided. Another option to support failover could be the following:

Validate the labels in the following function, and return an empty list if the labels do not meet the requirements of a specific cloud, i.e. make the current cloud infeasible:

skypilot/sky/clouds/cloud.py

Lines 323 to 356 in 431f567

def get_feasible_launchable_resources(
        self,
        resources: 'resources_lib.Resources',
        num_nodes: int = 1
) -> Tuple[List['resources_lib.Resources'], List[str]]:
    """Returns ([feasible and launchable resources], [fuzzy candidates]).

    Feasible resources refer to an offering respecting the resource
    requirements. Currently, this function implements "filtering" the
    cloud's offerings only w.r.t. accelerators constraints.

    Launchable resources require a cloud and an instance type be assigned.

    Fuzzy candidates example: when the requested GPU is A100:1 but is not
    available in a cloud/region, the fuzzy candidates are results of a fuzzy
    search in the catalog that are offered in the location. E.g.,
    ['A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100:8']
    """
    if resources.is_launchable():
        self._check_instance_type_accelerators_combination(resources)
    resources_required_features = resources.get_required_cloud_features()
    if num_nodes > 1:
        resources_required_features.add(
            CloudImplementationFeatures.MULTI_NODE)

    try:
        self.check_features_are_supported(resources,
                                          resources_required_features)
    except exceptions.NotSupportedError:
        # TODO(zhwu): The resources are now silently filtered out. We
        # should have some logging telling the user why the resources
        # are not considered.
        return ([], [])
    return self._get_feasible_launchable_resources(resources)

romilbhardwaj · 2024-04-30

Agreed, that could work. It seems a little hacky though, since the failure is now buried in a warning instead of surfacing as a clear error.

Checking in resources.py (current):

$ sky launch task.yaml
Task from YAML spec: task.yaml
ValueError: Invalid label my/label=my/value. Invalid label value my/value for GCP. Value can include lowercase alphanumeric characters, dashes, and underscores, with a total length of 63 characters or less.

Checking in get_feasible_launchable_resources with warning (last proposal):

$ sky launch task.yaml
Task from YAML spec: task.yaml
W 04-30 15:36:34 cloud.py:383] Label my/label=my/value is invalid for cloud GCP. Reason: Invalid label value my/value for GCP. Value can include lowercase alphanumeric characters, dashes, and underscores, with a total length of 63 characters or less.
I 04-30 15:36:34 optimizer.py:1209] No resource satisfying GCP({'T4': 1}) on GCP.
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task(run=<empty>)
  resources: GCP({'T4': 1}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

wdyt? I'm leaning towards keeping the current variant, but can change to checking in get_feasible_launchable_resources to support failover if you feel strongly about it.

Michaelvll · 2024-04-30

Keeping the current approach sounds good to me, but we may want to file an issue for supporting failover when labels are specified.

romilbhardwaj · 2024-05-01

Filed #3500!
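
For reference, the alternative discussed in this thread (treat a cloud as infeasible when its label check fails, and surface the reason as a warning) could look roughly like the sketch below. It is a standalone illustration with toy validators. `filter_clouds_by_labels`, `reject_slash` and `accept_all` are hypothetical names, and the real method operates on one cloud at a time and returns `([], [])` rather than a list of cloud names.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A validator returns None when the label is acceptable, else a reason string.
Validator = Callable[[str, str], Optional[str]]


def reject_slash(key: str, value: str) -> Optional[str]:
    """Toy stand-in for a cloud that forbids '/' in labels."""
    if '/' in key or '/' in value:
        return f'label {key}={value} contains "/"'
    return None


def accept_all(key: str, value: str) -> Optional[str]:
    """Toy stand-in for a cloud that accepts any label."""
    return None


def filter_clouds_by_labels(
        labels: Dict[str, str],
        validators: Dict[str, Validator]) -> Tuple[List[str], List[str]]:
    """Returns (clouds whose checks pass, warnings for skipped clouds)."""
    feasible: List[str] = []
    warnings: List[str] = []
    for cloud, validate in validators.items():
        reasons = []
        for key, value in labels.items():
            reason = validate(key, value)
            if reason is not None:
                reasons.append(reason)
        if reasons:
            # The cloud is filtered out instead of raising an error.
            warnings.append(f'Skipping {cloud}: ' + '; '.join(reasons))
        else:
            feasible.append(cloud)
    return feasible, warnings


feasible, warnings = filter_clouds_by_labels(
    {'my/label': 'my-value'},
    {'gcp': reject_slash, 'kubernetes': accept_all})
print(feasible)
print(warnings)
```

This keeps failover working across the remaining clouds, at the cost noted above: when every cloud is filtered out, the user sees a generic "no resources" error and has to read the warnings to find the label problem.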

sky/utils/schemas.py

Michaelvll

Thanks for the update @romilbhardwaj! Left several comments. It looks mostly good to me.

docs/source/reference/yaml-spec.rst

sky/backends/backend_utils.py

Michaelvll · 2024-04-30T06:05:55Z

sky/utils/schemas.py

Michaelvll

Thanks for the update @romilbhardwaj! The code looks mostly good to me. : )

sky/backends/backend_utils.py

sky/resources.py

Michaelvll · 2024-05-01T01:24:07Z

sky/resources.py

+            # Returns the base labels "updated" with the override labels.
+            labels = base_resource_config.get('labels')
+            # Merge the labels from the base and override resources.
+            if labels is not None:


Does this mean that if no global labels are specified, the local labels will not be used? Should we use the following suggestion instead?

Michaelvll · 2024-05-01T01:26:25Z

sky/resources.py

                new_resource_config = base_resource_config.copy()
                new_resource_config.update(override_config)
+
+                # Handle label merging. When any_of or ordered is specified,
+                # the labels from the base resource are updated with the labels
+                # from the override resource.
+                new_resource_config['labels'] = _merge_labels(
+                    base_resource_config, override_config)


Suggested change

-                new_resource_config = base_resource_config.copy()
-                new_resource_config.update(override_config)
-
-                # Handle label merging. When any_of or ordered is specified,
-                # the labels from the base resource are updated with the labels
-                # from the override resource.
-                new_resource_config['labels'] = _merge_labels(
-                    base_resource_config, override_config)
+                new_resource_config = base_resource_config.copy()
+                override_labels = override_config.pop('labels', {})
+                new_resource_config.update(override_config)
+                labels = new_resource_config.get('labels', {})
+                labels.update(override_labels)
+                if labels:
+                    new_resource_config['labels'] = labels
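
The merge the suggestion describes can be tried in isolation. The sketch below is not the code that was merged. `merge_resource_config` is a hypothetical name, and unlike the suggestion it copies its inputs so that neither the base nor the override dict is mutated (a shallow `.copy()` would share the nested labels dict with the base config).

```python
import copy
from typing import Any, Dict


def merge_resource_config(base: Dict[str, Any],
                          override: Dict[str, Any]) -> Dict[str, Any]:
    """Merges an any_of/ordered override into the base resource config."""
    merged = copy.deepcopy(base)
    override = copy.deepcopy(override)
    # Labels are merged key by key instead of replaced wholesale.
    override_labels = override.pop('labels', None) or {}
    merged.update(override)
    labels = dict(merged.get('labels') or {})
    labels.update(override_labels)
    if labels:
        merged['labels'] = labels
    else:
        merged.pop('labels', None)
    return merged


# Override labels are kept even when the base config has none.
print(merge_resource_config({'cpus': 4},
                            {'cloud': 'aws', 'labels': {'team': 'ml'}}))

# Override labels win on conflicts; other base labels are preserved.
base = {'labels': {'env': 'dev', 'team': 'infra'}}
print(merge_resource_config(base, {'cloud': 'gcp', 'labels': {'team': 'ml'}}))
```

The first call covers the case raised in the review question above: labels given only in the override still end up in the merged config.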

romilbhardwaj · 2024-05-01T01:33:14Z

Thanks @Michaelvll! Found a small backcompat bug, fixed now. Ran some manual backward compatibility tests:

Regular cluster launched on master, switch to branch and try exec, logs and launch again.
Spot controller launched on master with instance_tags specified in config, switch to branch and launch more spot jobs, check queue/logs

Running smoke tests now.

romilbhardwaj · 2024-05-02T06:18:26Z

Smoke tests pass, merging now!

sky/resources.py

romilbhardwaj added 9 commits April 22, 2024 15:59

add labels field

4d1df0e

wip

f71a04b

add CLI support

d6df6a9

tests and validation

e3b01d1

add docs

6af38bd

revert label changes in cli

356f4e4

lint

5406190

Merge branch 'master' of https://github.com/skypilot-org/skypilot int…

36d0936

…o resources_labels # Conflicts: # sky/resources.py # sky/utils/schemas.py

add docs

d4fda89

romilbhardwaj marked this pull request as ready for review April 25, 2024 19:03

romilbhardwaj requested a review from Michaelvll April 25, 2024 19:03

Michaelvll reviewed Apr 25, 2024

disallow labels in any_of and ordered

eba82b9

Michaelvll reviewed Apr 30, 2024

comments, deprecate instance_tag and allow labels in any_of and ordered

4bc4b25

romilbhardwaj mentioned this pull request May 1, 2024

[Core] Support cross-cloud failover when labels are provided #3500

Closed

romilbhardwaj added 2 commits April 30, 2024 18:12

Merge branch 'master' of https://github.com/skypilot-org/skypilot int…

9524e6c

…o resources_labels # Conflicts: # sky/templates/aws-ray.yml.j2 # sky/templates/gcp-ray.yml.j2

Fix back compat

72307b6

Michaelvll approved these changes May 1, 2024

romilbhardwaj added 5 commits May 1, 2024 12:10

batch errors

e7fa0a1

comments

ebef725

fixes

2e29e87

lint

e62eb4e

allow self._labels to be none

62620b3

romilbhardwaj merged commit f256b53 into master May 2, 2024
20 checks passed

romilbhardwaj deleted the resources_labels branch May 2, 2024 06:19

Michaelvll reviewed May 2, 2024
