-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Add labels field to resources #3464
Conversation
…o resources_labels # Conflicts: # sky/resources.py # sky/utils/schemas.py
Ready for review! Running smoke tests and backcompat tests now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the support for labels @romilbhardwaj! Left two comments : )
if self.cloud is None: | ||
# Because each cloud has its own label format, we cannot validate | ||
# the labels without knowing the cloud. | ||
with ux_utils.print_exception_no_traceback(): | ||
raise ValueError( | ||
'Cloud must be specified when labels are provided.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause failover unable to work. Should we just apply a global label format check instead? It might be fine to sacrifice some flexibility. Otherwise, a user have to specify the following to enable all the clouds they have which might be a bit unintuitive:
resources:
labels:
mykey1: myvalue1
any_of:
- cloud: aws
- cloud: gcp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supporting failover is slightly tricky....
For example, labels in Kubernetes commonly use /
to create a "namespace" for tags (e.g., skypilot.co/accelerators
, app.kubernetes.io/component
.. see recommended labels). However, GCP does not support /
in labels, and a failover would cause provisioning to fail. Putting a stricter global format check would prevent users from creating these labels at all.
Another option could be to not do any validation at all and let these checks fail at provisioning time. We could do that, but this seemed cleaner so went with this for now. Any other ideas? Happy to change if you think we should support failover with labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! I think it should be fine to require the cloud
to be presented when the labels
is specified. Another option to support failover could be the following:
Validate the labels in the following function, and return an empty list if the labels do not match the requirement for a specific cloud, i.e. make the current cloud infeasible:
Lines 323 to 356 in 431f567
def get_feasible_launchable_resources( | |
self, | |
resources: 'resources_lib.Resources', | |
num_nodes: int = 1 | |
) -> Tuple[List['resources_lib.Resources'], List[str]]: | |
"""Returns ([feasible and launchable resources], [fuzzy candidates]). | |
Feasible resources refer to an offering respecting the resource | |
requirements. Currently, this function implements "filtering" the | |
cloud's offerings only w.r.t. accelerators constraints. | |
Launchable resources require a cloud and an instance type be assigned. | |
Fuzzy candidates example: when the requested GPU is A100:1 but is not | |
available in a cloud/region, the fuzzy candidates are results of a fuzzy | |
search in the catalog that are offered in the location. E.g., | |
['A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100:8'] | |
""" | |
if resources.is_launchable(): | |
self._check_instance_type_accelerators_combination(resources) | |
resources_required_features = resources.get_required_cloud_features() | |
if num_nodes > 1: | |
resources_required_features.add( | |
CloudImplementationFeatures.MULTI_NODE) | |
try: | |
self.check_features_are_supported(resources, | |
resources_required_features) | |
except exceptions.NotSupportedError: | |
# TODO(zhwu): The resources are now silently filtered out. We | |
# should have some logging telling the user why the resources | |
# are not considered. | |
return ([], []) | |
return self._get_feasible_launchable_resources(resources) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, that could work. Seems a little hacky though, since the error now becomes hidden in the warning instead of being a clear error.
Checking in resources.py (current):
sky launch task.yaml
Task from YAML spec: task.yaml
ValueError: Invalid label my/label=my/value. Invalid label value my/value for GCP. Value can include lowercase alphanumeric characters, dashes, and underscores, with a total length of 63 characters or less.
Checking in get_feasible_launchable_resources with warning (last proposal):
$ sky launch task.yaml
Task from YAML spec: task.yaml
W 04-30 15:36:34 cloud.py:383] Label my/label=my/value is invalid for cloud GCP. Reason: Invalid label value my/value for GCP. Value can include lowercase alphanumeric characters, dashes, and underscores, with a total length of 63 characters or less.
I 04-30 15:36:34 optimizer.py:1209] No resource satisfying GCP({'T4': 1}) on GCP.
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task(run=<empty>)
resources: GCP({'T4': 1}).
To fix: relax or change the resource requirements.
Hint: sky show-gpus to list available accelerators.
sky check to check the enabled clouds.
wdyt? I'm leaning towards keeping the current variant, but can change to checking in get_feasible_launchable_resources
to support failover if you feel strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the current way sounds good to me, but we may want to file an issue for it for supporting failover when the labels are specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #3500!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updating @romilbhardwaj! Left several comments. It looks mostly good to me.
if self.cloud is None: | ||
# Because each cloud has its own label format, we cannot validate | ||
# the labels without knowing the cloud. | ||
with ux_utils.print_exception_no_traceback(): | ||
raise ValueError( | ||
'Cloud must be specified when labels are provided.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! I think it should be fine to require the cloud
to be presented when the labels
is specified. Another option to support failover could be the following:
Validate the labels in the following function, and return an empty list if the labels do not match the requirement for a specific cloud, i.e. make the current cloud infeasible:
Lines 323 to 356 in 431f567
def get_feasible_launchable_resources( | |
self, | |
resources: 'resources_lib.Resources', | |
num_nodes: int = 1 | |
) -> Tuple[List['resources_lib.Resources'], List[str]]: | |
"""Returns ([feasible and launchable resources], [fuzzy candidates]). | |
Feasible resources refer to an offering respecting the resource | |
requirements. Currently, this function implements "filtering" the | |
cloud's offerings only w.r.t. accelerators constraints. | |
Launchable resources require a cloud and an instance type be assigned. | |
Fuzzy candidates example: when the requested GPU is A100:1 but is not | |
available in a cloud/region, the fuzzy candidates are results of a fuzzy | |
search in the catalog that are offered in the location. E.g., | |
['A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100:8'] | |
""" | |
if resources.is_launchable(): | |
self._check_instance_type_accelerators_combination(resources) | |
resources_required_features = resources.get_required_cloud_features() | |
if num_nodes > 1: | |
resources_required_features.add( | |
CloudImplementationFeatures.MULTI_NODE) | |
try: | |
self.check_features_are_supported(resources, | |
resources_required_features) | |
except exceptions.NotSupportedError: | |
# TODO(zhwu): The resources are now silently filtered out. We | |
# should have some logging telling the user why the resources | |
# are not considered. | |
return ([], []) | |
return self._get_feasible_launchable_resources(resources) |
…o resources_labels # Conflicts: # sky/templates/aws-ray.yml.j2 # sky/templates/gcp-ray.yml.j2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @romilbhardwaj! The code looks mostly good to me. : )
sky/resources.py
Outdated
# Returns the base labels "updated" with the override labels. | ||
labels = base_resource_config.get('labels') | ||
# Merge the labels from the base and override resources. | ||
if labels is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that if there is no global labels specified, the local labels will not be used? Should we do the following suggestion instead?
sky/resources.py
Outdated
new_resource_config = base_resource_config.copy() | ||
new_resource_config.update(override_config) | ||
|
||
# Handle label merging. When any_of or ordered is specified, | ||
# the labels from the base resource are updated with the labels | ||
# from the override resource. | ||
new_resource_config['labels'] = _merge_labels( | ||
base_resource_config, override_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new_resource_config = base_resource_config.copy() | |
new_resource_config.update(override_config) | |
# Handle label merging. When any_of or ordered is specified, | |
# the labels from the base resource are updated with the labels | |
# from the override resource. | |
new_resource_config['labels'] = _merge_labels( | |
base_resource_config, override_config) | |
new_resource_config = base_resource_config.copy() | |
override_labels = override_config.pop('labels', {}) | |
new_resource_config.update(override_config) | |
labels = new_resource_config.get('labels', {}) | |
labels.update(override_labels) | |
if labels: | |
new_resource_config['labels'] = labels |
Thanks @Michaelvll! Found a small backcompat bug, fixed now. Ran some manual backward compatibility tests:
Running smoke tests now. |
Smoke tests pass, merging now! |
Adds a
labels
field to resources. Follows proposal 2 from this doc which has more context.Semantics:
Passes manual tests on k8s, gcp and aws.
TODOs
--labels
to clitest_smoke.py
Tested:
pytest tests/test_smoke.py::test_task_labels_aws --aws
,pytest tests/test_smoke.py::test_task_labels_gcp --gcp
,pytest tests/test_smoke.py::test_task_labels_kubernetes --kubernetes