[Core] Support inherit base resources with `any_of` and `ordered` in `resources` field
#2833
@@ -278,8 +278,15 @@ def get_controller_resources(
controller_type=controller_type,
err=common_utils.format_exception(e,
use_bracket=True))) from e

return controller_resources
if len(controller_resources) != 1:
Why do we limit controller_resources to 1?
It seems that SkyServe requires a single candidate of controller resources, e.g.:
Lines 105 to 115 in 99f04b1
controller_cloud = (
    requested_resources.cloud if not controller_exist and
    controller_resources.cloud is None else controller_resources.cloud)
# TODO(tian): Probably run another sky.launch after we get the load
# balancer port from the controller? So we don't need to open so many
# ports here. Or, we should have a nginx traffic control to refuse
# any connection to the unregistered ports.
controller_resources = controller_resources.copy(
    cloud=controller_cloud,
    ports=[serve_constants.LOAD_BALANCER_PORT_RANGE])
controller_task.set_resources(controller_resources)
I think it might be fine to have a stricter constraint for the controller resources at the moment.
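The single-candidate constraint described above could be sketched roughly as follows. This is only an illustration and not the PR's code: `get_single_controller_resources` is a hypothetical helper, candidates are modeled as frozensets of key/value pairs instead of `sky.Resources` objects, and raising `ValueError` is an assumption.

```python
from typing import Any, Dict, FrozenSet, Set, Tuple

# Hypothetical stand-in for one resources candidate (not sky.Resources).
Candidate = FrozenSet[Tuple[str, Any]]


def get_single_controller_resources(
        candidates: Set[Candidate]) -> Dict[str, Any]:
    """Return the only candidate, or raise if there are zero or several."""
    if len(candidates) != 1:
        # Assumption: the real code may use a different error type/message.
        raise ValueError(
            'Expected exactly one controller resources candidate, '
            f'got {len(candidates)}.')
    (only,) = candidates  # Unpack the single element of the set.
    return dict(only)
```

With exactly one candidate, downstream code such as the `controller_resources.copy(...)` call quoted above can treat the result as a single object without choosing among alternatives.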
sounds good.
I am wondering if the fields under
Yes, my current thinking is to have the field under
@Michaelvll Okay, makes sense.
LGTM after fixing pylint.
Thanks for adding this feature!! Sorry for missing it. Left several things to discuss 🫡
Just did another pass and left several nits and one discussion. If we decide to keep `accelerators: {L4:1, T4:1}` as a shortcut, for backward compatibility, or for any other reason, then this version of the code looks great to me!
Thanks for the fix! I tried the following on the latest commit:
>>> import sky
>>> sky.Resources.from_yaml_config({'use_spot': True, 'accelerators': {'A100-40GB:1': None, 'T4:1': None, 'V100:1': None}, 'any_of': [{'cloud': 'aws'}, {'cloud': 'gcp'}]})
ValueError: Invalid resources YAML: {'A100-40GB:1': None, 'T4:1': None, 'V100:1': None} is not valid under any of the given schemas. Check problematic field(s): $.accelerators
>>> sky.Resources.from_yaml_config({'use_spot': True, 'accelerators': 'A100-40GB:1', 'any_of': [{'cloud': 'aws'}, {'cloud': 'gcp'}]})
AssertionError: Invalid resource args: dict_keys(['any_of'])
Is this expected? Correct me if I'm wrong, but I suppose the Python API should also:
- allow a set of accelerators;
- allow any_of.
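For readers following the discussion, the inheritance being tested here can be sketched on plain dicts. This is not SkyPilot's implementation: `expand_resources_config` is a hypothetical helper, and the rule that an entry's own fields override the base fields is an assumption made for illustration.

```python
from typing import Any, Dict, List


def expand_resources_config(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Merge top-level (base) fields into each any_of/ordered entry."""
    # Everything except the two special keys is treated as the base config.
    base = {k: v for k, v in config.items() if k not in ('any_of', 'ordered')}
    overrides = config.get('any_of') or config.get('ordered')
    if not overrides:
        return [base]
    # Assumption: fields in an entry take precedence over the base fields.
    return [{**base, **override} for override in overrides]


candidates = expand_resources_config({
    'use_spot': True,
    'accelerators': 'A100-40GB:1',
    'any_of': [{'cloud': 'aws'}, {'cloud': 'gcp'}],
})
for candidate in candidates:
    print(candidate)
```

The sketch returns a list in both cases for simplicity; the signature quoted later in this thread returns a set for unordered candidates and a list for ordered ones.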
sky/resources.py
Outdated
def from_yaml_config(
    cls, config: Optional[Dict[str, Any]]
) -> Union[Set['Resources'], List['Resources']]:
    common_utils.validate_schema(config, schemas.get_resources_schema(),
I'm slightly confused about why we could still use the old schema check. Should we add the `any_of` and `ordered` fields in the resources schema?
We already have the `any_of` and `ordered` fields included in the schema in #2498
Co-authored-by: Tian Xia <[email protected]>
Could you double check if you are using the correct commit? It seems to be working correctly for me on the current branch.
Thanks for adding this! Pls ignore the error above; seems like I used the wrong branch. LGTM!
sky/resources.py
Outdated
    return type(accelerators)(tmp_resources_list)
else:
    with ux_utils.print_exception_no_traceback():
        raise RuntimeError('Accelerators must be a list or a set.')
- raise RuntimeError('Accelerators must be a list or a set.')
+ raise RuntimeError('Accelerators must be a list or a set when multiple accelerators are specified.')
Good call! Actually, I think this should be an assertion instead of a raise, as the type of the resources should have been guaranteed by the schema check.
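The point about using an assertion can be illustrated with a small sketch. It is not the code in `sky/resources.py`: `expand_accelerators` is a hypothetical helper, candidates are modeled as sorted tuples of key/value pairs so they can live in a set, and it assumes schema validation has already rejected other container types. It reuses the `type(accelerators)(...)` idiom from the diff so a list stays a list (ordered) and a set stays a set (unordered).

```python
from typing import Any, Dict, List, Set, Tuple, Union

# Hypothetical hashable stand-in for one resources candidate.
Candidate = Tuple[Tuple[str, Any], ...]


def expand_accelerators(
    base: Dict[str, Any],
    accelerators: Union[List[str], Set[str]],
) -> Union[List[Candidate], Set[Candidate]]:
    """Create one candidate per accelerator, keeping the container type."""
    # A wrong type here would be a programming error, not a user error,
    # because the schema check is assumed to have run earlier.
    assert isinstance(accelerators, (list, set)), (
        'Accelerators must be a list or a set when multiple accelerators '
        'are specified.')
    candidates = [
        tuple(sorted({**base, 'accelerators': acc}.items()))
        for acc in accelerators
    ]
    # list in -> list out (ordered); set in -> set out (unordered).
    return type(accelerators)(candidates)
```

One trade-off worth noting: assertions are skipped when Python runs with `-O`, so this choice only makes sense if the earlier schema check is the real guard for user input.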
Fixes #2831
Tested (run the relevant ones):
- bash format.sh
- sky launch -c test-resources echo hi
- sky launch -c test-resources --cpus 2 echo hi
- sky spot launch -n test-resources --cpus 2 echo hi with ~/.sky/config.yaml specifying controller's cloud.
- sky spot launch -n test-resources --cpus 2 "echo hi; sleep 10000", then manually terminate and recover.
- sky serve up examples/serve/http_server/task.yaml
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- pytest tests/test_smoke.py::test_multiple_accelerators_ordered_with_default
- pytest tests/test_smoke.py::test_multiple_accelerators_unordered_with_default
- bash tests/backward_comaptibility_tests.sh