-
Notifications
You must be signed in to change notification settings - Fork 534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Storage] Use task's region for initializing new stores #3319
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @romilbhardwaj. Tested the service yaml in #3114 as well and it did create a bucket for workdir for the correct best_resources:
...
I 03-16 08:56:55 optimizer.py:907] ----------------------------------------------------------------------------------------------------
I 03-16 08:56:55 optimizer.py:907] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 03-16 08:56:55 optimizer.py:907] ----------------------------------------------------------------------------------------------------
I 03-16 08:56:55 optimizer.py:907] AWS g5.2xlarge[Spot] 8 32 A10G:1 eu-central-1a 0.53 ✔
I 03-16 08:56:55 optimizer.py:907] AWS g4dn.12xlarge[Spot] 48 192 T4:4 eu-west-3c 0.66
I 03-16 08:56:55 optimizer.py:907] ----------------------------------------------------------------------------------------------------
I 03-16 08:56:55 optimizer.py:907]
...
I 03-16 08:56:58 controller_utils.py:423] Translating workdir to SkyPilot Storage...
I 03-16 08:56:58 controller_utils.py:448] Workdir '.' will be synced to cloud storage 'skypilot-workdir-zongheng-62d03895'.
I 03-16 08:56:58 controller_utils.py:521] Uploading sources to cloud storage. See: sky storage ls
I 03-16 08:57:00 storage.py:1375] Created S3 bucket skypilot-workdir-zongheng-62d03895 in eu-central-1
However, the service creation subsequently failed:
...
I 03-16 08:59:50 cloud_vm_ray_backend.py:3089] Setup completed.
I 03-16 08:59:59 cloud_vm_ray_backend.py:3172] Job submitted with Job ID: 1
E 03-16 09:01:01 subprocess_utils.py:84] ValueError: Failed to register service 'sky-service-eec1' on the SkyServe controller. Reason:
E 03-16 09:01:01 subprocess_utils.py:84]
E 03-16 09:01:01 subprocess_utils.py:84] AWS: Fetching availability zones mapping...INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
E 03-16 09:01:01 subprocess_utils.py:84] Traceback (most recent call last):
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
E 03-16 09:01:01 subprocess_utils.py:84] return _run_code(code, main_globals, None,
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
E 03-16 09:01:01 subprocess_utils.py:84] exec(code, run_globals)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/serve/service.py", line 258, in <module>
E 03-16 09:01:01 subprocess_utils.py:84] _start(args.service_name, args.task_yaml, args.job_id)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/serve/service.py", line 138, in _start
E 03-16 09:01:01 subprocess_utils.py:84] task = task_lib.Task.from_yaml(tmp_task_yaml)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/task.py", line 489, in from_yaml
E 03-16 09:01:01 subprocess_utils.py:84] return Task.from_yaml_config(config)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/task.py", line 422, in from_yaml_config
E 03-16 09:01:01 subprocess_utils.py:84] storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 979, in from_yaml_config
E 03-16 09:01:01 subprocess_utils.py:84] storage_obj = cls(name=name,
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 522, in __init__
E 03-16 09:01:01 subprocess_utils.py:84] self.add_store(StoreType.S3)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 830, in add_store
E 03-16 09:01:01 subprocess_utils.py:84] store = store_cls(
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1031, in __init__
E 03-16 09:01:01 subprocess_utils.py:84] super().__init__(name, source, region, is_sky_managed,
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 229, in __init__
E 03-16 09:01:01 subprocess_utils.py:84] self.initialize()
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1144, in initialize
E 03-16 09:01:01 subprocess_utils.py:84] self.client = data_utils.create_s3_client(self.region)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/data/data_utils.py", line 99, in create_s3_client
E 03-16 09:01:01 subprocess_utils.py:84] return aws.client('s3', region_name=region)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/adaptors/aws.py", line 90, in wrapper
E 03-16 09:01:01 subprocess_utils.py:84] return func(*args, **kwargs)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/adaptors/aws.py", line 183, in client
E 03-16 09:01:01 subprocess_utils.py:84] _assert_kwargs_builtin_type(kwargs)
E 03-16 09:01:01 subprocess_utils.py:84] File "/opt/conda/lib/python3.10/site-packages/sky/adaptors/aws.py", line 96, in _assert_kwargs_builtin_type
E 03-16 09:01:01 subprocess_utils.py:84] assert all(isinstance(v, (int, float, str)) for v in kwargs.values()), (
E 03-16 09:01:01 subprocess_utils.py:84] AssertionError: kwargs should not contain none built-in types: {'region_name': None}
E 03-16 09:01:01 subprocess_utils.py:84]
E 03-16 09:01:01 subprocess_utils.py:84]
RuntimeError: Failed to spin up the service. Please check the logs above for more details.
…o storage_add_region
Ready for another look! I've completed the TODOs from the PR description and added/updated smoke tests. Also manually verified that a minimal example of the service yaml from #3114 now works:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @romilbhardwaj!
Thanks for the reviews @concretevitamin! Merging now. |
Closes #3114. Uses the optimizer's
best_resources
or the the tasks' resource spec to infer the region for the bucket, if available.Note that currently the bucket is eagerly provisioned in the first region that will be tried by the provisioner. If failover is trigerred, the bucket and compute may lie in a different region. A better fix could have been to make store init lazy and create the bucket in the region where the VM actually gets provisioned, but that requires deeper changes + considerations on wasted VM time while data is uploaded to the object store.
TODO:
Tested:
task.resources
on GCP and AWS. Also tested that the same buckets can be successfully mounted by goofys/gcsfuse on tasks in other regions.pytest tests/test_smoke.py::TestStorageWithCredentials
for gcs and s3pytest tests/test_smoke.py::test_spot_storage --gcp
pytest tests/test_smoke.py::test_spot_storage --aws
pytest tests/test_smoke.py --aws
(Unrelated failures inspot_recovery_aws
andtest-skyserve-new-autoscaler-update
)pytest tests/test_smoke.py --gcp
(Unrelated failures insky-bench
,test_spot_tpu
)