Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Use task's region for initializing new stores #3319

Merged
merged 12 commits into from
Mar 25, 2024

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Mar 16, 2024

Closes #3114. Uses the optimizer's best_resources or the the tasks' resource spec to infer the region for the bucket, if available.

Note that currently the bucket is eagerly provisioned in the first region that will be tried by the provisioner. If failover is trigerred, the bucket and compute may lie in a different region. A better fix could have been to make store init lazy and create the bucket in the region where the VM actually gets provisioned, but that requires deeper changes + considerations on wasted VM time while data is uploaded to the object store.

TODO:

  • Testing across StoreTypes and regions

Tested:

  • YAML with and without region specified in task.resources on GCP and AWS. Also tested that the same buckets can be successfully mounted by goofys/gcsfuse on tasks in other regions.
  • pytest tests/test_smoke.py::TestStorageWithCredentials for gcs and s3
  • pytest tests/test_smoke.py::test_spot_storage --gcp
  • pytest tests/test_smoke.py::test_spot_storage --aws
  • pytest tests/test_smoke.py --aws (Unrelated failures in spot_recovery_aws and test-skyserve-new-autoscaler-update)
  • pytest tests/test_smoke.py --gcp (Unrelated failures in sky-bench, test_spot_tpu )

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @romilbhardwaj. Tested the service yaml in #3114 as well and it did create a bucket for workdir for the correct best_resources:

...
I 03-16 08:56:55 optimizer.py:907] ----------------------------------------------------------------------------------------------------
I 03-16 08:56:55 optimizer.py:907]  CLOUD   INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 03-16 08:56:55 optimizer.py:907] ----------------------------------------------------------------------------------------------------
I 03-16 08:56:55 optimizer.py:907]  AWS     g5.2xlarge[Spot]      8       32        A10G:1         eu-central-1a   0.53          ✔
I 03-16 08:56:55 optimizer.py:907]  AWS     g4dn.12xlarge[Spot]   48      192       T4:4           eu-west-3c      0.66
I 03-16 08:56:55 optimizer.py:907] ----------------------------------------------------------------------------------------------------
I 03-16 08:56:55 optimizer.py:907]
...

I 03-16 08:56:58 controller_utils.py:423] Translating workdir to SkyPilot Storage...
I 03-16 08:56:58 controller_utils.py:448] Workdir '.' will be synced to cloud storage 'skypilot-workdir-zongheng-62d03895'.
I 03-16 08:56:58 controller_utils.py:521] Uploading sources to cloud storage. See: sky storage ls
I 03-16 08:57:00 storage.py:1375] Created S3 bucket skypilot-workdir-zongheng-62d03895 in eu-central-1

However, the service creation subsequently failed:

...
I 03-16 08:59:50 cloud_vm_ray_backend.py:3089] Setup completed.
I 03-16 08:59:59 cloud_vm_ray_backend.py:3172] Job submitted with Job ID: 1

E 03-16 09:01:01 subprocess_utils.py:84] ValueError: Failed to register service 'sky-service-eec1' on the SkyServe controller. Reason:
E 03-16 09:01:01 subprocess_utils.py:84]
E 03-16 09:01:01 subprocess_utils.py:84] AWS: Fetching availability zones mapping...INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
E 03-16 09:01:01 subprocess_utils.py:84] Traceback (most recent call last):
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
E 03-16 09:01:01 subprocess_utils.py:84]     return _run_code(code, main_globals, None,
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
E 03-16 09:01:01 subprocess_utils.py:84]     exec(code, run_globals)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/service.py", line 258, in <module>
E 03-16 09:01:01 subprocess_utils.py:84]     _start(args.service_name, args.task_yaml, args.job_id)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/service.py", line 138, in _start
E 03-16 09:01:01 subprocess_utils.py:84]     task = task_lib.Task.from_yaml(tmp_task_yaml)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/task.py", line 489, in from_yaml
E 03-16 09:01:01 subprocess_utils.py:84]     return Task.from_yaml_config(config)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/task.py", line 422, in from_yaml_config
E 03-16 09:01:01 subprocess_utils.py:84]     storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 979, in from_yaml_config
E 03-16 09:01:01 subprocess_utils.py:84]     storage_obj = cls(name=name,
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 522, in __init__
E 03-16 09:01:01 subprocess_utils.py:84]     self.add_store(StoreType.S3)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 830, in add_store
E 03-16 09:01:01 subprocess_utils.py:84]     store = store_cls(
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1031, in __init__
E 03-16 09:01:01 subprocess_utils.py:84]     super().__init__(name, source, region, is_sky_managed,
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 229, in __init__
E 03-16 09:01:01 subprocess_utils.py:84]     self.initialize()
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1144, in initialize
E 03-16 09:01:01 subprocess_utils.py:84]     self.client = data_utils.create_s3_client(self.region)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/data/data_utils.py", line 99, in create_s3_client
E 03-16 09:01:01 subprocess_utils.py:84]     return aws.client('s3', region_name=region)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/adaptors/aws.py", line 90, in wrapper
E 03-16 09:01:01 subprocess_utils.py:84]     return func(*args, **kwargs)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/adaptors/aws.py", line 183, in client
E 03-16 09:01:01 subprocess_utils.py:84]     _assert_kwargs_builtin_type(kwargs)
E 03-16 09:01:01 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/adaptors/aws.py", line 96, in _assert_kwargs_builtin_type
E 03-16 09:01:01 subprocess_utils.py:84]     assert all(isinstance(v, (int, float, str)) for v in kwargs.values()), (
E 03-16 09:01:01 subprocess_utils.py:84] AssertionError: kwargs should not contain none built-in types: {'region_name': None}
E 03-16 09:01:01 subprocess_utils.py:84]
E 03-16 09:01:01 subprocess_utils.py:84]
RuntimeError: Failed to spin up the service. Please check the logs above for more details.

sky/task.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator Author

Ready for another look! I've completed the TODOs from the PR description and added/updated smoke tests. Also manually verified that a minimal example of the service yaml from #3114 now works:

service:
  readiness_probe: /
  replicas: 1

resources:
  use_spot: true
  cpus: 4+
  ordered:
    - region: eu-central-1
    - region: eu-west-1
    - region: eu-west-2
    - region: eu-west-3
    - region: eu-north-1
  ports:
    - 8000

num_nodes: 1

workdir: .

setup: |
  echo hi

run: |
  pwd
  cat hello.txt
  python -m http.server 8000

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @romilbhardwaj!

sky/data/data_utils.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator Author

Thanks for the reviews @concretevitamin! Merging now.

@romilbhardwaj romilbhardwaj merged commit eb442b0 into master Mar 25, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the storage_add_region branch March 25, 2024 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug/Feature Request] sky serve doesn't respect regions bucket
2 participants