[Jobs] Allowing to specify intermediate bucket for file upload #4257
base: master
Conversation
Thanks @zpoint! Left some comments about the design, correctness and lifecycle management for the bucket contents.
sky/utils/schemas.py
Outdated
jobs_configs['properties']['bucket'] = {
    'type': 'string',
    'pattern': '^(https|s3|gs|r2|cos)://.+',
    'required': []
}
Maybe we should include it in controller_resources_schema and enable this feature for SkyServe as well?
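A minimal sketch of what that schema entry might look like, mirroring the `bucket` property from the diff above (the actual structure of controller_resources_schema in sky/utils/schemas.py may differ; names here are illustrative):

```python
import re

# Illustrative sketch only: a `bucket` property added to the controller
# resources schema, reusing the same URI pattern as the jobs schema.
controller_resources_schema = {
    'type': 'object',
    'properties': {
        'bucket': {
            'type': 'string',
            # Same pattern as in the jobs_configs diff above.
            'pattern': '^(https|s3|gs|r2|cos)://.+',
        },
    },
}

def is_valid_bucket_url(url: str) -> bool:
    """Check a bucket URL against the schema's pattern."""
    pattern = controller_resources_schema['properties']['bucket']['pattern']
    return re.match(pattern, url) is not None
```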
Should we add this to our docs https://github.com/skypilot-org/skypilot/blob/master/docs/source/reference/config.rst?
Added to controller_resources_schema. Could you point me to the code I'd need to change to enable this for SkyServe?
Thanks @zpoint! I'm trying to run this yaml:
The task output should show me contents of my workdir and file_mount. Instead, I get:
Related to this: #4257 (comment)?
Co-authored-by: Romil Bhardwaj <[email protected]>
Thanks @zpoint - I tested the functionality with an external s3 bucket, works well. Left some comments.
Thanks @zpoint. Left some comments on the new interfaces added in the latest commits.
sky/data/storage.py
Outdated
def delete_sub_path(self) -> None:
    """Removes objects from the sub path in the bucket."""
I feel delete_sub_path should be an internal method that is called by delete() when a _bucket_sub_path is set.
In summary:
- If a Storage object is initialized with _bucket_sub_path and the bucket already existed, is_sky_managed should be false and Storage.delete() should just delete the sub_path.
- If a Storage object is initialized with _bucket_sub_path and the bucket did not already exist, is_sky_managed should be true and Storage.delete() should delete the entire bucket.
- If a Storage object is not initialized with any _bucket_sub_path, we follow current behavior.
Then delete_sub_path() will not be needed as a public method of this class.
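The proposed lifecycle could be sketched as follows (illustrative only, not the actual SkyPilot Storage class; constructor arguments are hypothetical stand-ins for how the real class learns whether the bucket pre-existed):

```python
# Sketch of the delete() rule proposed above: remove only the sub-path
# when the bucket pre-existed, remove the whole bucket when Sky created it.
class Storage:
    def __init__(self, name, bucket_sub_path=None, bucket_already_existed=False):
        self.name = name
        self._bucket_sub_path = bucket_sub_path
        # Sky "manages" the bucket only if it created it.
        self.is_sky_managed = not bucket_already_existed

    def delete(self) -> str:
        if self._bucket_sub_path is not None and not self.is_sky_managed:
            # Pre-existing bucket: only clean up our sub-path.
            return self._delete_sub_path()
        # Bucket was created by Sky (or no sub-path set): remove it entirely.
        return self._delete_bucket()

    def _delete_sub_path(self) -> str:
        # Real code would delete objects under the prefix via the cloud SDK.
        return f'deleted {self.name}/{self._bucket_sub_path}'

    def _delete_bucket(self) -> str:
        return f'deleted bucket {self.name}'
```

Note that _delete_sub_path is private here, matching the suggestion that it not be part of the public interface.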
Yeah, sounds good
The controller loses information about is_sky_managed. It is determined by is_new_bucket during creation, so is_sky_managed is True locally but False on the controller. I have to include this parameter in config.yaml and upload the info to the controller via the yaml config.
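A hypothetical sketch of how the flag might be carried in the config synced to the controller (the key names below are illustrative, not the actual SkyPilot yaml schema):

```yaml
# Illustrative only: storage entry uploaded to the controller, carrying
# the locally-determined flag so the controller does not recompute it
# from bucket existence (where it would wrongly come out False).
file_mounts:
  /workdir:
    name: intermediate-bucket
    _bucket_sub_path: job-abc123/workdir
    _is_sky_managed: true   # computed locally from is_new_bucket
```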
Done
Co-authored-by: Romil Bhardwaj <[email protected]>
sky/data/storage.py
Outdated
is_sky_managed: Optional[bool] = None,
sync_on_reconstruction: Optional[bool] = True,
_bucket_sub_path: Optional[str] = None,  # pylint: disable=invalid-name
_allow_bucket_creation: bool = True):  # pylint: disable=invalid-name
Why is this _allow_bucket_creation arg needed?
Can we instead check the store.is_sky_managed field after the store's initialization to determine whether the bucket was created, and raise an error if it was?
E.g., near L727 in controller_utils.py, we can initialize the store object, check is_sky_managed, and raise an error appropriately.
store = store_cls(name=bucket_name,
                  source=workdir,
                  _bucket_sub_path=bucket_sub_path,
                  **store_kwargs)
if store.is_sky_managed:
    # Bucket was created; this should not happen since the bucket
    # is expected to already exist. Clean up and raise an error.
    store.delete()
    raise ...
Let me know if I'm missing something. It would be great if we could avoid adding new arguments to the Store/Storage interface.
We can do that in controller_utils.py. This avoids adding a parameter but keeps the logic outside of the store. If we're not going to reuse the bucket-creation restriction elsewhere, it is fine to handle it in controller_utils.py.
@romilbhardwaj Review reminder 😊
Feature for #3978
- Allows specifying jobs: bucket in ~/.sky/config.yaml. Supported stores: s3, gcs, azure, r2, ibm.
- If not specified, falls back to the auto-created bucket skypilot-filemounts-{username}-{run_id} as before.
- Adds _bucket_sub_path (supports both COPY and MOUNT based on the _bucket_sub_path; the user will only see the subpath instead of the whole bucket if a subpath is specified). The leading _ means internal use only; remove the _ to expose it in the future.
- Smoke tests:
  pytest -s tests/test_smoke.py::test_managed_jobs_intermediate_storage
  pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --gcp
  verify that _bucket_sub_path works as expected.

For managed jobs, previously:
workdir -> {tmp-bucket-A}/workdir
file_mount -> {tmp-bucket-B}{i}
tmp_files -> {tmp-bucket-C}/tmp-files

With an intermediate bucket:
workdir -> {intermediate-bucket}/job-{run_id}/workdir
file_mount -> {intermediate-bucket}/job-{run_id}/local-file-mounts/{i}
tmp_files -> {intermediate-bucket}/job-{run_id}/tmp-files

The {intermediate-bucket}/job-{run_id} path will be deleted automatically, as before. Even when using the same bucket, each job creates its own subdirectory based on the {run-id}, and subdirectories of auto-created files are cleaned up after the job is finished.
This allows users who don't have permission to create a bucket to manually specify a bucket name under ~/.sky/config.yaml.
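The per-job sub-path layout above can be sketched as a small helper (the function name and signature are hypothetical, not SkyPilot's actual code; the path templates match the mapping described above):

```python
# Sketch: compose the per-job sub-paths inside the intermediate bucket,
# so every job gets its own job-{run_id} subdirectory.
def job_sub_paths(run_id: str, num_file_mounts: int) -> dict:
    prefix = f'job-{run_id}'
    return {
        'workdir': f'{prefix}/workdir',
        'file_mounts': [
            f'{prefix}/local-file-mounts/{i}' for i in range(num_file_mounts)
        ],
        'tmp_files': f'{prefix}/tmp-files',
    }
```

Because everything for a job lives under one prefix, cleanup after the job finishes is a single prefix delete rather than three separate bucket deletions.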
.Test plan
smoke test
custom test
(sky) ➜ cat ~/.sky/config.yaml jobs: bucket: s3://zpoint-bucket-s3/
Launch
Looks good
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh