Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Jobs] Allowing to specify intermediate bucket for file upload #4257

Open
wants to merge 56 commits into
base: master
Choose a base branch
from

Conversation

zpoint
Copy link
Collaborator

@zpoint zpoint commented Nov 4, 2024

Feature for #3978

  • Add a config jobs: bucket in ~/.sky/config.yaml.
    • Support s3, gcs, azure, r2, ibm
    • If this bucket is specified, the bucket name and storage type will be used. Otherwise, one will be generated named: skypilot-filemounts-{username}-{run_id}.
    • Add an attribute for storage YAML config: _bucket_sub_path(Support both COPY and MOUNT based on the _bucket_sub_path, The user will only see the subpath instead of the whole bucket if a subpath is specified)
      • _ means internal use only, remove _ to expose in the future
    • smoke test
      • Add a smoke test case: pytest -s tests/test_smoke.py::test_managed_jobs_intermediate_storage
        • Verify that the new config file is loaded and intermediate bucket names in the config are used.
      • Add a test case: pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --gcp
        • Verify that the bucket subdirectory file sync and deletion have the expected behavior. Mainly, ensure the _bucket_sub_path works as expected.

For managed jobs:

  • All these file sync operations will now use the intermediate buckets. Previously, each file sync would create a new temporary bucket:
    • Previous:
      • workdir specified in the task.yaml workdir -> {tmp-bucket-A}/workdir
      • Local file mount specified in the task.yaml file_mount -> {tmp-bucket-B}{i}
      • Temporary files tmp_files -> {tmp-bucket-C}/tmp-files
    • Now:
      • workdir specified in the task.yaml workdir -> {intermediate-bucket}/job-{run_id}/workdir
      • Local file mount specified in the task.yaml file_mount -> {intermediate-bucket}/job-{run_id}/local-file-mounts/{i}
      • Temporary files tmp_files -> {intermediate-bucket}/job-{run_id}/tmp-files
  • Clean up
    • After the managed job finishes, when cleanup is triggered
      • If user doesn't specific bucket config, sky generated {intermediate-bucket}/job-{run_id} bucket will be deleted automatically as before.
      • If user specific bucket name in config, the directory created above will be deleted, the bucket still exists after that, and the user needs to manually delete the bucket as usual.(Due to potential race problem)
    • Currently aws, gcs, r2 use the command line support to remove the directory, while cos and https(azure) use the SDK to list all files under directory and then delete

Even using the same bucket, each job creates its own subdirectory based on the '{run-id}', and subdirectories of auto-created files will be cleaned up after the job is finished.

This allows users who don't have permission to create a bucket to manually specify a bucket name under ~/.sky/config.yaml.

Test plan

smoke test

pytest -s tests/test_smoke.py::test_managed_jobs_storage
pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --aws

custom test

(sky) ➜  cat ~/.sky/config.yaml
jobs:
  bucket: s3://zpoint-bucket-s3/
(sky) ➜ cat ~/Desktop/hello-sky/work_dir_1.yaml
name: test_workdir_bucket_name_1

workdir: .

resources:
  cloud: aws
  instance_type: t3.small

file_mounts:
  # this will use the user config
  /checkpoint:
    name: zpoint-filemounts-bucket
    source: ~/Desktop/dir1
    mode: MOUNT
    store: azure

  # these will all use same bucket configured in ~/.sky/config.yaml jobs->bucket now for bucket storage
  /dir1: ~/Desktop/dir1
  /dir2: ~/Desktop/dir2
  /dir3/dir3.py: ~/Desktop/dir1/dir1.py

run: |
 sleep 10
  ls /checkpoint
  ls .

Launch

(sky) ➜ sky jobs launch ~/Desktop/hello-sky/work_dir_1.yaml
⠋ Syncing ~/Desktop/dir1 -> https://xxx/zpoint-filemounts-bucket/
⠹ Syncing . -> s3://zpoint-bucket-s3/
⠧ Syncing ~/Desktop/dir1 -> s3://zpoint-bucket-s3/
⠦ Syncing ~/Desktop/dir2 -> s3://zpoint-bucket-s3/
⠴ Syncing /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/skypilot-filemounts-files-aca97801 -> s3://zpoint-bucket-s3/

Looks good

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@romilbhardwaj romilbhardwaj self-requested a review November 6, 2024 18:42
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint! Left some comments about the design, correctness and lifecycle management for the bucket contents.

sky/utils/controller_utils.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from romilbhardwaj November 8, 2024 09:13
sky/data/storage.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from romilbhardwaj November 9, 2024 15:39
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
Comment on lines 715 to 719
jobs_configs['properties']['bucket'] = {
'type': 'string',
'pattern': '^(https|s3|gs|r2|cos)://.+',
'required': []
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should include it in controller_resources_schema and enable this feature for SkyServe as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adde to controller_resources_schema

Could u point me out the code to change if I need to enable for SkyServe?

sky/data/storage.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator

Thanks @zpoint! I'm trying to run this yaml:

resources:
  cloud: aws

file_mounts:
  ~/aws: ~/aws

workdir: ~/tmp-workdir

num_nodes: 1

run: |
  echo "Hello, world!"
  ls ~/aws
  ls .

The task output should show me contents of my workdir and file_mount. Instead, I get:

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-3ba1-romilb, pid=2100) Hello, world!
(sky-3ba1-romilb, pid=2100) job-89ab894f
(sky-3ba1-romilb, pid=2100) job-89ab894f
✓ Managed job finished: 1 (status: SUCCEEDED).

Related to this: #4257 (comment)?

@zpoint zpoint requested a review from romilbhardwaj November 12, 2024 10:01
docs/source/reference/config.rst Outdated Show resolved Hide resolved
docs/source/reference/config.rst Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint - I tested the functionality with an external s3 bucket, works well. Left some comments.

sky/data/storage.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from romilbhardwaj December 2, 2024 15:27
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint. Left some comments on the new interfaces added in the latest commits.

docs/source/reference/config.rst Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
Comment on lines 409 to 410
def delete_sub_path(self) -> None:
"""Removes objects from the sub path in the bucket."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel delete_sub_path should be an internal method that is called by delete when a _bucket_sub_path is set.

In summary:

  • If a Storage object is initialized with _bucket_sub_path and the bucket already existed, is_sky_managed should be false and Storage.delete() should just delete the sub_path.
  • If a Storage object is initialized with _bucket_sub_path and the bucket did not already exist, is_sky_managed should be true and Storage.delete() should just delete the entire bucket.
  • If a Storage object is not initialized with any _bucket_sub_path, we follow current behavior.

Then delete_sub_path() will not be needed as a public method of this class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sounds good

Copy link
Collaborator Author

@zpoint zpoint Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The controller loses information about is_sky_managed. It is determined by is_new_bucket during creation, so is_sky_managed is True locally but False on the controller. I have to include it in config.yaml by adding this parameter, And upload this info to controller via yaml config.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

sky/data/storage.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from romilbhardwaj December 14, 2024 11:05
sky/data/storage.py Outdated Show resolved Hide resolved
is_sky_managed: Optional[bool] = None,
sync_on_reconstruction: Optional[bool] = True,
_bucket_sub_path: Optional[str] = None, # pylint: disable=invalid-name
_allow_bucket_creation: bool = True): # pylint: disable=invalid-name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this _allow_bucket_creation arg needed?

Can we instead check store.is_sky_managed field after the store's initialization to check if bucket was created, and if it was created then raise an error?

Eg. near L727 in controller_utils.py, we can initialize the store object and check is_sky_managed, and raise error appropriately.

store = store_cls(name=bucket_name,
                        source=workdir,
                        _bucket_sub_path=bucket_sub_path,
                        **store_kwargs)
if store.is_sky_managed:
  # Bucket was created, this should not happen since the bucket already exists
  # cleanup and raise error
  store.delete()
  raise ...

Let me know if I'm missing something. It would be great if we could avoid adding new arguments to the Store/Storage interface.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that in controller_utils.py. This avoids adding a parameter but keeps the logic outside of the store. If we're not going to reuse the limitation of creating a bucket, it will be fine to handle it in controller_utils.py

sky/task.py Show resolved Hide resolved
tests/test_yamls/intermediate_bucket.yaml Show resolved Hide resolved
@zpoint zpoint requested a review from romilbhardwaj December 20, 2024 04:34
@zpoint
Copy link
Collaborator Author

zpoint commented Dec 22, 2024

@romilbhardwaj Review reminder 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants