-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Storage] Azure blob storage support #3032
[Storage] Azure blob storage support #3032
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @landscapepainter! Just tested it and seems we don't handle the non-existance bucket well (see comments below). Otherwise, it looks good to me.
sky/cloud_stores.py
Outdated
try: | ||
for blob in container_client.list_blobs(name_starts_with=path): | ||
if blob.name == path: | ||
return False | ||
num_objects += 1 | ||
if num_objects > 1: | ||
return True | ||
except azure.exceptions().HttpResponseError as e: | ||
# Handle case where user lacks sufficient IAM role for | ||
# a private container in the same subscription. Attempt to | ||
# assign appropriate role to current user. | ||
if 'AuthorizationPermissionMismatch' in str(e): | ||
if not role_assigned: | ||
logger.info('Failed to list blobs in container ' | ||
f'{container_url!r}. This implies ' | ||
'insufficient IAM role for storage account' | ||
f' {storage_account_name!r}.') | ||
azure.assign_storage_account_iam_role( | ||
storage_account_name=storage_account_name, | ||
resource_group_name=resource_group_name) | ||
role_assigned = True | ||
refresh_client = True | ||
else: | ||
logger.info( | ||
'Waiting due to the propagation delay of IAM ' | ||
'role assignment to the storage account ' | ||
f'{storage_account_name!r}.') | ||
time.sleep( | ||
constants.RETRY_INTERVAL_AFTER_ROLE_ASSIGNMENT) | ||
continue | ||
raise | ||
# A directory with few or no items | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: keep the try block small
try: | |
for blob in container_client.list_blobs(name_starts_with=path): | |
if blob.name == path: | |
return False | |
num_objects += 1 | |
if num_objects > 1: | |
return True | |
except azure.exceptions().HttpResponseError as e: | |
# Handle case where user lacks sufficient IAM role for | |
# a private container in the same subscription. Attempt to | |
# assign appropriate role to current user. | |
if 'AuthorizationPermissionMismatch' in str(e): | |
if not role_assigned: | |
logger.info('Failed to list blobs in container ' | |
f'{container_url!r}. This implies ' | |
'insufficient IAM role for storage account' | |
f' {storage_account_name!r}.') | |
azure.assign_storage_account_iam_role( | |
storage_account_name=storage_account_name, | |
resource_group_name=resource_group_name) | |
role_assigned = True | |
refresh_client = True | |
else: | |
logger.info( | |
'Waiting due to the propagation delay of IAM ' | |
'role assignment to the storage account ' | |
f'{storage_account_name!r}.') | |
time.sleep( | |
constants.RETRY_INTERVAL_AFTER_ROLE_ASSIGNMENT) | |
continue | |
raise | |
# A directory with few or no items | |
return True | |
try: | |
blobs = container_client.list_blobs(name_starts_with=path) | |
except azure.exceptions().HttpResponseError as e: | |
# Handle case where user lacks sufficient IAM role for | |
# a private container in the same subscription. Attempt to | |
# assign appropriate role to current user. | |
if 'AuthorizationPermissionMismatch' in str(e): | |
if not role_assigned: | |
logger.info('Failed to list blobs in container ' | |
f'{container_url!r}. This implies ' | |
'insufficient IAM role for storage account' | |
f' {storage_account_name!r}.') | |
azure.assign_storage_account_iam_role( | |
storage_account_name=storage_account_name, | |
resource_group_name=resource_group_name) | |
role_assigned = True | |
refresh_client = True | |
else: | |
logger.info( | |
'Waiting due to the propagation delay of IAM ' | |
'role assignment to the storage account ' | |
f'{storage_account_name!r}.') | |
time.sleep( | |
constants.RETRY_INTERVAL_AFTER_ROLE_ASSIGNMENT) | |
continue | |
raise | |
for blob in blobs: | |
if blob.name == path: | |
return False | |
num_objects += 1 | |
if num_objects > 1: | |
return True | |
# A directory with few or no items | |
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I tried at first, but the error is raised when we iterate through the object within the for-loop for blob in blobs:
rather than when the object is obtained with blobs = container_client.list_blobs(name_starts_with=path)
. So the suggested code won't be able to catch the error.
sky/backends/cloud_vm_ray_backend.py
Outdated
try: | ||
if storage.is_directory(src): | ||
sync_cmd = (storage.make_sync_dir_command( | ||
source=src, destination=wrapped_dst)) | ||
# It is a directory so make sure it exists. | ||
mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}' | ||
else: | ||
sync_cmd = (storage.make_sync_file_command( | ||
source=src, destination=wrapped_dst)) | ||
# It is a file so make sure *its parent dir* exists. | ||
mkdir_for_wrapped_dst = ( | ||
f'mkdir -p {os.path.dirname(wrapped_dst)}') | ||
except Exception as e: # pylint: disable=broad-except | ||
logger.error( | ||
f'Failed to fetch from the bucket {src!r} to ' | ||
f'remote instance at {dst!r}.\n' | ||
'Error details: ' | ||
f'{common_utils.format_exception(e, use_bracket=True)}.') | ||
# If 'cmd' was appended to 'symlink_commands' for this sync, we | ||
# remove as it failed to sync. | ||
if not dst.startswith('~/') and not dst.startswith('/tmp/'): | ||
symlink_commands.pop() | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this, it seems we skip the failed buckets that do not exist, which misaligns with our previous behavior for other cloud buckets. We should probably not skip all errors here.
file_mounts:
/dst: s3://some-non-exist-bucket-zhwu
Also, it seems for an Azure blob that does not exist, unlike the other clouds, it raises error for the the sync command execution below. Can we get it fail here when we call list_blobs
file_mounts:
/test: https://skyeastusa37461fd.blob.core.windows.net/some-non-exist-bucket-zhwu
I 07-15 22:12:46 backend_utils.py:1336] Syncing (to 1 node): https://skyeastusa37461fd.blob.core.windows.net/some-non-exist-bucket-zhwu -> /test
E 07-15 22:12:47 subprocess_utils.py:84] INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
E 07-15 22:12:47 subprocess_utils.py:84]
E 07-15 22:12:47 subprocess_utils.py:84] Job 0c4734a3-9317-ad47-5ecc-0978846fbf69 has started
E 07-15 22:12:47 subprocess_utils.py:84] Log file is located at: /home/azureuser/.azcopy/0c4734a3-9317-ad47-5ecc-0978846fbf69.log
E 07-15 22:12:47 subprocess_utils.py:84]
E 07-15 22:12:47 subprocess_utils.py:84]
E 07-15 22:12:47 subprocess_utils.py:84] Cannot perform sync due to error: cannot list files due to reason -> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /home/vsts/go/pkg/mod/github.com/!azure/[email protected]/azblob/zc_storage_error.go:42
E 07-15 22:12:47 subprocess_utils.py:84] ===== RESPONSE ERROR (ServiceCode=ContainerNotFound) =====
E 07-15 22:12:47 subprocess_utils.py:84] Description=The specified container does not exist.
E 07-15 22:12:47 subprocess_utils.py:84] RequestId:8316c808-001e-005c-0b04-d71755000000
E 07-15 22:12:47 subprocess_utils.py:84] Time:2024-07-15T22:12:47.2300060Z, Details:
E 07-15 22:12:47 subprocess_utils.py:84] Code: ContainerNotFound
E 07-15 22:12:47 subprocess_utils.py:84] GET https://skyeastusa37461fd.blob.core.windows.net/some-non-exist-bucket-zhwu?comp=list&delimiter=%2F&include=metadata&restype=container&se=2024-07-15t23%3A12%3A46z&sig=-REDACTED-&sp=rcwl&sr=c&sv=2024-05-04&timeout=901
E 07-15 22:12:47 subprocess_utils.py:84] User-Agent: [AzCopy/10.17.0 Azure-Storage/0.15 (go1.19.2; linux)]
E 07-15 22:12:47 subprocess_utils.py:84] X-Ms-Client-Request-Id: [79bd1b04-f229-46c8-5053-44fb7d1bc8c5]
E 07-15 22:12:47 subprocess_utils.py:84] X-Ms-Version: [2020-10-02]
E 07-15 22:12:47 subprocess_utils.py:84] --------------------------------------------------------------------------------
E 07-15 22:12:47 subprocess_utils.py:84] RESPONSE Status: 404 The specified container does not exist.
E 07-15 22:12:47 subprocess_utils.py:84] Content-Length: [225]
E 07-15 22:12:47 subprocess_utils.py:84] Content-Type: [application/xml]
E 07-15 22:12:47 subprocess_utils.py:84] Date: [Mon, 15 Jul 2024 22:12:47 GMT]
E 07-15 22:12:47 subprocess_utils.py:84] Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]
E 07-15 22:12:47 subprocess_utils.py:84] X-Ms-Client-Request-Id: [79bd1b04-f229-46c8-5053-44fb7d1bc8c5]
E 07-15 22:12:47 subprocess_utils.py:84] X-Ms-Error-Code: [ContainerNotFound]
E 07-15 22:12:47 subprocess_utils.py:84] X-Ms-Request-Id: [8316c808-001e-005c-0b04-d71755000000]
E 07-15 22:12:47 subprocess_utils.py:84] X-Ms-Version: [2020-10-02]
E 07-15 22:12:47 subprocess_utils.py:84]
E 07-15 22:12:47 subprocess_utils.py:84]
E 07-15 22:12:47 subprocess_utils.py:84]
E 07-15 22:12:47 subprocess_utils.py:84]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Zhanghao Wu <[email protected]>
@Michaelvll This is ready for another look! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this @landscapepainter! LGTM.
* first commit * nit * implement fetching bucket * update batch sync * support file name with empty space sync * support blobfuse2 mount/container name validate * support container deletion * support download from container to remote vm * complete download from container to remote vm * update mounting tool blobfuse2 download command * update mounting command * _CREDENTIALS_FILES list update * add smoke test * update storage comment * update download commands to use account key * add account-key for upload * nit * nit fix * data_utils fix * nit * nit * add comments * nit smoke * implement verify_az_bucket * smoke test update and nit mounting_utils * config schema update * support public container usage * nit fix for private bucket test * update _get_bucket to use from_container_url * add _download_file * nit * fix mounting blobfuse2 issues * nit * format * nit * container client fix * smoke test update private_bucket * azure get_client update to use exists() * nit * udpate fetch command for public containers * nit * update fetching command for public containers * silence client logging when used with public containers * az cli and blobfuse installation update * update for faster container client fetch * Handle private container without access * update private container without access smoke test * change due to merging master branch * updates from merging master * update mounting smoke test * mounting smoke test update * remove logger restriction * update comments * update verify_az_bucket to use for both private and public * update comments and formatting * update delete_az_bucket * az cli installation versioning * update logging silence logic for get_client * support azcopy for fetching * update sas token generation with az-cli * propagation hold * merge fix * add support to assign role to access storage account * nit * silence logging from httpx request to get object_id * checks existance of storage account and resource group before creation * create storage account for different regions * fix source name when translating local file mounts for spot sync * smoke test update for storage account names * removing az-cli installation from cloud_stores.py * nit * update sas token generation to use python sdk * nit * Update sky/data/storage.py Co-authored-by: Tian Xia <[email protected]> * move sas token generating functions from data_utils to adaptors.azure * use constant string format to obtain container url * nit * add comment for '/' and azcopy syntax * refactor AzureBlobCloudStorage methods * nit * format * nit * update test storage mount yaml j2 * added rich status message for storage account and resource group creation * update rich status message when creating storage account and resource group * nit * Error handle for when storage account creation did not yet propagate to system * comment update * merge error output into exception message * error comment * additional error handling when creating storage account * nit * update to use raw container url endpoint instead of 'az://' * update config.yaml interface * remove resource group existance check * add more comments for az mount command * nit * add more exception handling for storage account initialization * Remove lru cache decorator from sas token generating functions * nit * nit * Revert back to check if the resource group exists before running command to create. * refactor function to obtain resource group and storage account * nit * add support for storage account under AzureBlobStoreMetadata * set default file permission to be 755 for mounting * Update sky/adaptors/azure.py Co-authored-by: Tian Xia <[email protected]> * nit * nit fixes * format and update error handling * nit fixes * set default storage account and resource group name as string constant * update error handle. * additional error handle for else branch * Additional error handling * nit * update get_az_storage_account_key to replace try-exception with if statement * nit * nit * nit * format * update public container example as not accessible anymore * nit * file_bucket_name update * add StoreType method to retrieve bucket endpoint url * format * add azure storage blob dependency installation for controller * fix fetching methods * nit * additional docstr for _get_storage_account_and_resource_group * nit * update blobfuse2 cache directory * format * refactor get_storage_account_key method * update docker storage mounts smoke test * sleep for storage account creation to propagate * handle externally removed storage account being fetched * format * nit * add logic to retry for role assignment * add comment to _create_storage_account method * additional error handling for role assignment * format * nit * Update sky/adaptors/azure.py Co-authored-by: Zhanghao Wu <[email protected]> * additional installation check for azure blob storage dependencies * format * update step 7 from maybe_translate_local_file_mounts_and_sync_up method to format source correctly for azure * additional comment on container_client.exists() * explicitly check None for match * Update sky/cloud_stores.py Co-authored-by: Zhanghao Wu <[email protected]> * [style] import module instead of class or funcion * nit * docstring nit updates * nit * error handle failure to run list blobs API from cloud_stores.py::is_directory() * nit * nit * Add role assignment logic to handle edge case * format * remove redundant get_az_resource_group method from data_utils * asyncio loop lifecycle manage * update constant values * add logs when resource group and storage account is newly created * Update sky/skylet/constants.py Co-authored-by: Zhanghao Wu <[email protected]> * add comment and move return True within the try except block * reverse the order of two decorators for get_client method to allow cache_clear method * revert error handling at _execute_file_mounts * nit * raise error when non existent storage account or container name is provided. * format * add comment for keeping decorator order --------- Co-authored-by: Romil Bhardwaj <[email protected]> Co-authored-by: Tian Xia <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]>
This resolves #1271 adding support for Azure Blob Storage.
Note:
~/.sky/config.yaml
.goofys
has better performance thanblobfuse2
, but I choseblobfuse2
for our implementation because:goofys
does not support mounting public containers.goofys
was 2020 and Azure Blob Storage related issues within the repo doesn't seemed to be actively worked on.blobfuse2
is officially supported by Azure and performance issue is being worked on.Note: mounting public container is not supported by blobfuse2 as well, but there's a workaround: Question regards to mounting public container to my local directory. Azure/azure-storage-fuse#1338
MOUNT
mode with Azure Blob Storage by default gives full permissions(777) to files/directories unlike other object storages(s3, gcs). It is also impossible to change these permissions(chmod). In order to support these features, we need a special storage account with Azure Data Lake Gen2 capabilities, which are more costly. You can compare the pricing between HNS and flat namespace(which is used by default storage account we use).azcopy
is used to fetch from container instead ofaz-cli
. Fetching withaz-cli
is outdated and the code is not in sync withazcopy
which has much better performance. Usingaz-cli
asaz storage blob sync
does use the codes fromazcopy
, but it doesn't support container->local. It only supports for local->container. To useaz-cli
for downloading from the container, we need to useaz storage blob download-batch
, but this does not utilize the codes fromazcopy
, and therefore, multi-threading is not supported. Hence, I useazcopy
directly for container->local. Butazcopy
requires to append SAS tokens in the command(STORAGE_ACCOUNT_KEY
is not supported opposed toaz-cli
).az-cli
supports the feature to generate SAS token, but notazcopy
. Due to this reason, untilazcopy
supports the feature to generate SAS token, we need to installazcopy
andaz-cli
both in the run when Azure Blob Storage is used for non-Azure instances. Azure instances haveaz-cli
andazcopy
installed by default when provisioned.blobfuse2
, and obtaining container client required different configurations each for public containers and private containers.get_client()
atsky/adaptors/azure.py
. Reference: How to determine the given Container url is either public or private Azure/azure-sdk-for-python#35770config.yaml
for Azure blob storage:source:
(This is how other object storages are handled as well) at task yamlconfig.yaml
and we can infer the resource group from it.blobfuse2
, the module we use for mounting, depends onfuse3
. And there are certain distros that does not natively supportsfuse3
. Ubuntu 18.04 is one of them.Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py::TestStorageWithCredentials --azure
pytest tests/test_smoke.py::test_azure_storage_mounts_with_stop --azure
pytest tests/test_smoke.py::test_docker_storage_mounts
sky launch
withaws
,gcp
, andazure
using the following task yamlconfig.yaml
is in use.comp_test.yaml
: