[Storage] Azure blob storage support #3032

Merged

183 commits
d867646
first commit
landscapepainter Jan 25, 2024
4c49ed4
nit
landscapepainter Jan 26, 2024
1f2c584
implement fetching bucket
landscapepainter Jan 26, 2024
1da2c76
update batch sync
landscapepainter Jan 27, 2024
3ab8095
support file name with empty space sync
landscapepainter Jan 27, 2024
d7a0a26
support blobfuse2 mount/container name validate
landscapepainter Jan 28, 2024
0820037
support container deletion
landscapepainter Jan 28, 2024
32ef2ad
support download from container to remote vm
landscapepainter Jan 29, 2024
9126119
complete download from container to remote vm
landscapepainter Jan 30, 2024
52787e8
update mounting tool blobfuse2 download command
landscapepainter Jan 30, 2024
a4ce470
update mounting command
landscapepainter Jan 30, 2024
1dc2be6
_CREDENTIALS_FILES list update
landscapepainter Jan 30, 2024
dd3646d
add smoke test
landscapepainter Jan 30, 2024
0ecd1ad
update storage comment
landscapepainter Jan 30, 2024
02e06b5
update download commands to use account key
landscapepainter Feb 3, 2024
9765849
add account-key for upload
landscapepainter Feb 3, 2024
8200265
nit
landscapepainter Feb 3, 2024
c7ce2d5
nit fix
landscapepainter Feb 3, 2024
a156b56
data_utils fix
landscapepainter Feb 3, 2024
212e003
nit
landscapepainter Feb 3, 2024
1a0ad06
nit
landscapepainter Feb 3, 2024
6e76b7f
add comments
landscapepainter Feb 4, 2024
9975596
nit smoke
landscapepainter Feb 4, 2024
6becf33
implement verify_az_bucket
landscapepainter Feb 6, 2024
6350fdd
smoke test update and nit mounting_utils
landscapepainter Feb 7, 2024
7a1fe65
config schema update
landscapepainter Feb 9, 2024
23977b9
support public container usage
landscapepainter Feb 9, 2024
97fc0eb
nit fix for private bucket test
landscapepainter Feb 9, 2024
c56fa32
update _get_bucket to use from_container_url
landscapepainter Feb 10, 2024
d23dbb0
add _download_file
landscapepainter Feb 10, 2024
c7ed6fb
nit
landscapepainter Feb 10, 2024
a78b0f9
fix mounting blobfuse2 issues
landscapepainter Feb 10, 2024
861b718
nit
landscapepainter Feb 10, 2024
3cba1b6
format
landscapepainter Feb 10, 2024
7ba8b88
nit
landscapepainter Feb 11, 2024
8aaed84
container client fix
landscapepainter Mar 3, 2024
5e85fdc
smoke test update private_bucket
landscapepainter Mar 7, 2024
cb1e6a8
azure get_client update to use exists()
landscapepainter Mar 7, 2024
d036cd8
nit
landscapepainter Mar 7, 2024
95736e4
udpate fetch command for public containers
landscapepainter Mar 12, 2024
69539ed
nit
landscapepainter Mar 12, 2024
8411fff
update fetching command for public containers
landscapepainter Mar 12, 2024
d0d6105
silence client logging when used with public containers
landscapepainter Mar 12, 2024
3bcb2cb
az cli and blobfuse installation update
landscapepainter Apr 2, 2024
b9c21be
update for faster container client fetch
landscapepainter Apr 3, 2024
f1a072e
Handle private container without access
landscapepainter Apr 3, 2024
bc1534d
update private container without access smoke test
landscapepainter Apr 3, 2024
524a47f
Merge branch 'master' into azure-blob-storage
landscapepainter Apr 3, 2024
1c96055
change due to merging master branch
landscapepainter Apr 3, 2024
b355d25
updates from merging master
landscapepainter Apr 3, 2024
9e52672
update mounting smoke test
landscapepainter Apr 3, 2024
46111f7
mounting smoke test update
landscapepainter Apr 3, 2024
e8bc823
remove logger restriction
landscapepainter Apr 3, 2024
47b6909
update comments
landscapepainter Apr 3, 2024
f23d806
update verify_az_bucket to use for both private and public
landscapepainter Apr 3, 2024
1570022
update comments and formatting
landscapepainter Apr 3, 2024
2aa1e1f
update delete_az_bucket
landscapepainter Apr 3, 2024
85e323a
az cli installation versioning
landscapepainter Apr 4, 2024
e84c46f
update logging silence logic for get_client
landscapepainter Apr 4, 2024
bac7319
support azcopy for fetching
landscapepainter Apr 17, 2024
6e612c9
update sas token generation with az-cli
landscapepainter Apr 18, 2024
b67e87c
propagation hold
landscapepainter Apr 20, 2024
c994cfc
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
romilbhardwaj May 1, 2024
2f727ac
merge fix
romilbhardwaj May 1, 2024
4c5938a
add support to assign role to access storage account
landscapepainter May 20, 2024
0f810e0
Merge branch 'azure-blob-storage' of https://github.com/landscapepain…
landscapepainter May 20, 2024
881b0d2
nit
landscapepainter May 20, 2024
34a684c
silence logging from httpx request to get object_id
landscapepainter May 21, 2024
777cd8a
checks existance of storage account and resource group before creation
landscapepainter May 22, 2024
a607874
create storage account for different regions
landscapepainter May 23, 2024
eaf3058
fix source name when translating local file mounts for spot sync
landscapepainter May 25, 2024
7494600
smoke test update for storage account names
landscapepainter May 26, 2024
696b4c6
removing az-cli installation from cloud_stores.py
landscapepainter May 26, 2024
8ec5c08
nit
landscapepainter May 26, 2024
054efae
update sas token generation to use python sdk
landscapepainter May 26, 2024
f856063
nit
landscapepainter May 26, 2024
1612af4
Update sky/data/storage.py
landscapepainter May 26, 2024
70ae2b3
move sas token generating functions from data_utils to adaptors.azure
landscapepainter May 27, 2024
d63fc7a
use constant string format to obtain container url
landscapepainter May 27, 2024
be1cfaa
nit
landscapepainter May 28, 2024
5a1dd87
add comment for '/' and azcopy syntax
landscapepainter May 31, 2024
eb933df
refactor AzureBlobCloudStorage methods
landscapepainter May 31, 2024
8e94496
nit
landscapepainter May 31, 2024
fa113d0
Merge branch 'master' into azure-blob-storage
landscapepainter Jun 1, 2024
5df34de
format
landscapepainter Jun 1, 2024
de3d8a8
nit
landscapepainter Jun 7, 2024
806320c
update test storage mount yaml j2
landscapepainter Jun 7, 2024
63dab29
added rich status message for storage account and resource group crea…
landscapepainter Jun 7, 2024
b9dfa47
update rich status message when creating storage account and resource…
landscapepainter Jun 8, 2024
59d30fb
nit
landscapepainter Jun 8, 2024
119e7e5
Error handle for when storage account creation did not yet propagate …
landscapepainter Jun 8, 2024
ecbab0e
comment update
landscapepainter Jun 8, 2024
6640c15
merge error output into exception message
landscapepainter Jun 8, 2024
fb61be4
error comment
landscapepainter Jun 8, 2024
144a60d
additional error handling when creating storage account
landscapepainter Jun 8, 2024
56f8953
nit
landscapepainter Jun 8, 2024
408c240
update to use raw container url endpoint instead of 'az://'
landscapepainter Jun 14, 2024
720fe38
update config.yaml interface
landscapepainter Jun 14, 2024
80e1bce
remove resource group existance check
landscapepainter Jun 15, 2024
c923575
add more comments for az mount command
landscapepainter Jun 15, 2024
e86e24d
nit
landscapepainter Jun 15, 2024
a1940a2
add more exception handling for storage account initialization
landscapepainter Jun 15, 2024
4e5442f
Remove lru cache decorator from sas token generating functions
landscapepainter Jun 15, 2024
a877d7c
nit
landscapepainter Jun 15, 2024
508b8e1
nit
landscapepainter Jun 16, 2024
d9b70d4
Revert back to check if the resource group exists before running comm…
landscapepainter Jun 16, 2024
3b771dc
refactor function to obtain resource group and storage account
landscapepainter Jun 16, 2024
113cb12
nit
landscapepainter Jun 16, 2024
401796b
add support for storage account under AzureBlobStoreMetadata
landscapepainter Jun 18, 2024
fe83261
set default file permission to be 755 for mounting
landscapepainter Jun 18, 2024
613618e
Update sky/adaptors/azure.py
landscapepainter Jun 18, 2024
f30291f
nit
landscapepainter Jun 18, 2024
dacf597
Merge branch 'azure-blob-storage' of https://github.com/landscapepain…
landscapepainter Jun 18, 2024
6dd9d31
nit fixes
landscapepainter Jun 19, 2024
e5cc383
format and update error handling
landscapepainter Jun 19, 2024
892b504
nit fixes
landscapepainter Jun 19, 2024
aea8316
set default storage account and resource group name as string constant
landscapepainter Jun 19, 2024
3e8c96e
update error handle.
landscapepainter Jun 19, 2024
d58a10b
additional error handle for else branch
landscapepainter Jun 19, 2024
078ee52
Additional error handling
landscapepainter Jun 19, 2024
9512449
nit
landscapepainter Jun 19, 2024
01b48a4
update get_az_storage_account_key to replace try-exception with if st…
landscapepainter Jun 20, 2024
2bd5ec7
nit
landscapepainter Jun 20, 2024
0b16763
nit
landscapepainter Jun 20, 2024
7ec0f68
nit
landscapepainter Jun 20, 2024
7b3010c
format
landscapepainter Jun 20, 2024
78a2533
update public container example as not accessible anymore
landscapepainter Jun 26, 2024
7f6ad76
Merge branch 'master' into azure-blob-storage
landscapepainter Jun 26, 2024
cd33c18
nit
landscapepainter Jun 26, 2024
13f90b4
file_bucket_name update
landscapepainter Jun 26, 2024
6a8db72
add StoreType method to retrieve bucket endpoint url
landscapepainter Jun 26, 2024
5c67690
format
landscapepainter Jun 26, 2024
95fa03d
add azure storage blob dependency installation for controller
landscapepainter Jun 26, 2024
b864f42
fix fetching methods
landscapepainter Jun 27, 2024
08adcde
nit
landscapepainter Jun 27, 2024
493e300
additional docstr for _get_storage_account_and_resource_group
landscapepainter Jun 27, 2024
75d0cda
nit
landscapepainter Jun 27, 2024
89645ed
update blobfuse2 cache directory
landscapepainter Jun 27, 2024
cb606ef
format
landscapepainter Jun 27, 2024
98fcd5f
refactor get_storage_account_key method
landscapepainter Jun 27, 2024
1a15411
update docker storage mounts smoke test
landscapepainter Jun 28, 2024
f494725
sleep for storage account creation to propagate
landscapepainter Jun 30, 2024
6d0e128
handle externally removed storage account being fetched
landscapepainter Jul 1, 2024
f8ecddc
format
landscapepainter Jul 1, 2024
e52f797
Merge branch 'master' into azure-blob-storage
landscapepainter Jul 1, 2024
a6691ed
nit
landscapepainter Jul 1, 2024
41d1000
add logic to retry for role assignment
landscapepainter Jul 3, 2024
0cd9d73
add comment to _create_storage_account method
landscapepainter Jul 3, 2024
33fff63
additional error handling for role assignment
landscapepainter Jul 3, 2024
78c5fd3
format
landscapepainter Jul 3, 2024
db9fa49
nit
landscapepainter Jul 3, 2024
3b89bff
Update sky/adaptors/azure.py
landscapepainter Jul 3, 2024
9f48823
additional installation check for azure blob storage dependencies
landscapepainter Jul 3, 2024
f233cb2
format
landscapepainter Jul 3, 2024
5fe2d60
Merge branch 'master' into azure-blob-storage
landscapepainter Jul 3, 2024
b5e2cc5
update step 7 from maybe_translate_local_file_mounts_and_sync_up meth…
landscapepainter Jul 3, 2024
f87de9d
additional comment on container_client.exists()
landscapepainter Jul 4, 2024
b4421d0
explicitly check None for match
landscapepainter Jul 4, 2024
6b12bff
Update sky/cloud_stores.py
landscapepainter Jul 4, 2024
5909311
[style] import module instead of class or funcion
landscapepainter Jul 4, 2024
4c9ac44
Merge branch 'azure-blob-storage' of https://github.com/landscapepain…
landscapepainter Jul 4, 2024
bfee828
nit
landscapepainter Jul 4, 2024
691582e
docstring nit updates
landscapepainter Jul 7, 2024
64c770a
nit
landscapepainter Jul 8, 2024
908acb4
error handle failure to run list blobs API from cloud_stores.py::is_d…
landscapepainter Jul 9, 2024
c320575
nit
landscapepainter Jul 9, 2024
fb3d48e
nit
landscapepainter Jul 9, 2024
076ef01
Add role assignment logic to handle edge case
landscapepainter Jul 12, 2024
3e75bdf
format
landscapepainter Jul 12, 2024
e1c56f0
Merge branch 'master' into azure-blob-storage
landscapepainter Jul 12, 2024
f711911
remove redundant get_az_resource_group method from data_utils
landscapepainter Jul 12, 2024
d5c4e1b
asyncio loop lifecycle manage
landscapepainter Jul 12, 2024
7fefe7b
update constant values
landscapepainter Jul 12, 2024
b93adb1
add logs when resource group and storage account is newly created
landscapepainter Jul 12, 2024
3f81d7c
Merge branch 'master' into azure-blob-storage
landscapepainter Jul 16, 2024
3c05ff7
Update sky/skylet/constants.py
landscapepainter Jul 16, 2024
ccecf24
add comment and move return True within the try except block
landscapepainter Jul 16, 2024
5b1689f
reverse the order of two decorators for get_client method to allow ca…
landscapepainter Jul 16, 2024
f40604b
revert error handling at _execute_file_mounts
landscapepainter Jul 16, 2024
642a258
nit
landscapepainter Jul 16, 2024
e212ea2
raise error when non existent storage account or container name is pr…
landscapepainter Jul 16, 2024
beba84a
format
landscapepainter Jul 16, 2024
55a5e72
add comment for keeping decorator order
landscapepainter Jul 17, 2024
350 changes: 275 additions & 75 deletions sky/adaptors/azure.py

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions sky/adaptors/common.py
@@ -60,6 +60,10 @@ def wrapper(*args, **kwargs):
m.load_module()
return func(*args, **kwargs)

# Preserve lru_cache methods
if hasattr(func, 'cache_clear'):
wrapper.cache_clear = func.cache_clear

return wrapper

return decorator
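The `cache_clear` preservation above matters because `functools.wraps` does not copy the cache-management methods of an `lru_cache`-wrapped function onto a new wrapper. A minimal standalone sketch of the pattern (names here are illustrative, not SkyPilot's actual API):

```python
import functools


def lazy_load(load_module):
    """Decorator that ensures a module is loaded before each call.

    In practice load_module() is idempotent, so repeated calls are cheap.
    """

    def decorator(func):

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            load_module()  # e.g. import a heavy SDK on first use
            return func(*args, **kwargs)

        # Preserve lru_cache methods, as in the diff above:
        # functools.wraps does not copy them automatically.
        if hasattr(func, 'cache_clear'):
            wrapper.cache_clear = func.cache_clear

        return wrapper

    return decorator


loaded = []


# lru_cache is the inner decorator, matching the PR's decorator ordering.
@lazy_load(lambda: loaded.append('azure-sdk'))
@functools.lru_cache(maxsize=None)
def get_client(name):
    return f'client-{name}'


get_client('a')
get_client.cache_clear()  # works because cache_clear was re-attached
```

This also explains the later commit "reverse the order of two decorators for get_client method": `lru_cache` must sit closest to the function so the lazy-import wrapper can still expose its cache methods.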
36 changes: 24 additions & 12 deletions sky/backends/cloud_vm_ray_backend.py
@@ -4508,23 +4508,35 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle,
continue

storage = cloud_stores.get_storage_from_path(src)
if storage.is_directory(src):
sync = storage.make_sync_dir_command(source=src,
destination=wrapped_dst)
# It is a directory so make sure it exists.
mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}'
else:
sync = storage.make_sync_file_command(source=src,
destination=wrapped_dst)
# It is a file so make sure *its parent dir* exists.
mkdir_for_wrapped_dst = (
f'mkdir -p {os.path.dirname(wrapped_dst)}')
try:
if storage.is_directory(src):
sync_cmd = (storage.make_sync_dir_command(
source=src, destination=wrapped_dst))
# It is a directory so make sure it exists.
mkdir_for_wrapped_dst = f'mkdir -p {wrapped_dst}'
else:
sync_cmd = (storage.make_sync_file_command(
source=src, destination=wrapped_dst))
# It is a file so make sure *its parent dir* exists.
mkdir_for_wrapped_dst = (
f'mkdir -p {os.path.dirname(wrapped_dst)}')
except Exception as e: # pylint: disable=broad-except
logger.error(
f'Failed to fetch from the bucket {src!r} to '
f'remote instance at {dst!r}.\n'
'Error details: '
f'{common_utils.format_exception(e, use_bracket=True)}.')
# If 'cmd' was appended to 'symlink_commands' for this sync, we
# remove as it failed to sync.
if not dst.startswith('~/') and not dst.startswith('/tmp/'):
symlink_commands.pop()
continue
Collaborator:

With this, it seems we skip the failed buckets that do not exist, which misaligns with our previous behavior for other cloud buckets. We should probably not skip all errors here.

file_mounts:
  /dst: s3://some-non-exist-bucket-zhwu

Also, it seems for an Azure blob that does not exist, unlike the other clouds, it raises an error for the sync command execution below. Can we get it to fail here when we call list_blobs?

file_mounts:
    /test: https://skyeastusa37461fd.blob.core.windows.net/some-non-exist-bucket-zhwu
I 07-15 22:12:46 backend_utils.py:1336] Syncing (to 1 node): https://skyeastusa37461fd.blob.core.windows.net/some-non-exist-bucket-zhwu -> /test
E 07-15 22:12:47 subprocess_utils.py:84] INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
E 07-15 22:12:47 subprocess_utils.py:84] 
E 07-15 22:12:47 subprocess_utils.py:84] Job 0c4734a3-9317-ad47-5ecc-0978846fbf69 has started
E 07-15 22:12:47 subprocess_utils.py:84] Log file is located at: /home/azureuser/.azcopy/0c4734a3-9317-ad47-5ecc-0978846fbf69.log
E 07-15 22:12:47 subprocess_utils.py:84] 
E 07-15 22:12:47 subprocess_utils.py:84] 
E 07-15 22:12:47 subprocess_utils.py:84] Cannot perform sync due to error: cannot list files due to reason -> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /home/vsts/go/pkg/mod/github.com/!azure/[email protected]/azblob/zc_storage_error.go:42
E 07-15 22:12:47 subprocess_utils.py:84] ===== RESPONSE ERROR (ServiceCode=ContainerNotFound) =====
E 07-15 22:12:47 subprocess_utils.py:84] Description=The specified container does not exist.
E 07-15 22:12:47 subprocess_utils.py:84] RequestId:8316c808-001e-005c-0b04-d71755000000
E 07-15 22:12:47 subprocess_utils.py:84] Time:2024-07-15T22:12:47.2300060Z, Details: 
E 07-15 22:12:47 subprocess_utils.py:84]    Code: ContainerNotFound
E 07-15 22:12:47 subprocess_utils.py:84]    GET https://skyeastusa37461fd.blob.core.windows.net/some-non-exist-bucket-zhwu?comp=list&delimiter=%2F&include=metadata&restype=container&se=2024-07-15t23%3A12%3A46z&sig=-REDACTED-&sp=rcwl&sr=c&sv=2024-05-04&timeout=901
E 07-15 22:12:47 subprocess_utils.py:84]    User-Agent: [AzCopy/10.17.0 Azure-Storage/0.15 (go1.19.2; linux)]
E 07-15 22:12:47 subprocess_utils.py:84]    X-Ms-Client-Request-Id: [79bd1b04-f229-46c8-5053-44fb7d1bc8c5]
E 07-15 22:12:47 subprocess_utils.py:84]    X-Ms-Version: [2020-10-02]
E 07-15 22:12:47 subprocess_utils.py:84]    --------------------------------------------------------------------------------
E 07-15 22:12:47 subprocess_utils.py:84]    RESPONSE Status: 404 The specified container does not exist.
E 07-15 22:12:47 subprocess_utils.py:84]    Content-Length: [225]
E 07-15 22:12:47 subprocess_utils.py:84]    Content-Type: [application/xml]
E 07-15 22:12:47 subprocess_utils.py:84]    Date: [Mon, 15 Jul 2024 22:12:47 GMT]
E 07-15 22:12:47 subprocess_utils.py:84]    Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]
E 07-15 22:12:47 subprocess_utils.py:84]    X-Ms-Client-Request-Id: [79bd1b04-f229-46c8-5053-44fb7d1bc8c5]
E 07-15 22:12:47 subprocess_utils.py:84]    X-Ms-Error-Code: [ContainerNotFound]
E 07-15 22:12:47 subprocess_utils.py:84]    X-Ms-Request-Id: [8316c808-001e-005c-0b04-d71755000000]
E 07-15 22:12:47 subprocess_utils.py:84]    X-Ms-Version: [2020-10-02]
E 07-15 22:12:47 subprocess_utils.py:84] 
E 07-15 22:12:47 subprocess_utils.py:84] 
E 07-15 22:12:47 subprocess_utils.py:84] 
E 07-15 22:12:47 subprocess_utils.py:84] 

Collaborator (Author):

The first issue you mentioned is resolved at f40604b. And the second issue is resolved at e212ea2.


download_target_commands = [
# Ensure sync can write to wrapped_dst (e.g., '/data/').
mkdir_for_wrapped_dst,
# Both the wrapped and the symlink dir exist; sync.
sync,
sync_cmd,
]
command = ' && '.join(download_target_commands)
# dst is only used for message printing.
90 changes: 76 additions & 14 deletions sky/cloud_stores.py
@@ -9,15 +9,20 @@
"""
import shlex
import subprocess
import time
import urllib.parse

from sky import sky_logging
from sky.adaptors import aws
from sky.adaptors import azure
from sky.adaptors import cloudflare
from sky.adaptors import ibm
from sky.clouds import gcp
from sky.data import data_utils
from sky.data.data_utils import Rclone
from sky.skylet import constants

logger = sky_logging.init_logger(__name__)


class CloudStorage:
@@ -189,30 +194,84 @@ def is_directory(self, url: str) -> bool:

In cloud object stores, a "directory" refers to a regular object whose
name is a prefix of other objects.

Args:
url: Endpoint url of the container/blob.

Returns:
True if the url is an endpoint of a directory, and False if it
is a blob (file).

Raises:
azure.core.exceptions.HttpResponseError: If the user's Azure
account does not have a sufficient IAM role for the given
storage account.
"""
# split the url using split_az_path
storage_account_name, container_name, path = data_utils.split_az_path(
url)

# If there aren't more than just container name and storage account,
# that's a directory.
if not path:
return True

# If there are more, we need to check if it is a directory or a file.
container_url = data_utils.AZURE_CONTAINER_URL.format(
storage_account_name=storage_account_name,
container_name=container_name)
# If there's more, we'd need to check if it's a directory or a file.
container_client = data_utils.create_az_client(
client_type='container', container_url=container_url)
num_objects = 0
for blob in container_client.list_blobs(name_starts_with=path):
if blob.name == path:
return False
num_objects += 1
if num_objects > 1:
return True

# A directory with few or no items
return True
resource_group_name = azure.get_az_resource_group(storage_account_name)
role_assignment_start = time.time()
refresh_client = False
role_assigned = False

while (time.time() - role_assignment_start <
constants.WAIT_FOR_STORAGE_ACCOUNT_ROLE_ASSIGNMENT):
container_client = data_utils.create_az_client(
client_type='container',
container_url=container_url,
storage_account_name=storage_account_name,
resource_group_name=resource_group_name,
refresh_client=refresh_client)

num_objects = 0
try:
for blob in container_client.list_blobs(name_starts_with=path):
if blob.name == path:
return False
num_objects += 1
if num_objects > 1:
return True
except azure.exceptions().HttpResponseError as e:
# Handle case where user lacks sufficient IAM role for
# a private container in the same subscription. Attempt to
# assign appropriate role to current user.
if 'AuthorizationPermissionMismatch' in str(e):
if not role_assigned:
logger.info('Failed to list blobs in container '
f'{container_url!r}. This implies '
'insufficient IAM role for storage account'
f' {storage_account_name!r}.')
azure.assign_storage_account_iam_role(
storage_account_name=storage_account_name,
resource_group_name=resource_group_name)
role_assigned = True
refresh_client = True
else:
logger.info(
'Waiting due to the propagation delay of IAM '
'role assignment to the storage account '
f'{storage_account_name!r}.')
time.sleep(
constants.RETRY_INTERVAL_AFTER_ROLE_ASSIGNMENT)
continue
raise
# A directory with few or no items
return True
Collaborator:

Style: keep the try block small

Suggested change
try:
for blob in container_client.list_blobs(name_starts_with=path):
if blob.name == path:
return False
num_objects += 1
if num_objects > 1:
return True
except azure.exceptions().HttpResponseError as e:
# Handle case where user lacks sufficient IAM role for
# a private container in the same subscription. Attempt to
# assign appropriate role to current user.
if 'AuthorizationPermissionMismatch' in str(e):
if not role_assigned:
logger.info('Failed to list blobs in container '
f'{container_url!r}. This implies '
'insufficient IAM role for storage account'
f' {storage_account_name!r}.')
azure.assign_storage_account_iam_role(
storage_account_name=storage_account_name,
resource_group_name=resource_group_name)
role_assigned = True
refresh_client = True
else:
logger.info(
'Waiting due to the propagation delay of IAM '
'role assignment to the storage account '
f'{storage_account_name!r}.')
time.sleep(
constants.RETRY_INTERVAL_AFTER_ROLE_ASSIGNMENT)
continue
raise
# A directory with few or no items
return True
try:
blobs = container_client.list_blobs(name_starts_with=path)
except azure.exceptions().HttpResponseError as e:
# Handle case where user lacks sufficient IAM role for
# a private container in the same subscription. Attempt to
# assign appropriate role to current user.
if 'AuthorizationPermissionMismatch' in str(e):
if not role_assigned:
logger.info('Failed to list blobs in container '
f'{container_url!r}. This implies '
'insufficient IAM role for storage account'
f' {storage_account_name!r}.')
azure.assign_storage_account_iam_role(
storage_account_name=storage_account_name,
resource_group_name=resource_group_name)
role_assigned = True
refresh_client = True
else:
logger.info(
'Waiting due to the propagation delay of IAM '
'role assignment to the storage account '
f'{storage_account_name!r}.')
time.sleep(
constants.RETRY_INTERVAL_AFTER_ROLE_ASSIGNMENT)
continue
raise
for blob in blobs:
if blob.name == path:
return False
num_objects += 1
if num_objects > 1:
return True
# A directory with few or no items
return True

Collaborator (Author):

This is what I tried at first, but the error is raised when we iterate through the object within the for-loop (for blob in blobs:) rather than when the object is obtained with blobs = container_client.list_blobs(name_starts_with=path). So the suggested code won't be able to catch the error.

else:
raise TimeoutError(
'Failed to determine the container path status within '
f'{constants.WAIT_FOR_STORAGE_ACCOUNT_ROLE_ASSIGNMENT}'
' seconds.')

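The author's point about why the try block must wrap the for-loop can be demonstrated generically: Azure's list_blobs returns a lazy paged iterable, so the HTTP request (and any authorization error) only fires when you start iterating, not when the method is called. A minimal stand-in, with the real ContainerClient replaced by a hypothetical generator:

```python
def list_blobs(name_starts_with):
    """Stand-in for ContainerClient.list_blobs: returns a lazy
    iterator, so the (simulated) HTTP request only runs on iteration."""

    def _pages():
        # Simulates the auth failure the PR handles for private containers.
        raise PermissionError('AuthorizationPermissionMismatch')
        yield  # never reached; makes _pages a generator function

    return _pages()


blobs = list_blobs('some/path')  # no error raised here...
caught = False
try:
    next(blobs)  # ...the error only surfaces on iteration
except PermissionError as e:
    caught = 'AuthorizationPermissionMismatch' in str(e)

# This is why the try block in the diff wraps the for-loop itself
# rather than only the list_blobs() call.
```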
def _get_azcopy_source(self, source: str, is_dir: bool) -> str:
"""Converts the source so it can be used as an argument for azcopy."""
@@ -408,5 +467,8 @@ def get_storage_from_path(url: str) -> CloudStorage:
's3': S3CloudStorage(),
'r2': R2CloudStorage(),
'cos': IBMCosCloudStorage(),
# TODO: This is a hack, as Azure URL starts with https://, we should
# refactor the registry to be able to take regex, so that Azure blob can
# be identified with `https://(.*?)\.blob\.core\.windows\.net`
'https': AzureBlobCloudStorage()
}
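The TODO above proposes replacing the scheme-keyed registry with a regex-keyed one, since Azure blob URLs share the generic `https` scheme. A hypothetical sketch of that refactor (class names stand in for the real CloudStorage subclasses):

```python
import re

# Hypothetical regex-keyed registry: match the full URL against each
# pattern instead of keying only on the scheme prefix.
_REGISTRY = [
    (re.compile(r'^s3://'), 'S3CloudStorage'),
    (re.compile(r'^gs://'), 'GcsCloudStorage'),
    (re.compile(r'^https://.*?\.blob\.core\.windows\.net'),
     'AzureBlobCloudStorage'),
]


def get_storage_class(url: str) -> str:
    """Return the handler name for the first pattern matching the URL."""
    for pattern, storage_cls in _REGISTRY:
        if pattern.search(url):
            return storage_cls
    raise ValueError(f'No cloud storage handler for URL: {url}')
```

This would let Azure blob URLs be identified precisely instead of claiming every `https://` URL for AzureBlobCloudStorage.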
2 changes: 1 addition & 1 deletion sky/clouds/azure.py
@@ -481,7 +481,7 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]:
# Check if the azure blob storage dependencies are installed.
try:
# pylint: disable=redefined-outer-name, import-outside-toplevel, unused-import
import azure.storage.blob
from azure.storage import blob
import msgraph
except ImportError as e:
return False, (
107 changes: 49 additions & 58 deletions sky/data/data_utils.py
@@ -63,20 +63,21 @@ def split_az_path(az_path: str) -> Tuple[str, str, str]:
"""Splits Path into Storage account and Container names and Relative Path

Args:
az_path: str; Container Path,
az_path: Container Path,
e.g. https://azureopendatastorage.blob.core.windows.net/nyctlc

Returns:
str; Name of the storage account
str; Name of the container
str; Paths of the file/directory defined within the container
str: Name of the storage account
str: Name of the container
str: Paths of the file/directory defined within the container
"""
path_parts = az_path.replace('https://', '').split('/')
service_endpoint = path_parts.pop(0)
service_endpoint_parts = service_endpoint.split('.')
storage_account_name = service_endpoint_parts[0]
container_name = path_parts.pop(0)
Collaborator:

After manually overriding the region of the storage account, I was able to get the spot controller to launch. However, launching the task on it failed with these logs:

(base) ➜  ~  sky spot logs --controller 1
D 05-01 13:34:46 skypilot_config.py:144] Using config path: /Users/romilb/.sky/config.yaml
D 05-01 13:34:46 skypilot_config.py:148] Config loaded:
D 05-01 13:34:46 skypilot_config.py:148] {'serve': {'controller': {'resources': {'cloud': 'kubernetes'}}},
D 05-01 13:34:46 skypilot_config.py:148]  'spot': {'controller': {'resources': {'cloud': 'azure'}}}}
D 05-01 13:34:46 skypilot_config.py:158] Config syntax check passed.
D 05-01 13:34:48 azure.py:645] az vm list --query "[?tags.\"ray-cluster-name\"=='sky-spot-controller-2ea485ea-2ea4'].id" -o json returned 0.
D 05-01 13:34:48 azure.py:645] **** STDOUT ****
D 05-01 13:34:48 azure.py:645] [
D 05-01 13:34:48 azure.py:645]   "/subscriptions/aa86df77-e703-453e-b2f4-955c3b33e534/resourceGroups/SKY-SPOT-CONTROLLER-2EA485EA-2EA4-EASTUS/providers/Microsoft.Compute/virtualMachines/ray-sky-spot-controller-2ea485ea-2ea4-head-6df2-c2120"
D 05-01 13:34:48 azure.py:645] ]
D 05-01 13:34:48 azure.py:645]
D 05-01 13:34:48 azure.py:645] **** STDERR ****
D 05-01 13:34:48 azure.py:645]
D 05-01 13:34:49 azure.py:665] az vm show -d --ids /subscriptions/aa86df77-e703-453e-b2f4-955c3b33e534/resourceGroups/SKY-SPOT-CONTROLLER-2EA485EA-2EA4-EASTUS/providers/Microsoft.Compute/virtualMachines/ray-sky-spot-controller-2ea485ea-2ea4-head-6df2-c2120 --query "powerState" -o json returned 0.
D 05-01 13:34:49 azure.py:665] **** STDOUT ****
D 05-01 13:34:49 azure.py:665] "VM running"
D 05-01 13:34:49 azure.py:665]
D 05-01 13:34:49 azure.py:665] **** STDERR ****
D 05-01 13:34:49 azure.py:665]
Tailing logs of job 1 on cluster 'sky-spot-controller-2ea485ea'...
I 05-01 20:34:53 log_lib.py:407] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.247.0.4']
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:08 skypilot_config.py:144] Using config path: /home/azureuser/.sky/spot_tasks/sky-2a8c-romilb-779f.config_yaml
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:08 skypilot_config.py:148] Config loaded:
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:08 skypilot_config.py:148] {'serve': {'controller': {'resources': {'cloud': 'kubernetes'}}},
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:08 skypilot_config.py:148]  'spot': {'controller': {'resources': {'cloud': 'azure'}}}}
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:08 skypilot_config.py:158] Config syntax check passed.
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:09 skypilot_config.py:144] Using config path: /home/azureuser/.sky/spot_tasks/sky-2a8c-romilb-779f.config_yaml
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:09 skypilot_config.py:148] Config loaded:
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:09 skypilot_config.py:148] {'serve': {'controller': {'resources': {'cloud': 'kubernetes'}}},
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:09 skypilot_config.py:148]  'spot': {'controller': {'resources': {'cloud': 'azure'}}}}
(sky-2a8c-romilb, pid=8641) D 05-01 20:30:09 skypilot_config.py:158] Config syntax check passed.
(sky-2a8c-romilb, pid=8641) Process Process-1:
(sky-2a8c-romilb, pid=8641) Traceback (most recent call last):
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(sky-2a8c-romilb, pid=8641)     self.run()
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
(sky-2a8c-romilb, pid=8641)     self._target(*self._args, **self._kwargs)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 387, in _run_controller
(sky-2a8c-romilb, pid=8641)     spot_controller = SpotController(job_id, dag_yaml, retry_until_up)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 52, in __init__
(sky-2a8c-romilb, pid=8641)     self._dag, self._dag_name = _get_dag_and_name(dag_yaml)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 40, in _get_dag_and_name
(sky-2a8c-romilb, pid=8641)     dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 104, in load_chain_dag_from_yaml
(sky-2a8c-romilb, pid=8641)     task = task_lib.Task.from_yaml_config(task_config, env_overrides)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/task.py", line 419, in from_yaml_config
(sky-2a8c-romilb, pid=8641)     storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/storage.py", line 1004, in from_yaml_config
(sky-2a8c-romilb, pid=8641)     storage_obj = cls(name=name,
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/storage.py", line 492, in __init__
(sky-2a8c-romilb, pid=8641)     self._validate_storage_spec(name)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/storage.py", line 715, in _validate_storage_spec
(sky-2a8c-romilb, pid=8641)     name = data_utils.split_az_path(source)[0]
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/data_utils.py", line 67, in split_az_path
(sky-2a8c-romilb, pid=8641)     container_name = path_parts.pop(0)
(sky-2a8c-romilb, pid=8641) IndexError: pop from empty list
(sky-2a8c-romilb, pid=8641) I 05-01 20:30:10 controller.py:473] Killing controller process 9375.
(sky-2a8c-romilb, pid=8641) I 05-01 20:30:10 controller.py:481] Controller process 9375 killed.
(sky-2a8c-romilb, pid=8641) I 05-01 20:30:10 controller.py:483] Cleaning up any spot cluster for job 1.
(sky-2a8c-romilb, pid=8641) Traceback (most recent call last):
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(sky-2a8c-romilb, pid=8641)     return _run_code(code, main_globals, None,
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
(sky-2a8c-romilb, pid=8641)     exec(code, run_globals)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 532, in <module>
(sky-2a8c-romilb, pid=8641)     start(args.job_id, args.dag_yaml, args.retry_until_up)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 491, in start
(sky-2a8c-romilb, pid=8641)     _cleanup(job_id, dag_yaml=dag_yaml)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 431, in _cleanup
(sky-2a8c-romilb, pid=8641)     dag, _ = _get_dag_and_name(dag_yaml)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/spot/controller.py", line 40, in _get_dag_and_name
(sky-2a8c-romilb, pid=8641)     dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 104, in load_chain_dag_from_yaml
(sky-2a8c-romilb, pid=8641)     task = task_lib.Task.from_yaml_config(task_config, env_overrides)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/task.py", line 419, in from_yaml_config
(sky-2a8c-romilb, pid=8641)     storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/storage.py", line 1004, in from_yaml_config
(sky-2a8c-romilb, pid=8641)     storage_obj = cls(name=name,
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/storage.py", line 492, in __init__
(sky-2a8c-romilb, pid=8641)     self._validate_storage_spec(name)
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/storage.py", line 715, in _validate_storage_spec
(sky-2a8c-romilb, pid=8641)     name = data_utils.split_az_path(source)[0]
(sky-2a8c-romilb, pid=8641)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/sky/data/data_utils.py", line 67, in split_az_path
(sky-2a8c-romilb, pid=8641)     container_name = path_parts.pop(0)
(sky-2a8c-romilb, pid=8641) IndexError: pop from empty list

@landscapepainter landscapepainter May 25, 2024
This is resolved with eaf3058
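The IndexError above comes from `split_az_path` assuming the URL always carries a container segment. A defensive sketch of that parsing step (a hypothetical standalone version for illustration, not the exact eaf3058 patch):

```python
from urllib.parse import urlparse


def split_az_path(az_path: str):
    """Splits an Azure blob URL into (storage_account, container, path).

    Raises ValueError when the container segment is missing, instead of
    crashing with 'IndexError: pop from empty list'.
    """
    parsed = urlparse(az_path)
    suffix = '.blob.core.windows.net'
    if not parsed.netloc.endswith(suffix):
        raise ValueError(f'Not an Azure blob endpoint: {az_path!r}')
    storage_account = parsed.netloc[:-len(suffix)]
    # parsed.path looks like '/{container}/{path...}'; strip the
    # leading slash before splitting off the container name.
    parts = parsed.path.lstrip('/').split('/', 1)
    if not parts[0]:
        raise ValueError(f'Missing container name in {az_path!r}')
    container = parts[0]
    path = parts[1] if len(parts) > 1 else ''
    return storage_account, container, path
```

Validating up front turns the crash into an actionable error message at spec-validation time.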

path = '/'.join(path_parts)

return storage_account_name, container_name, path


@@ -157,22 +158,37 @@ def verify_gcs_bucket(name: str) -> bool:
return False


def create_az_client(client_type: str, **kwargs) -> Client:
def create_az_client(client_type: str, **kwargs: Any) -> Client:
"""Helper that creates an Azure client for the requested resource type.

Args:
client_type: str; Type of client to create, e.g. 'storage', 'resource', 'container'

Returns:
Client object facing AZ Resource of the 'client_type'.
Client object facing AZ Resource of the 'client_type'.
"""
resource_group_name = kwargs.pop('resource_group_name', None)
container_url = kwargs.pop('container_url', None)
storage_account_name = kwargs.pop('storage_account_name', None)
refresh_client = kwargs.pop('refresh_client', False)
if client_type == 'container':
assert container_url is not None
# We do not assert on resource_group_name as it is set to None when the
# container_url is for a public container with user access.
assert container_url is not None, ('container_url must be provided for '
'container client')
assert storage_account_name is not None, ('storage_account_name must '
'be provided for container '
'client')

subscription_id = azure.get_subscription_id()
return azure.get_client(client_type,
subscription_id,
container_url=container_url)
if refresh_client:
azure.get_client.cache_clear()
client = azure.get_client(client_type,
subscription_id,
container_url=container_url,
storage_account_name=storage_account_name,
resource_group_name=resource_group_name)
return client
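The `refresh_client` path relies on `azure.get_client` being memoized, so clearing the cache forces fresh clients on the next call. A minimal sketch of that caching pattern (a toy factory for illustration, not SkyPilot's actual azure adaptor):

```python
import functools


@functools.lru_cache(maxsize=None)
def get_client(client_type: str, subscription_id: str):
    # A real factory would construct an Azure SDK client here; an
    # opaque object makes the caching behavior observable.
    return (client_type, subscription_id, object())


def create_client(client_type: str,
                  subscription_id: str,
                  refresh_client: bool = False):
    if refresh_client:
        # Drop every memoized client so stale credentials are re-read.
        get_client.cache_clear()
    return get_client(client_type, subscription_id)
```

Repeated calls with the same arguments return the cached object; passing `refresh_client=True` invalidates all cached clients at once, which is the coarse-grained behavior `functools.lru_cache.cache_clear()` provides.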


def verify_az_bucket(storage_account_name: str, container_name: str) -> bool:
@@ -183,46 +199,20 @@ def verify_az_bucket(storage_account_name: str, container_name: str) -> bool:
container_name: str; Name of the container

Returns:
boolean; Shows either or not the container exists.
True if the container exists, False otherwise.
"""
container_url = AZURE_CONTAINER_URL.format(
storage_account_name=storage_account_name,
container_name=container_name)
container_client = create_az_client(client_type='container',
container_url=container_url)
resource_group_name = azure.get_az_resource_group(storage_account_name)
container_client = create_az_client(
client_type='container',
container_url=container_url,
storage_account_name=storage_account_name,
resource_group_name=resource_group_name)
return container_client.exists()
cblmemo marked this conversation as resolved.


def get_az_resource_group(
storage_account_name: str,
storage_client: Optional[Client] = None) -> Optional[str]:
"""Returns the resource group name the given storage account belongs to.

Args:
storage_account_name: str; Name of the storage account
storage_client: Optional[Client]; Client object facing storage

Returns:
Name of the resource group the given storage account belongs to.
"""
if storage_client is None:
storage_client = create_az_client('storage')
for account in storage_client.storage_accounts.list():
if account.name == storage_account_name:
# Extract the resource group name from the account ID
# An example of account.id would be the following:
# /subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Storage/storageAccounts/{container_name} # pylint: disable=line-too-long
split_account_id = account.id.split('/')
assert len(split_account_id) == 9
resource_group_name = split_account_id[4]
return resource_group_name
# resource group cannot be found when using container not created
# under the user's subscription id, i.e. public container, or
# private containers not belonging to the user or when the storage account
# does not exist.
return None
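The removed helper (now called through the azure adaptor as `azure.get_az_resource_group`) extracts the resource group by splitting the ARM resource ID. That parsing step in isolation, with a hypothetical function name:

```python
from typing import Optional


def resource_group_from_arm_id(account_id: str) -> Optional[str]:
    """Extracts the resource group name from an ARM resource ID.

    IDs follow the layout:
    /subscriptions/{sub}/resourceGroups/{rg}/providers/...
    so the resource group is the fifth '/'-separated segment.
    """
    parts = account_id.split('/')
    if len(parts) < 5 or parts[3] != 'resourceGroups':
        return None
    return parts[4]
```

Returning `None` for malformed or foreign IDs mirrors the behavior above for public containers and accounts outside the user's subscription.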


def get_az_storage_account_key(
storage_account_name: str,
resource_group_name: Optional[str] = None,
@@ -232,24 +222,25 @@ def get_az_storage_account_key(
"""Returns access key of the given name of storage account.

Args:
storage_account_name: str; Name of the storage account
resource_group_name: Optional[str]; Name of the resource group the
storage_account_name: Name of the storage account
resource_group_name: Name of the resource group the
passed storage account belongs to.
storage_clent: Optional[Client]; Client object facing Storage
resource_client: Optional[Client]; Client object facing Resource
storage_client: Client object facing Storage
resource_client: Client object facing Resource

Returns:
One of the few access keys to the given storage account
One of the two access keys to the given storage account, or None if
the account is not found.
"""
if resource_client is None:
resource_client = create_az_client('resource')
if storage_client is None:
storage_client = create_az_client('storage')
if resource_group_name is None:
resource_group_name = get_az_resource_group(storage_account_name,
storage_client)
resource_group_name = azure.get_az_resource_group(
storage_account_name, storage_client)
# resource_group_name is None when using a public container or
# a private containers not belonging to the user.
# a private container not belonging to the user.
if resource_group_name is None:
return None

@@ -286,23 +277,23 @@ def get_az_storage_account_key(


def is_az_container_endpoint(endpoint_url: str) -> bool:
"""Checks if the provided url is valid container endpoint
"""Checks if the provided URL follows a valid container endpoint naming format.

Args:
endpoint_url: str; Url of container endpoint.
endpoint_url: Url of container endpoint.
e.g. https://azureopendatastorage.blob.core.windows.net/nyctlc

Returns:
boolean; Shows either or not the container endpoint is valid.
bool: True if the endpoint is valid, False otherwise.
"""
# storage account must be length of 3-24
# Storage account names must be 3-24 characters long.
# Reference: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-name-rules#microsoftstorage # pylint: disable=line-too-long
pattern = re.compile(
r'^https://([a-z0-9]{3,24})\.blob\.core\.windows\.net(/[^/]+)*$')
match = pattern.match(endpoint_url)
if match:
return True
return False
if match is None:
return False
return True
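The 3-24 lowercase-alphanumeric rule can be exercised directly against the pattern; a quick check using a few representative endpoints (sample URLs for illustration):

```python
import re

_AZ_CONTAINER_ENDPOINT_RE = re.compile(
    r'^https://([a-z0-9]{3,24})\.blob\.core\.windows\.net(/[^/]+)*$')


def is_az_container_endpoint(endpoint_url: str) -> bool:
    # Accepts only lowercase alphanumeric account names of 3-24 chars,
    # optionally followed by container/blob path segments.
    return _AZ_CONTAINER_ENDPOINT_RE.match(endpoint_url) is not None
```

The public open-data endpoint passes, while a two-character or mixed-case account name is rejected by the character-class and length constraints.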


def create_r2_client(region: str = 'auto') -> Client: