-
Notifications
You must be signed in to change notification settings - Fork 533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Azure] SkyPilot provisioner for Azure #3704
Conversation
…o azure-provisioner
…kypilot into azure-provisioner
…into azure-provisioner
…o azure-provisioner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this awesome refactoring @Michaelvll ! It looks mostly good to me. Left some nits here :))
sky/jobs/utils.py
Outdated
@@ -843,4 +843,10 @@ def set_pending(cls, job_id: int, managed_job_dag: 'dag_lib.Dag') -> str: | |||
@classmethod | |||
def _build(cls, code: str) -> str: | |||
generated_code = cls._PREFIX + '\n' + code | |||
return f'{constants.SKY_PYTHON_CMD} -u -c {shlex.quote(generated_code)}' | |||
# Activate the python env to make sure some cloud CLI, such as az | |||
# command is available in the subprocess. This useful for a controller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# command is available in the subprocess. This useful for a controller | |
# command is available in the subprocess. This is useful for a controller |
nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To discuss: do we need to activate this all the time? Should we only activate on need? Though arguably this is only a source env
so the overhead might be tolerable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! Just figured out that we can get rid of this activation by better handling the case where a cluster is created with an old provisioner.
).result().properties.outputs | ||
|
||
nsg_id = outputs['nsg']['value'] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to wait until the nsg is created like here?
skypilot/sky/skylet/providers/azure/config.py
Lines 183 to 204 in 5e23f16
# We should wait for the NSG to be created before opening any ports | |
# to avoid overriding the newly-added NSG rules. | |
nsg_id = outputs["nsg"]["value"] | |
nsg_name = nsg_id.split("/")[-1] | |
network_client = NetworkManagementClient(credentials, subscription_id) | |
backoff = common_utils.Backoff(max_backoff_factor=1) | |
start_time = time.time() | |
while True: | |
nsg = network_client.network_security_groups.get(resource_group, nsg_name) | |
if nsg.provisioning_state == "Succeeded": | |
break | |
if time.time() - start_time > _WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS: | |
raise RuntimeError( | |
f"Fails to create NSG {nsg_name} in {resource_group} within " | |
f"{_WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS} seconds." | |
) | |
backoff_time = backoff.current_backoff() | |
logger.info( | |
f"NSG {nsg_name} is not created yet. Waiting for " | |
f"{backoff_time} seconds before checking again." | |
) | |
time.sleep(backoff_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We moved the waiting to the place where we are trying to opening ports to reduce the overheads for creating instances. What do you think?
skypilot/sky/provision/azure/instance.py
Lines 730 to 744 in 4b6dc85
while True: | |
if nsg.provisioning_state not in ['Creating', 'Updating']: | |
break | |
if (time.time() - start_time > | |
_WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS): | |
logger.warning( | |
f'Fails to wait for the creation of NSG {nsg.name} in ' | |
f'{resource_group} within ' | |
f'{_WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS} seconds. ' | |
'Skip this NSG.') | |
backoff_time = backoff.current_backoff() | |
logger.info(f'NSG {nsg.name} is not created yet. Waiting for ' | |
f'{backoff_time} seconds before checking again.') | |
time.sleep(backoff_time) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good point! As long as this does not affect other functionality of the cluster, like SSH, it should be fine to wait in opening ports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, since we will wait for SSH during provisioning, it should be fine to not wait for the creation of nsg : )
@@ -1,13 +1,11 @@ | |||
include sky/backends/monkey_patches/*.py | |||
exclude sky/clouds/service_catalog/data_fetchers/analyze.py | |||
include sky/provision/kubernetes/manifests/* | |||
include sky/provision/azure/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this for Azure only, while other clouds using new provisioner does not need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This because we have some non-python file in the provisioner, i.e. the two template json files, which will not be included in the installed package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Got it!
Tested (ded2dd8):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt fix! It looks great to me ;)
* Use SkyPilot for status query * format * Avoid reconfig * Add todo * Add termination and stopping * add stop and termination into __init__ * get rid of azure special handling in backend * format * Fix filtering for autodown clusters * Move NSG waiting * wip * wip * working? * Fix and format * remove node providers * Add manifest and fix formating * Fix waiting for deletion * remove azure provider format * Skip termination for resource group does not exist * Add retry for fetching subscription ID * Fix provisioning state * Fix restarting instances by adding wait for pendings * fixs * fix * Add azure handler * adopt changes from node provider * format * fix merge conflict * format * Add detailed reason * fix import * Fix backward compat * fix head node fetching * format * fix existing instances * backward compat test for multi-node * backward compat for cached cluster info * fix back compat for provisioner update * minor * fix restarting * revert accidental changes * fix logging controller utils * add path * activate python env for sky jobs logs * fix quote * format * Longer timeout for docker initialization * fix * make cloud init more readable * fix * fix docker * fix tests * add region argument for eu-south-1 region * Add --region argument for storage aws s3 * Fix tests * longer * wip * wip * address comments * revert storage * revert changes
Blocked by #3696, #3700Single-node
master (05ce5e9)
This PR:
Single-node launch on existing cluster (1.5x faster)
master (05ce5e9)
This PR:
Multi-node cluster (1.8x faster)
master (05ce5e9)
This PR:
TODO:
ray-
vssky-
and changing the name of deployment fromray-config
toskypilot-config
)Adding(We cannot add node type to the VM name as Azure VM name is immutable and we may have to change the name if we decide to change which node is the head, i.e. a user terminated the head node)head
andworker
in the name of instancesTested (run the relevant ones):
bash format.sh
sky launch --cloud azure --gpus A10 -c test-azure-a10 nvidia-smi --down -i 0
pytest tests/test_smoke.py --azure
(except fortests/test_smoke.py::test_cancel_pytorch
andtests/test_smoke.py::test_huggingface
, also [Tests]tests/test_smoke.py::test_sky_bench
fail #3731 )pytest tests/test_smoke.py::test_fill_in_the_name
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --aws
(check the fix for tests in this PR does not hurt other clouds, with [Storage] Add region argument for AWS s3 upload command #3735 merged)conda deactivate; bash -i tests/backward_compatibility_tests.sh