
[Azure] SkyPilot provisioner for Azure #3704

Merged (73 commits) on Jul 15, 2024
Conversation

Michaelvll (Collaborator) commented on Jun 28, 2024

Blocked by #3696, #3700

Single-node

master (05ce5e9)

multitime -n 5 sky launch --cloud azure -y --cpus 2 --down
            Mean        Std.Dev.    Min         Median      Max
real        220.920     6.553       213.297     219.030     231.210
user        13.407      0.800       12.713      12.793      14.526
sys         2.633       0.067       2.567       2.629       2.755

This PR:

multitime -n 5 sky launch --cloud azure -y --cpus 2 --down
            Mean        Std.Dev.    Min         Median      Max
real        225.484     46.623      199.291     201.043     318.462
user        7.344       0.082       7.277       7.305       7.500
sys         1.351       0.047       1.301       1.341       1.439

Single-node launch on existing cluster (1.5x faster)

master (05ce5e9)

multitime -n 5 sky launch -c test-azure-back-2 echo hi
            Mean        Std.Dev.    Min         Median      Max
real        40.386      2.236       36.693      40.717      42.717      
user        10.914      0.085       10.787      10.931      11.043      
sys         1.547       0.055       1.438       1.571       1.588

This PR:

multitime -n 5 sky launch -c test-azure-back-2 echo hi
            Mean        Std.Dev.    Min         Median      Max
real        26.650      2.722       23.534      25.818      30.705      
user        6.063       0.621       5.606       5.726       7.271       
sys         0.995       0.091       0.933       0.955       1.175  

Multi-node cluster (1.8x faster)

master (05ce5e9)

multitime -n 5 sky launch --cloud azure -y --cpus 2 --num-nodes 4 --down
            Mean        Std.Dev.    Min         Median      Max
real        415.640     25.084      384.469     410.001     458.268
user        27.843      0.326       27.420      27.951      28.336
sys         5.152       0.156       5.006       5.045       5.375

This PR:

multitime -n 5 sky launch --cloud azure -y --cpus 2 --num-nodes 4 --down
            Mean        Std.Dev.    Min         Median      Max
real        233.957     3.154       230.198     235.264     238.449
user        8.640       0.648       8.221       8.310       9.928
sys         1.644       0.132       1.490       1.611       1.863

TODO:

  • Backward compatibility (especially check: actual instance names starting with ray- vs. sky-, and the deployment name change from ray-config to skypilot-config)
    • Relaunch on existing running cluster
    • Restart a stopped instance
  • Adding head and worker to the names of instances (we cannot encode the node type in the VM name, as Azure VM names are immutable and we may need to change which node is the head, e.g. if a user terminates the head node; a tagging-based alternative is sketched below)
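
For context on the last item, a minimal sketch (not part of this PR) of a tagging-based alternative: since Azure VM names are immutable, the node type could instead be recorded in a mutable VM tag. The tag key 'skypilot-node-type' and the helper below are illustrative assumptions, not SkyPilot's actual schema.

# Hedged sketch: mark the node role via a tag, which (unlike the VM name)
# can be changed later, e.g. when promoting a worker after the head node
# is terminated. Tag key and helper name are illustrative assumptions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

def set_node_type_tag(subscription_id: str, resource_group: str,
                      vm_name: str, node_type: str) -> None:
    client = ComputeManagementClient(DefaultAzureCredential(),
                                     subscription_id)
    vm = client.virtual_machines.get(resource_group, vm_name)
    tags = dict(vm.tags or {})
    tags['skypilot-node-type'] = node_type  # 'head' or 'worker'
    # begin_update patches only the provided fields; the VM name is untouched.
    client.virtual_machines.begin_update(resource_group, vm_name,
                                         {'tags': tags}).result()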

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud azure --gpus A10 -c test-azure-a10 nvidia-smi --down -i 0
  • All smoke tests: pytest tests/test_smoke.py --azure (except for tests/test_smoke.py::test_cancel_pytorch and tests/test_smoke.py::test_huggingface; tests/test_smoke.py::test_sky_bench also fails, see #3731)
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Michaelvll requested a review from cblmemo on July 10, 2024

cblmemo (Collaborator) left a comment:
Thanks for this awesome refactoring @Michaelvll! It looks mostly good to me. I left some nits here :))

sky/backends/cloud_vm_ray_backend.py (2 resolved review threads)
@@ -843,4 +843,10 @@ def set_pending(cls, job_id: int, managed_job_dag: 'dag_lib.Dag') -> str:
    @classmethod
    def _build(cls, code: str) -> str:
        generated_code = cls._PREFIX + '\n' + code
        return f'{constants.SKY_PYTHON_CMD} -u -c {shlex.quote(generated_code)}'
        # Activate the python env to make sure some cloud CLI, such as az
        # command is available in the subprocess. This useful for a controller

cblmemo (Collaborator):
Suggested change:
- # command is available in the subprocess. This useful for a controller
+ # command is available in the subprocess. This is useful for a controller

nit

cblmemo (Collaborator):

To discuss: do we need to activate this all the time, or should we only activate it when needed? Though arguably this is just sourcing an env, so the overhead might be tolerable.

Michaelvll (Collaborator, Author):

Good call! I just figured out that we can get rid of this activation by better handling the case where a cluster was created with an old provisioner.
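
For context, a minimal sketch of the pattern under discussion, assuming illustrative names (_SKY_PYTHON_CMD, _ACTIVATE_PYTHON_ENV, and the env path below are not SkyPilot's actual constants): the generated command is optionally prefixed with an env activation so that cloud CLIs installed in the runtime env (such as az) are on PATH in the subprocess, and the prefix can be skipped when it is not needed.

import shlex

# Illustrative assumptions, not SkyPilot's actual constants.
_SKY_PYTHON_CMD = '~/skypilot-runtime/bin/python'
_ACTIVATE_PYTHON_ENV = 'source ~/skypilot-runtime/bin/activate'

def build_command(generated_code: str, activate_env: bool = False) -> str:
    cmd = f'{_SKY_PYTHON_CMD} -u -c {shlex.quote(generated_code)}'
    if activate_env:
        # Pay the activation cost only when the subprocess actually needs
        # cloud CLIs (e.g. `az`) from the env, i.e. "activate on need".
        cmd = f'{_ACTIVATE_PYTHON_ENV} && {cmd}'
    return cmd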

).result().properties.outputs

nsg_id = outputs['nsg']['value']

cblmemo (Collaborator):

Do we need to wait until the NSG is created, like here?

# We should wait for the NSG to be created before opening any ports
# to avoid overriding the newly-added NSG rules.
nsg_id = outputs["nsg"]["value"]
nsg_name = nsg_id.split("/")[-1]
network_client = NetworkManagementClient(credentials, subscription_id)
backoff = common_utils.Backoff(max_backoff_factor=1)
start_time = time.time()
while True:
    nsg = network_client.network_security_groups.get(resource_group, nsg_name)
    if nsg.provisioning_state == "Succeeded":
        break
    if time.time() - start_time > _WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS:
        raise RuntimeError(
            f"Fails to create NSG {nsg_name} in {resource_group} within "
            f"{_WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS} seconds."
        )
    backoff_time = backoff.current_backoff()
    logger.info(
        f"NSG {nsg_name} is not created yet. Waiting for "
        f"{backoff_time} seconds before checking again."
    )
    time.sleep(backoff_time)

Michaelvll (Collaborator, Author):

We moved the waiting to the place where we open ports, to reduce the overhead of creating instances. What do you think?

while True:
    if nsg.provisioning_state not in ['Creating', 'Updating']:
        break
    if (time.time() - start_time >
            _WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS):
        logger.warning(
            f'Fails to wait for the creation of NSG {nsg.name} in '
            f'{resource_group} within '
            f'{_WAIT_NSG_CREATION_NUM_TIMEOUT_SECONDS} seconds. '
            'Skip this NSG.')
        break  # Give up on this NSG after the timeout.
    backoff_time = backoff.current_backoff()
    logger.info(f'NSG {nsg.name} is not created yet. Waiting for '
                f'{backoff_time} seconds before checking again.')
    time.sleep(backoff_time)

cblmemo (Collaborator):

Oh, good point! As long as this does not affect other functionality of the cluster, like SSH, it should be fine to wait when opening ports.

Michaelvll (Collaborator, Author):

Yes, since we wait for SSH during provisioning, it should be fine not to wait for the creation of the NSG : )

sky/provision/azure/config.py (resolved review thread)
sky/provision/gcp/instance_utils.py (resolved review thread)
@@ -1,13 +1,11 @@
include sky/backends/monkey_patches/*.py
exclude sky/clouds/service_catalog/data_fetchers/analyze.py
include sky/provision/kubernetes/manifests/*
include sky/provision/azure/*

cblmemo (Collaborator):

Why do we need this only for Azure, while other clouds using the new provisioner do not?

Michaelvll (Collaborator, Author):

This is because we have some non-Python files in the provisioner, i.e. the two JSON template files, which would otherwise not be included in the installed package.

cblmemo (Collaborator):

I see. Got it!
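
As background on why the MANIFEST.in entry matters, a minimal sketch (illustrative, not SkyPilot's actual setup.py): setuptools packages *.py files automatically, but data files such as the JSON templates are only bundled when MANIFEST.in lists them and the package opts in.

from setuptools import setup, find_packages

setup(
    name='skypilot',  # illustrative; not the actual setup.py
    packages=find_packages(),
    # Honor MANIFEST.in entries (e.g. `include sky/provision/azure/*`)
    # so non-.py files like the JSON templates ship with the package.
    include_package_data=True,
)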

sky/utils/command_runner.py (resolved review thread)
sky/utils/controller_utils.py (resolved review thread)
sky/provision/azure/instance.py (resolved review thread)
Michaelvll requested a review from cblmemo on July 14, 2024
Michaelvll (Collaborator, Author) commented on Jul 14, 2024:

Tested (ded2dd8):

  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud gcp --num-nodes 2 --cpus 2 echo hi
  • All smoke tests: pytest tests/test_smoke.py --azure (except for tests/test_smoke.py::test_cancel_pytorch and tests/test_smoke.py::test_huggingface due to an availability issue; tests/test_smoke.py::test_sky_bench also fails, see #3731)
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cblmemo (Collaborator) left a comment:

Thanks for the prompt fix! It looks great to me ;)

Michaelvll added this pull request to the merge queue on Jul 15, 2024
Merged via the queue into master with commit 465d36c on Jul 15, 2024
20 checks passed
Michaelvll deleted the azure-provisioner branch on July 15, 2024
Michaelvll added a commit that referenced this pull request Aug 23, 2024
* Use SkyPilot for status query

* format

* Avoid reconfig

* Add todo

* Add termination and stopping

* add stop and termination into __init__

* get rid of azure special handling in backend

* format

* Fix filtering for autodown clusters

* Move NSG waiting

* wip

* wip

* working?

* Fix and format

* remove node providers

* Add manifest and fix formating

* Fix waiting for deletion

* remove azure provider format

* Skip termination for resource group does not exist

* Add retry for fetching subscription ID

* Fix provisioning state

* Fix restarting instances by adding wait for pendings

* fixs

* fix

* Add azure handler

* adopt changes from node provider

* format

* fix merge conflict

* format

* Add detailed reason

* fix import

* Fix backward compat

* fix head node fetching

* format

* fix existing instances

* backward compat test for multi-node

* backward compat for cached cluster info

* fix back compat for provisioner update

* minor

* fix restarting

* revert accidental changes

* fix logging controller utils

* add path

* activate python env for sky jobs logs

* fix quote

* format

* Longer timeout for docker initialization

* fix

* make cloud init more readable

* fix

* fix docker

* fix tests

* add region argument for eu-south-1 region

* Add --region argument for storage aws s3

* Fix tests

* longer

* wip

* wip

* address comments

* revert storage

* revert changes
Successfully merging this pull request may close these issues.

logging: Capture error messages for RuntimeError: Errors occurred during provision; check logs above.