[Azure] Avoid azure reconfig everytime, speed up launch by up to 5.8x #3697

Merged
merged 10 commits into from
Jun 29, 2024

@Michaelvll Michaelvll commented Jun 27, 2024

Mitigates #3695.

This PR avoids an additional Azure reconfiguration step on every launch, which caused a significant slowdown.
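As a rough illustration of the idea (skip the expensive configuration step when it has already been done), here is a minimal sketch. It is not the code from this PR: `FakeDeploymentClient`, `exists`, `create_or_update`, and `ensure_deployment` are hypothetical names used only for illustration, and the in-memory client stands in for whatever cloud API performs the slow step.

```python
from typing import Dict


class FakeDeploymentClient:
    """Hypothetical in-memory stand-in for a cloud deployment API."""

    def __init__(self) -> None:
        self.deployments: Dict[str, dict] = {}
        self.create_calls = 0

    def exists(self, name: str) -> bool:
        return name in self.deployments

    def create_or_update(self, name: str, template: dict) -> None:
        # Assumption: in a real cloud this is the slow call worth avoiding.
        self.create_calls += 1
        self.deployments[name] = template


def ensure_deployment(client: FakeDeploymentClient, name: str,
                      template: dict) -> bool:
    """Create the deployment only if it is missing.

    Returns True if a deployment was created, False if it was skipped.
    """
    if client.exists(name):
        return False
    client.create_or_update(name, template)
    return True


if __name__ == "__main__":
    client = FakeDeploymentClient()
    print(ensure_deployment(client, "test-azure", {"cpus": 2}))  # first launch
    print(ensure_deployment(client, "test-azure", {"cpus": 2}))  # relaunch
    print(client.create_calls)
```

With this pattern, relaunching on an existing cluster pays for one existence check instead of a full deployment, which is the case where the numbers below show the largest gain.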

Single Node (1.6x faster)

multitime -n 5 sky launch --cloud azure -y
            Mean        Std.Dev.    Min         Median      Max
real        214.837     11.941      203.255     211.937     237.772
user        9.281       0.455       8.990       9.093       10.186
sys         2.748       0.080       2.641       2.761       2.873

Single Node on existing cluster (5.8x faster)

multitime -n 5 sky launch -c test-azure-skip-deploy --cloud azure --cpus 2 -y
            Mean        Std.Dev.    Min         Median      Max
real        28.327      0.950       27.159      28.309      29.676      
user        10.261      0.166       10.113      10.187      10.582      
sys         1.497       0.066       1.442       1.472       1.623 

Multi Node (1.24x faster)

multitime -n 5 sky launch --cloud azure -y --cpus 2 --num-nodes 2
            Mean        Std.Dev.    Min         Median      Max
real        541.261     284.335     385.725     402.150     1109.727
user        21.173      2.179       19.834      19.920      25.469
sys         5.746       0.293       5.519       5.640       6.322
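When reading the three tables above, it may help to compare how noisy each measurement is. The sketch below uses only the "real" rows reported above and computes the standard deviation relative to the mean, plus the gap between mean and median. The helper names (`relative_spread`, `summarize`) are made up for this example.

```python
# (mean, std_dev, median) of the "real" rows reported above, in seconds.
REAL_TIMES = {
    "single_node": (214.837, 11.941, 211.937),
    "existing_cluster": (28.327, 0.950, 28.309),
    "multi_node": (541.261, 284.335, 402.150),
}


def relative_spread(mean: float, std_dev: float) -> float:
    """Standard deviation as a fraction of the mean."""
    return std_dev / mean


def summarize(rows: dict) -> dict:
    """Per-benchmark relative spread and mean-minus-median gap."""
    return {
        name: {
            "cv": relative_spread(mean, std_dev),
            "mean_minus_median": mean - median,
        }
        for name, (mean, std_dev, median) in rows.items()
    }


if __name__ == "__main__":
    for name, stats in summarize(REAL_TIMES).items():
        print(f"{name}: cv={stats['cv']:.3f}, "
              f"mean-median={stats['mean_minus_median']:.1f}s")
```

The multi-node run has by far the largest relative spread, and its mean sits well above its median, which is consistent with the single 1109.727 s maximum pulling the mean up. For that row the median is probably the more representative figure.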

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky autostop -i 0 --down test-azure
    • Wait until the cluster is terminated, then sky launch -c test-azure echo hi
    • sky launch -c test-azure --cloud azure --cpus 2 echo hi; manually terminate the cluster; sky down test-azure
    • sky launch -c test-azure --cloud azure --cpus 2 echo hi; manually delete resource group on the portal; sky stop test-azure (our error handling works)
  • All smoke tests: pytest tests/test_smoke.py --azure (except for GPU-related tests, due to the quota limit and the A10 issue addressed in "[Core] Fix A10 GPU on Azure" #3707)
FAILED tests/test_smoke.py::test_managed_jobs - Exception: test failed: less /var/tmp/managed-jobs-i8368hll.log
FAILED tests/test_smoke.py::test_skyserve_large_readiness_timeout - Exception: test failed: less /var/tmp/test-skyserve-large-readiness-timeout-xcoph374.log
FAILED tests/test_smoke.py::test_huggingface - Exception: test failed: less /var/tmp/huggingface_glue_imdb_app-fool8shv.log
FAILED tests/test_smoke.py::test_multi_echo - Exception: test failed: less /var/tmp/multi_echo-sz1vyxl_.log
FAILED tests/test_smoke.py::test_job_queue - Exception: test failed: less /var/tmp/job_queue-51ns08vv.log
FAILED tests/test_smoke.py::test_job_queue_with_docker[docker:continuumio/miniconda3:latest] - Exception: test failed: less /var/tmp/job_queue_with_docker-4m7keo3r.log
FAILED tests/test_smoke.py::test_cancel_pytorch - Exception: test failed: less /var/tmp/cancel-pytorch-w43a0mbm.log
FAILED tests/test_smoke.py::test_job_queue_with_docker[docker:nvidia/cuda:11.8.0-devel-ubuntu18.04] - Exception: test failed: less /var/tmp/job_queue_with_docker-bxu6ez58.log
FAILED tests/test_smoke.py::test_job_queue_with_docker[docker:ubuntu:18.04] - Exception: test failed: less /var/tmp/job_queue_with_docker-q6o1kwco.log
FAILED tests/test_smoke.py::test_job_queue_with_docker[docker:winglian/axolotl:main-latest] - Exception: test failed: less /var/tmp/job_queue_with_docker-e0xabtb2.log
FAILED tests/test_smoke.py::test_sky_bench - Exception: test failed: less /var/tmp/sky-bench-exgbfdgp.log
FAILED tests/test_smoke.py::test_job_queue_multinode - Exception: test failed: less /var/tmp/job_queue_multinode-cd29qwwe.log
FAILED tests/test_smoke.py::test_job_queue_with_docker[docker:continuumio/miniconda3:24.1.2-0] - Exception: test failed: less /var/tmp/job_queue_with_docker-n1agtcv7.log
FAILED tests/test_smoke.py::test_skyserve_readiness_timeout_fail - Exception: test failed: less /var/tmp/test-skyserve-readiness-timeout-fail-y1wcupuq.log
FAILED tests/test_smoke.py::test_skyserve_base_ondemand_fallback - Exception: test failed: less /var/tmp/test-skyserve-base-ondemand-fallback-h887u0t1.log
FAILED tests/test_smoke.py::test_skyserve_update - Exception: test failed: less /var/tmp/test-skyserve-update-wigafrh6.log
FAILED tests/test_smoke.py::test_skyserve_llm - Exception: test failed: less /var/tmp/test-skyserve-llm-smothj9l.log
FAILED tests/test_smoke.py::test_skyserve_new_autoscaler_update[blue_green] - Exception: test failed: less /var/tmp/test-skyserve-new-autoscaler-update-5pof25gf.log
FAILED tests/test_smoke.py::test_skyserve_new_autoscaler_update[rolling] - Exception: test failed: less /var/tmp/test-skyserve-new-autoscaler-update-kg7w6usv.log
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll changed the title [Azure] Avoid azure reconfig everytime which reduce launch by 1.6x [Azure] Avoid azure reconfig everytime which speed up launch by 1.6x Jun 27, 2024
@Michaelvll Michaelvll changed the title [Azure] Avoid azure reconfig everytime which speed up launch by 1.6x [Azure] Avoid azure reconfig everytime, speed up launch by 1.6x Jun 27, 2024
@Michaelvll Michaelvll marked this pull request as ready for review June 27, 2024 08:37
@concretevitamin concretevitamin requested a review from cblmemo June 27, 2024 15:31
@Michaelvll Michaelvll added the P0 label Jun 28, 2024
@cblmemo cblmemo left a comment

Thanks for fixing this @Michaelvll ! LGTM.

@Michaelvll Michaelvll changed the title [Azure] Avoid azure reconfig everytime, speed up launch by 1.6x [Azure] Avoid azure reconfig everytime, speed up launch by up to 6x Jun 28, 2024
@Michaelvll Michaelvll changed the title [Azure] Avoid azure reconfig everytime, speed up launch by up to 6x [Azure] Avoid azure reconfig everytime, speed up launch by up to 5.8x Jun 28, 2024
@Michaelvll Michaelvll merged commit 4821f70 into master Jun 29, 2024
20 checks passed
@Michaelvll Michaelvll deleted the azure-no-additional-config branch June 29, 2024 01:48
Michaelvll added a commit that referenced this pull request Aug 23, 2024
…#3697)

* Avoid azure reconfig everytime

* Add debug message

* format

* Fix error handling

* format

* skip deployment recreation when deployment exist

* Add retry for subscription ID

* fix logging

* format

* comment