Initializing Azure instances is very slow #328

Closed
suquark opened this issue Feb 15, 2022 · 13 comments
Labels
enhancement New feature or request

Comments

@suquark
Collaborator

suquark commented Feb 15, 2022

It takes me 14 min to spin up a cluster with 2 CPU nodes.

The most time-consuming part is installing pip packages, especially azure-cli. This could be addressed by releasing images with azure-cli pre-installed.

@infwinston
Member

+1. I had this slow initialization issue too. I might be missing something, but why does azure-cli need to be installed on the remote VM?

@suquark
Collaborator Author

suquark commented Feb 15, 2022

Because ray-autoscaler is using it. For GCP and AWS, their CLIs are already installed.

@infwinston
Member

Oh, I see. So it's used on the head node to further provision resources for worker nodes? Is that correct?

@suquark
Collaborator Author

suquark commented Feb 16, 2022

hmmm, it is mostly used by ray autoscaler for monitoring

@Michaelvll Michaelvll added the enhancement New feature or request label Mar 7, 2022
@concretevitamin
Member

concretevitamin commented Aug 25, 2022

I tried revisiting this issue briefly. For a cpunode:

  • Launching using the Azure web console: about 1.5 min from clicking the "create" button to being able to SSH in. Same VM image and region; the only difference is that it used an existing resource group.
  • Launching using sky launch (which means ray autoscaler, which means Azure python SDK): super slow, ~4-5 min from create to SSH; total ~9 min (including installing the runtime). Every step is slower than the console.

I hacked the template by using the same resource group per region -- no speedup.

So the root cause seems to be Azure's python SDK being much slower than their console. We can take a deeper look.

Typical output

  • ~4-5min from create to SSH
  • ~4min to install runtime
I 08-25 08:58:46 cloud_vm_ray_backend.py:892] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2022-08-25-08-58-44-664631/provision.log
I 08-25 08:58:46 cloud_vm_ray_backend.py:1096] Launching on Azure eastus ()
I 08-25 09:01:53 cloud_vm_ray_backend.py:1131] Retrying head node provisioning due to head fetching timeout.
I 08-25 09:03:40 log_utils.py:45] Head node is up.
I 08-25 09:07:41 cloud_vm_ray_backend.py:984] Successfully provisioned or found existing VM.
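
The two phase estimates above can be checked against the log timestamps. The following is a minimal sketch, assuming the log prefix format shown in the output ("I MM-DD HH:MM:SS file:line] message") and taking the year 2022 from the log directory name; the helper names `parse_timestamp` and `phase_durations` are hypothetical and not part of SkyPilot.

```python
from datetime import datetime

# Three lines copied from the output above: launch start, head node up, done.
LOG_LINES = [
    "I 08-25 08:58:46 cloud_vm_ray_backend.py:1096] Launching on Azure eastus ()",
    "I 08-25 09:03:40 log_utils.py:45] Head node is up.",
    "I 08-25 09:07:41 cloud_vm_ray_backend.py:984] Successfully provisioned or found existing VM.",
]


def parse_timestamp(line: str, year: int = 2022) -> datetime:
    """Parse the 'MM-DD HH:MM:SS' fields that follow the log level letter."""
    _level, date_part, time_part = line.split(maxsplit=3)[:3]
    return datetime.strptime(f"{year} {date_part} {time_part}", "%Y %m-%d %H:%M:%S")


def phase_durations(lines):
    """Return the seconds elapsed between each pair of consecutive log lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


durations = phase_durations(LOG_LINES)
for label, seconds in zip(["create to head node up", "runtime install"], durations):
    print(f"{label}: {seconds / 60:.1f} min")
print(f"total: {sum(durations) / 60:.1f} min")
```

This gives 294 s (about 4.9 min) until the head node is up and 241 s (about 4.0 min) for the rest, for a total of 535 s, consistent with the "~4-5min", "~4min", and "total ~9 min" figures.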

@romilbhardwaj
Collaborator

> So the root cause seems to be Azure's python SDK being much slower than their console. We can take a deeper look.

Might be good to verify this hypothesis by using their pure python SDK (without ray autoscaler) to provision a VM and measure time. Here's an example.
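
A measurement along these lines could be sketched as follows. This is not the linked example, only a minimal timing harness under stated assumptions: `time_phases`, `create_vm`, and `wait_for_ssh` are hypothetical names, and the two phase functions are placeholders that just sleep. In a real test, `create_vm` would wrap the Azure Python SDK provisioning call (for instance, waiting on the poller returned by `ComputeManagementClient.virtual_machines.begin_create_or_update`), and `wait_for_ssh` would poll port 22 until it accepts connections.

```python
import time
from typing import Callable, Dict


def time_phases(phases: Dict[str, Callable[[], object]],
                clock: Callable[[], float] = time.monotonic) -> Dict[str, float]:
    """Run each phase in order and return its wall-clock duration in seconds."""
    results = {}
    for name, run_phase in phases.items():
        start = clock()
        run_phase()
        results[name] = clock() - start
    return results


# Placeholder phases (assumptions for illustration only). Replace the bodies
# with the real SDK call and a real SSH readiness check when measuring.
def create_vm() -> None:
    time.sleep(0.01)


def wait_for_ssh() -> None:
    time.sleep(0.01)


timings = time_phases({"create_vm": create_vm, "wait_for_ssh": wait_for_ssh})
for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
```

Comparing these per-phase numbers against the same phases under `sky launch` would show whether the SDK itself or the ray autoscaler layer on top of it accounts for the slowdown.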

@github-actions
Contributor

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label May 12, 2023
@infwinston
Member

We should also keep this one open until we are satisfied with the provisioning speed on Azure.

@infwinston infwinston removed the Stale label May 13, 2023
@github-actions
Contributor

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Sep 10, 2023
@github-actions
Contributor

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 21, 2023
@WesleyYue

WesleyYue commented Jun 5, 2024

Can this be re-opened? It is still very slow today. For reference, a simple vllm setup takes 18 min.

@romilbhardwaj romilbhardwaj reopened this Jun 5, 2024
@github-actions github-actions bot removed the Stale label Jun 6, 2024
@WesleyYue

Related #3695

@Michaelvll
Collaborator

This issue should be mitigated by #3704. Closing for now.
