Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Install SkyPilot runtime in separate env #3575

Merged
merged 41 commits into from
May 23, 2024
Merged

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented May 21, 2024

Blocked by #3572
Moved #2801 here due to the merging issues and our latest support for explicitly python / ray path.

This PR enables:

  1. Users to install dependencies in the base environment without worrying about conflicting with SkyPilot runtime.
  2. Users to have an image with python 3.12 as the default python version (which is not supported by SkyPilot runtime).
  3. A more SkyPilot native and less verbose axolotl example with the support of the latest axolotl docker image.

Closes #2722
Mitigate #2673

Background

Installing every skypilot runtime in the base environment can cause issues when users also tries to install their dependencies in the base environment on the remote VM. It is very easy to cause the VM to become in an unexpected state. We now move all the skypilot runtime to a new conda environment for better isolation from the users' own Python environment.

Performance

New cluster

TL;DR: there is a negligible negative effect to the launching time for all the clouds with the separate env.

Results on AWS, GCP, Kubernetes

AWS

multitime -n 5 sky launch -y --cloud aws --cpus 2
master

            Mean        Std.Dev.    Min         Median      Max
real        152.313     7.229       142.269     152.518     164.088     
user        6.756       0.527       6.405       6.498       7.800       
sys         0.721       0.116       0.562       0.714       0.923  

This PR

            Mean        Std.Dev.    Min         Median      Max
real        154.497     2.993       151.134     154.496     157.933     
user        6.841       0.419       6.516       6.665       7.659       
sys         0.712       0.079       0.637       0.684       0.853
            Mean        Std.Dev.    Min         Median      Max
real        149.659     6.103       143.091     148.129     160.476     
user        6.454       0.064       6.374       6.439       6.559       
sys         0.695       0.029       0.645       0.699       0.727  

GCP

multitime -n 5 sky launch -y --cloud gcp --cpus 2
master

            Mean        Std.Dev.    Min         Median      Max
real        129.164     5.770       121.261     129.621     135.799     
user        17.263      0.141       17.135      17.191      17.499      
sys         1.764       0.075       1.675       1.730       1.863   

This PR

(21s slower, due to the installation of dependencies for ray and skypilot)

            Mean        Std.Dev.    Min         Median      Max
real        150.368     4.356       144.007     150.592     156.529     
user        17.784      0.586       17.283      17.671      18.895      
sys         1.849       0.128       1.702       1.834       2.064    

after adding a6f6996

            Mean        Std.Dev.    Min         Median      Max
real        124.762     4.736       120.339     122.328     133.610     
user        17.562      0.585       17.147      17.307      18.724      
sys         1.875       0.107       1.727       1.867       2.032  

Kubernetes

master:

            Mean        Std.Dev.    Min         Median      Max
real        54.512      1.991       51.994      54.134      56.943      
user        4.770       0.611       4.416       4.489       5.988       
sys         0.633       0.064       0.558       0.645       0.734 

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        56.873      3.327       53.555      54.552      61.738      
user        4.729       0.469       4.490       4.494       5.668       
sys         0.525       0.074       0.465       0.485       0.665

Launch on existing cluster

sky launch -c existing-cluster --cpus 2 --cloud kubernetes -y
multitime -n 5 sky launch -c existing-cluster echo hi

Master:

            Mean        Std.Dev.    Min         Median      Max
real        49.564      1.980       47.926      48.969      53.362      
user        3.318       0.446       3.035       3.139       4.204       
sys         0.423       0.104       0.340       0.376       0.627   

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        46.420      0.826       45.826      46.087      48.059      
user        3.096       0.046       3.043       3.080       3.169       
sys         0.409       0.021       0.395       0.399       0.451    

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-ax --cloud kubernetes axolotl.yaml --env HF_TOKEN
    • sky launch -c test-ax --cloud gcp -i 0 --down --use-spot axolotl.yaml --env HF_TOKEN
    • sky launch -c test-ax --cloud gcp -i 0 --down --use-spot axolotl.yaml --env HF_TOKEN --image-id docker:winglian/axolotl:main-latest (with python 3.12 in the docker image)
    • sky launch --image-id ami-0df2a11dd1fe1f8e3 --cpus 2 --cloud aws --region us-east-1 -y -i20 --down conda env list (AMI used in Custom images: challenges & problems to solve #2673)
  • All smoke tests: pytest tests/test_smoke.py
  • All smoke tests: pytest tests/test_smoke.py --aws
  • All smoke tests: pytest tests/test_smoke.py --kubernetes
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@dongreenberg
Copy link
Contributor

Cc @carolineechen

@Michaelvll Michaelvll marked this pull request as ready for review May 22, 2024 17:35
@Michaelvll Michaelvll requested a review from cblmemo May 23, 2024 06:37
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @Michaelvll ! It looks great to me :)) Left some nits

llm/axolotl/axolotl-docker.yaml Outdated Show resolved Hide resolved
llm/axolotl/axolotl-spot.yaml Outdated Show resolved Hide resolved
sky/skylet/attempt_skylet.py Show resolved Hide resolved
sky/skylet/constants.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update Axolotl's GitHub readme with this, maybe after 0.6.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will update it in a separate PR.

@Michaelvll Michaelvll merged commit 97cba00 into master May 23, 2024
20 checks passed
@Michaelvll Michaelvll deleted the skypilot-runtime-env branch May 23, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Ray version change breaks SkyPilot cluster
4 participants