Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New provisioner for RunPod #2829

Merged
merged 88 commits into from
Jan 13, 2024
Merged

New provisioner for RunPod #2829

merged 88 commits into from
Jan 13, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Dec 1, 2023

This PR is based on #2360 and #2681.

Background

This is an attempt to adopt the new provisioner for RunPod instead of using the ray up implementation. It changes the implementaton in #2360 to use our new provisioner API.

Main Reason

It is very hard to get it right for the ray autoscaler to work on the head node when the cluster is launched with ray up. When created with ray up, ray autoscaler will call runpod API to check the healthiness of the cluster on the head node, and due to the special IP setup in runpod, it is very hard to get it right for ray autoscaler to get the correct internal IP. That causes the ray status on the head node displays no healthy nodes, leading to the failure of sky status -r.

Takeaway

It is very easy to use the new provisioner API to implement a support for a new cloud, especially when the new cloud has a clean API. It is much easier than node_provider based implementation. We should recommand new cloud support to use the new provisioner API instead.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-runpod --cloud runpod --gpus RTX4090 nvidia-smi
    • sky exec test-runpod nvidia-smi
    • sky launch -c test-runpod nvidia-smi
    • sky autostop -i 0 --down test-runpod
    • sky status -r test-runpod
    • sky launch -c test-runpod-mn --cloud runpod --num-nodes 2 --gpus RTX4090 nvidia-smi (not supported yet as the interconnection among nodes are non-trivial)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the runpod cloud provider! Just done a pass and left several nits. It mostly looks good to me 🫡

sky/clouds/runpod.py Outdated Show resolved Hide resolved
sky/clouds/runpod.py Outdated Show resolved Hide resolved
sky/clouds/runpod.py Outdated Show resolved Hide resolved
sky/clouds/runpod.py Outdated Show resolved Hide resolved
sky/clouds/runpod.py Outdated Show resolved Hide resolved
sky/provision/runpod/utils.py Outdated Show resolved Hide resolved
sky/provision/runpod/utils.py Outdated Show resolved Hide resolved
sky/templates/runpod-ray.yml.j2 Outdated Show resolved Hide resolved
sky/templates/runpod-ray.yml.j2 Show resolved Hide resolved
tests/test_optimizer_random_dag.py Show resolved Hide resolved
@Michaelvll Michaelvll requested a review from cblmemo January 13, 2024 01:24
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on my side 🫡

sky/clouds/runpod.py Outdated Show resolved Hide resolved
from sky import sky_logging
from sky import status_lib
from sky.provision import common
from sky.provision.runpod import utils
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just started reading this file w/o looking at those imports and find it a little bit confusing for the utils module - not sure what it is. Just a nit and I'm happy with the current implementation too 🫡

sky/clouds/runpod.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a fetcher for runpod?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It is being added. I don't think it is necessary for merging this PR.

sky/provision/runpod/instance.py Outdated Show resolved Hide resolved
sky/provision/runpod/utils.py Outdated Show resolved Hide resolved
sky/provision/runpod/utils.py Outdated Show resolved Hide resolved
sky/templates/runpod-ray.yml.j2 Outdated Show resolved Hide resolved
@Michaelvll Michaelvll merged commit 2ac6aa1 into master Jan 13, 2024
19 checks passed
@Michaelvll Michaelvll deleted the runpod-new-provisioner branch January 13, 2024 04:36
Michaelvll added a commit that referenced this pull request Jan 13, 2024
* init

* remove ray

* update config

* update

* update

* update

* complete bootstrapping

* add start instance

* fix

* fix

* fix

* update

* wait stopping instances

* support normal gcp tpus first

* fix gcp

* support get cluster info

* fix

* update

* wait for instance starting

* rename

* hide gcp package import

* fix

* fix

* update constants

* fix comments

* remove unused methods

* fix comments

* sync 'config' & 'constants' with upstream, Nov 16

* sync 'instace_utils' with the upstream, Nov 16

* fix typing

* parallelize provisioning

* Fix TPU node

* Fix TPU NAME env for tpu node

* implement bulk provision

* refactor selflink

* format

* reduce the sleep time for autostop

* provisioner version refactoring

* refactor

* Add logging

* avoid saving the provisioner version

* format

* format

* Fix scheduling field in config

* format

* fix public key content

* Fix provisioner version for azure

* Use ray port from head node for workers

* format

* fix ray_port

* fix smoke tests

* shorter sleep time

* refactor status refresh version

* Use new provisioner to launch runpod to avoid issue with ray autoscaler on head
Co-authored-by: Justin Merrell <[email protected]>

* Add wait for the instances to be ready

* fix setup

* Retry and give for getting internal IP

* comment

* Remove internal IP

* use external IP
TODO: use external ray port

* fix ssh port

* Unsupported feature

* typo

* fix ssh ports

* rename var

* format

* Fix cloud unsupported resources

* Runpod update name mapping (#2945)

* Avoid using GpuInfo

* fix all_regions

* Fix runpod list accelerators

* format

* revert to GpuInfo

* Fix get_feasible_launchable_resources

* Add error

* Fix optimizer random_dag for feature check

* address comments

* remove test code

* format

* Add type hints

* format

* format

* fix keyerror

* Address comments

---------

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>
Michaelvll added a commit that referenced this pull request Jan 13, 2024
* init

* remove ray

* update config

* update

* update

* update

* complete bootstrapping

* add start instance

* fix

* fix

* fix

* update

* wait stopping instances

* support normal gcp tpus first

* fix gcp

* support get cluster info

* fix

* update

* wait for instance starting

* rename

* hide gcp package import

* fix

* fix

* update constants

* fix comments

* remove unused methods

* fix comments

* sync 'config' & 'constants' with upstream, Nov 16

* sync 'instace_utils' with the upstream, Nov 16

* fix typing

* parallelize provisioning

* Fix TPU node

* Fix TPU NAME env for tpu node

* implement bulk provision

* refactor selflink

* format

* reduce the sleep time for autostop

* provisioner version refactoring

* refactor

* Add logging

* avoid saving the provisioner version

* format

* format

* Fix scheduling field in config

* format

* fix public key content

* Fix provisioner version for azure

* Use ray port from head node for workers

* format

* fix ray_port

* fix smoke tests

* shorter sleep time

* refactor status refresh version

* Use new provisioner to launch runpod to avoid issue with ray autoscaler on head
Co-authored-by: Justin Merrell <[email protected]>

* Add wait for the instances to be ready

* fix setup

* Retry and give for getting internal IP

* comment

* Remove internal IP

* use external IP
TODO: use external ray port

* fix ssh port

* Unsupported feature

* typo

* fix ssh ports

* rename var

* format

* Fix cloud unsupported resources

* Runpod update name mapping (#2945)

* Avoid using GpuInfo

* fix all_regions

* Fix runpod list accelerators

* format

* revert to GpuInfo

* Fix get_feasible_launchable_resources

* Add error

* Fix optimizer random_dag for feature check

* address comments

* remove test code

* format

* Add type hints

* format

* format

* fix keyerror

* Address comments

* Fix ports

---------

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>
Michaelvll added a commit that referenced this pull request Jan 13, 2024
* init

* remove ray

* update config

* update

* update

* update

* complete bootstrapping

* add start instance

* fix

* fix

* fix

* update

* wait stopping instances

* support normal gcp tpus first

* fix gcp

* support get cluster info

* fix

* update

* wait for instance starting

* rename

* hide gcp package import

* fix

* fix

* update constants

* fix comments

* remove unused methods

* fix comments

* sync 'config' & 'constants' with upstream, Nov 16

* sync 'instace_utils' with the upstream, Nov 16

* fix typing

* parallelize provisioning

* Fix TPU node

* Fix TPU NAME env for tpu node

* implement bulk provision

* refactor selflink

* format

* reduce the sleep time for autostop

* provisioner version refactoring

* refactor

* Add logging

* avoid saving the provisioner version

* format

* format

* Fix scheduling field in config

* format

* fix public key content

* Fix provisioner version for azure

* Use ray port from head node for workers

* format

* fix ray_port

* fix smoke tests

* shorter sleep time

* refactor status refresh version

* Use new provisioner to launch runpod to avoid issue with ray autoscaler on head
Co-authored-by: Justin Merrell <[email protected]>

* Add wait for the instances to be ready

* fix setup

* Retry and give for getting internal IP

* comment

* Remove internal IP

* use external IP
TODO: use external ray port

* fix ssh port

* Unsupported feature

* typo

* fix ssh ports

* rename var

* format

* Fix cloud unsupported resources

* Runpod update name mapping (#2945)

* Avoid using GpuInfo

* fix all_regions

* Fix runpod list accelerators

* format

* revert to GpuInfo

* Fix get_feasible_launchable_resources

* Add error

* Fix optimizer random_dag for feature check

* address comments

* remove test code

* format

* Add type hints

* format

* format

* fix keyerror

* Address comments

---------

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>
Michaelvll added a commit that referenced this pull request Jan 14, 2024
* New provisioner for RunPod (#2829)

* init

* remove ray

* update config

* update

* update

* update

* complete bootstrapping

* add start instance

* fix

* fix

* fix

* update

* wait stopping instances

* support normal gcp tpus first

* fix gcp

* support get cluster info

* fix

* update

* wait for instance starting

* rename

* hide gcp package import

* fix

* fix

* update constants

* fix comments

* remove unused methods

* fix comments

* sync 'config' & 'constants' with upstream, Nov 16

* sync 'instace_utils' with the upstream, Nov 16

* fix typing

* parallelize provisioning

* Fix TPU node

* Fix TPU NAME env for tpu node

* implement bulk provision

* refactor selflink

* format

* reduce the sleep time for autostop

* provisioner version refactoring

* refactor

* Add logging

* avoid saving the provisioner version

* format

* format

* Fix scheduling field in config

* format

* fix public key content

* Fix provisioner version for azure

* Use ray port from head node for workers

* format

* fix ray_port

* fix smoke tests

* shorter sleep time

* refactor status refresh version

* Use new provisioner to launch runpod to avoid issue with ray autoscaler on head
Co-authored-by: Justin Merrell <[email protected]>

* Add wait for the instances to be ready

* fix setup

* Retry and give for getting internal IP

* comment

* Remove internal IP

* use external IP
TODO: use external ray port

* fix ssh port

* Unsupported feature

* typo

* fix ssh ports

* rename var

* format

* Fix cloud unsupported resources

* Runpod update name mapping (#2945)

* Avoid using GpuInfo

* fix all_regions

* Fix runpod list accelerators

* format

* revert to GpuInfo

* Fix get_feasible_launchable_resources

* Add error

* Fix optimizer random_dag for feature check

* address comments

* remove test code

* format

* Add type hints

* format

* format

* fix keyerror

* Address comments

---------

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>

* Add docs for runpod installation

* typo

* typo

* update doc

---------

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants