New provisioner for RunPod #2829
Conversation
Thanks for adding the RunPod cloud provider! I've just done a pass and left several nits. It mostly looks good to me 🫡
LGTM on my side 🫡
from sky import sky_logging
from sky import status_lib
from sky.provision import common
from sky.provision.runpod import utils
I just started reading this file without looking at those imports and found the `utils` module a little confusing - it's not obvious what it contains. Just a nit, and I'm happy with the current implementation too 🫡
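To make the reviewer's point concrete: a provisioner-side `utils` module is typically a thin wrapper around the cloud SDK plus some pure helper logic. The sketch below is purely illustrative and is not the actual contents of `sky/provision/runpod/utils.py`; the `runpod.get_pods()` call is an assumption about the runpod-python SDK, and the name-prefix tagging convention is likewise an assumption.

```python
# Hypothetical sketch of a provisioner-side `utils` module: SDK wrappers
# plus pure filtering logic. Not the real sky/provision/runpod/utils.py.
from typing import Any, Dict, List


def filter_cluster_pods(pods: List[Dict[str, Any]],
                        cluster_name: str) -> List[Dict[str, Any]]:
    """Keep only the pods whose names mark them as part of this cluster.

    Assumes instances are tagged by prefixing the pod name with the
    cluster name (an illustrative convention, not necessarily RunPod's).
    """
    prefix = f'{cluster_name}-'
    return [p for p in pods if p.get('name', '').startswith(prefix)]


def list_instances(cluster_name: str) -> List[Dict[str, Any]]:
    """Query the RunPod API and return this cluster's pods (sketch only)."""
    import runpod  # assumed: the runpod-python SDK
    return filter_cluster_pods(runpod.get_pods(), cluster_name)


if __name__ == '__main__':
    sample = [{'name': 'test-runpod-head', 'id': 'a'},
              {'name': 'other-cluster-head', 'id': 'b'}]
    print(filter_cluster_pods(sample, 'test-runpod'))
```

Keeping the filtering logic separate from the SDK call makes the pure part unit-testable without network access.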
Do we have a fetcher for runpod?
No. It is being added. I don't think it is necessary for merging this PR.
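For context on what a "fetcher" would do: catalog fetchers pull accelerator and pricing data from a cloud's API and emit CSV rows for the service catalog. The sketch below is a rough illustration only; `runpod.get_gpus()` and the field names `displayName`/`memoryInGb` are assumptions about the runpod-python SDK, and the two-column layout is not the real catalog schema.

```python
# Rough sketch of a RunPod catalog fetcher: turn GPU records from the API
# into catalog-style CSV. Column layout and SDK field names are assumptions.
import csv
import io
from typing import Any, Dict, List


def gpus_to_csv(gpus: List[Dict[str, Any]]) -> str:
    """Render GPU records as catalog-style CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['AcceleratorName', 'GpuMemoryGiB'])
    for gpu in gpus:
        writer.writerow([gpu['displayName'], gpu['memoryInGb']])
    return buf.getvalue()


def fetch_catalog() -> str:
    """Fetch the live GPU list and render it (network call, sketch only)."""
    import runpod  # assumed: the runpod-python SDK
    return gpus_to_csv(runpod.get_gpus())


if __name__ == '__main__':
    print(gpus_to_csv([{'displayName': 'RTX 4090', 'memoryInGb': 24}]))
```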
Squashed commit history:

* init
* remove ray
* update config
* update
* update
* update
* complete bootstrapping
* add start instance
* fix
* fix
* fix
* update
* wait stopping instances
* support normal gcp tpus first
* fix gcp
* support get cluster info
* fix
* update
* wait for instance starting
* rename
* hide gcp package import
* fix
* fix
* update constants
* fix comments
* remove unused methods
* fix comments
* sync 'config' & 'constants' with upstream, Nov 16
* sync 'instance_utils' with the upstream, Nov 16
* fix typing
* parallelize provisioning
* Fix TPU node
* Fix TPU NAME env for tpu node
* implement bulk provision
* refactor selflink
* format
* reduce the sleep time for autostop
* provisioner version refactoring
* refactor
* Add logging
* avoid saving the provisioner version
* format
* format
* Fix scheduling field in config
* format
* fix public key content
* Fix provisioner version for azure
* Use ray port from head node for workers
* format
* fix ray_port
* fix smoke tests
* shorter sleep time
* refactor status refresh version
* Use new provisioner to launch runpod to avoid issue with ray autoscaler on head
* Add wait for the instances to be ready
* fix setup
* Retry and give for getting internal IP
* comment
* Remove internal IP
* use external IP TODO: use external ray port
* fix ssh port
* Unsupported feature
* typo
* fix ssh ports
* rename var
* format
* Fix cloud unsupported resources
* Runpod update name mapping (#2945)
* Avoid using GpuInfo
* fix all_regions
* Fix runpod list accelerators
* format
* revert to GpuInfo
* Fix get_feasible_launchable_resources
* Add error
* Fix optimizer random_dag for feature check
* address comments
* remove test code
* format
* Add type hints
* format
* format
* fix keyerror
* Address comments
* Fix ports
* Add docs for runpod installation
* typo
* typo
* update doc

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>
This PR is based on #2360 and #2681.
Background
This is an attempt to adopt the new provisioner for RunPod instead of using the `ray up` implementation. It changes the implementation in #2360 to use our new provisioner API.

Main Reason

It is very hard to get the ray autoscaler to work on the head node when the cluster is launched with `ray up`. When created with `ray up`, the ray autoscaler calls the runpod API on the head node to check the health of the cluster, and due to the special IP setup in runpod it is very hard for the autoscaler to obtain the correct internal IP. As a result, `ray status` on the head node displays no healthy nodes, leading to the failure of `sky status -r`.

Takeaway

It is very easy to use the new provisioner API to implement support for a new cloud, especially when the cloud has a clean API - much easier than a `node_provider`-based implementation. We should recommend the new provisioner API for future cloud support instead.

Tested (run the relevant ones):
- `bash format.sh`
- `sky launch -c test-runpod --cloud runpod --gpus RTX4090 nvidia-smi`
- `sky exec test-runpod nvidia-smi`
- `sky launch -c test-runpod nvidia-smi`
- `sky autostop -i 0 --down test-runpod`
- `sky status -r test-runpod`
- `sky launch -c test-runpod-mn --cloud runpod --num-nodes 2 --gpus RTX4090 nvidia-smi` (not supported yet, as the interconnection among nodes is non-trivial)
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `bash tests/backward_compatibility_tests.sh`
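To illustrate the provisioner-API shape the takeaway above advocates: each cloud implements a small set of module-level lifecycle functions (roughly run/stop/terminate/query) instead of a `node_provider` class. The sketch below is a simplified assumption about that interface, not SkyPilot's real signatures, and the RunPod pod states and their mapping are illustrative only.

```python
# Simplified, hypothetical sketch of a per-cloud provisioner module's
# status-query side. The pod state names and the mapping targets are
# assumptions for illustration, not SkyPilot's or RunPod's real values.
from typing import Dict, Optional

# Assumed raw pod states -> generic provisioner-level statuses.
# None means the instance no longer exists.
_STATUS_MAP: Dict[str, Optional[str]] = {
    'CREATED': 'INIT',
    'RUNNING': 'UP',
    'EXITED': 'STOPPED',
    'TERMINATED': None,
}


def map_status(pod_status: str) -> Optional[str]:
    """Translate a raw pod status into a provisioner-level status.

    Unknown states map to None, i.e. treated as not present; a real
    implementation would likely raise or log instead.
    """
    return _STATUS_MAP.get(pod_status)


if __name__ == '__main__':
    print(map_status('RUNNING'))
```

A clean mapping like this is what makes `sky status -r` reliable: the provisioner queries the cloud API directly rather than trusting the ray autoscaler's view from the head node.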