New provisioner for RunPod #2829
Conversation
Thanks for adding the RunPod cloud provider! I've just done a pass and left several nits. It mostly looks good to me 🫡
LGTM on my side 🫡
from sky import sky_logging
from sky import status_lib
from sky.provision import common
from sky.provision.runpod import utils
I just started reading this file without looking at those imports and found the `utils` module a little confusing - it's not obvious what it contains. Just a nit, and I'm happy with the current implementation too 🫡
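To make the reviewer's point concrete: a provisioner-side `utils` module is typically a thin wrapper around the cloud SDK plus some pure helper logic. The sketch below is purely illustrative and is not the actual contents of `sky/provision/runpod/utils.py`; the `runpod.get_pods()` call is an assumption about the runpod-python SDK, and the name-prefix tagging convention is likewise an assumption.

```python
# Hypothetical sketch of a provisioner-side `utils` module: SDK wrappers
# plus pure filtering logic. Not the real sky/provision/runpod/utils.py.
from typing import Any, Dict, List


def filter_cluster_pods(pods: List[Dict[str, Any]],
                        cluster_name: str) -> List[Dict[str, Any]]:
    """Keep only the pods whose names mark them as part of this cluster.

    Assumes instances are tagged by prefixing the pod name with the
    cluster name (an illustrative convention, not necessarily RunPod's).
    """
    prefix = f'{cluster_name}-'
    return [p for p in pods if p.get('name', '').startswith(prefix)]


def list_instances(cluster_name: str) -> List[Dict[str, Any]]:
    """Query the RunPod API and return this cluster's pods (sketch only)."""
    import runpod  # assumed: the runpod-python SDK
    return filter_cluster_pods(runpod.get_pods(), cluster_name)


if __name__ == '__main__':
    sample = [{'name': 'test-runpod-head', 'id': 'a'},
              {'name': 'other-cluster-head', 'id': 'b'}]
    print(filter_cluster_pods(sample, 'test-runpod'))
```

Keeping the filtering logic separate from the SDK call makes the pure part unit-testable without network access.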
Do we have a fetcher for runpod?
No. It is being added. I don't think it is necessary for merging this PR.
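For context on what a "fetcher" would do: catalog fetchers pull accelerator and pricing data from a cloud's API and emit CSV rows for the service catalog. The sketch below is a rough illustration only; `runpod.get_gpus()` and the field names `displayName`/`memoryInGb` are assumptions about the runpod-python SDK, and the two-column layout is not the real catalog schema.

```python
# Rough sketch of a RunPod catalog fetcher: turn GPU records from the API
# into catalog-style CSV. Column layout and SDK field names are assumptions.
import csv
import io
from typing import Any, Dict, List


def gpus_to_csv(gpus: List[Dict[str, Any]]) -> str:
    """Render GPU records as catalog-style CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['AcceleratorName', 'GpuMemoryGiB'])
    for gpu in gpus:
        writer.writerow([gpu['displayName'], gpu['memoryInGb']])
    return buf.getvalue()


def fetch_catalog() -> str:
    """Fetch the live GPU list and render it (network call, sketch only)."""
    import runpod  # assumed: the runpod-python SDK
    return gpus_to_csv(runpod.get_gpus())


if __name__ == '__main__':
    print(gpus_to_csv([{'displayName': 'RTX 4090', 'memoryInGb': 24}]))
```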
Squashed commit history:

* init
* remove ray
* update config
* update
* update
* update
* complete bootstrapping
* add start instance
* fix
* fix
* fix
* update
* wait stopping instances
* support normal gcp tpus first
* fix gcp
* support get cluster info
* fix
* update
* wait for instance starting
* rename
* hide gcp package import
* fix
* fix
* update constants
* fix comments
* remove unused methods
* fix comments
* sync 'config' & 'constants' with upstream, Nov 16
* sync 'instance_utils' with the upstream, Nov 16
* fix typing
* parallelize provisioning
* Fix TPU node
* Fix TPU NAME env for tpu node
* implement bulk provision
* refactor selflink
* format
* reduce the sleep time for autostop
* provisioner version refactoring
* refactor
* Add logging
* avoid saving the provisioner version
* format
* format
* Fix scheduling field in config
* format
* fix public key content
* Fix provisioner version for azure
* Use ray port from head node for workers
* format
* fix ray_port
* fix smoke tests
* shorter sleep time
* refactor status refresh version
* Use new provisioner to launch runpod to avoid issue with ray autoscaler on head
* Add wait for the instances to be ready
* fix setup
* Retry and give for getting internal IP
* comment
* Remove internal IP
* use external IP TODO: use external ray port
* fix ssh port
* Unsupported feature
* typo
* fix ssh ports
* rename var
* format
* Fix cloud unsupported resources
* Runpod update name mapping (#2945)
* Avoid using GpuInfo
* fix all_regions
* Fix runpod list accelerators
* format
* revert to GpuInfo
* Fix get_feasible_launchable_resources
* Add error
* Fix optimizer random_dag for feature check
* address comments
* remove test code
* format
* Add type hints
* format
* format
* fix keyerror
* Address comments
* Fix ports
* Add docs for runpod installation
* typo
* typo
* update doc

Co-authored-by: Justin Merrell <[email protected]>
Co-authored-by: Siyuan <[email protected]>
Co-authored-by: Doyoung Kim <[email protected]>
This PR is based on #2360 and #2681.
Background
This is an attempt to adopt the new provisioner for RunPod instead of using the `ray up` implementation. It changes the implementation in #2360 to use our new provisioner API.

Main Reason

It is very hard to get the ray autoscaler to work on the head node when the cluster is launched with `ray up`. When created with `ray up`, the ray autoscaler calls the runpod API on the head node to check the health of the cluster, and due to the special IP setup in runpod it is very hard for the autoscaler to obtain the correct internal IP. As a result, `ray status` on the head node displays no healthy nodes, leading to the failure of `sky status -r`.

Takeaway

It is very easy to use the new provisioner API to implement support for a new cloud, especially when the cloud has a clean API - much easier than a `node_provider`-based implementation. We should recommend the new provisioner API for future cloud support instead.

Tested (run the relevant ones):
- `bash format.sh`
- `sky launch -c test-runpod --cloud runpod --gpus RTX4090 nvidia-smi`
- `sky exec test-runpod nvidia-smi`
- `sky launch -c test-runpod nvidia-smi`
- `sky autostop -i 0 --down test-runpod`
- `sky status -r test-runpod`
- `sky launch -c test-runpod-mn --cloud runpod --num-nodes 2 --gpus RTX4090 nvidia-smi` (not supported yet, as the interconnection among nodes is non-trivial)
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `bash tests/backward_compatibility_tests.sh`
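To illustrate the provisioner-API shape the takeaway above advocates: each cloud implements a small set of module-level lifecycle functions (roughly run/stop/terminate/query) instead of a `node_provider` class. The sketch below is a simplified assumption about that interface, not SkyPilot's real signatures, and the RunPod pod states and their mapping are illustrative only.

```python
# Simplified, hypothetical sketch of a per-cloud provisioner module's
# status-query side. The pod state names and the mapping targets are
# assumptions for illustration, not SkyPilot's or RunPod's real values.
from typing import Dict, Optional

# Assumed raw pod states -> generic provisioner-level statuses.
# None means the instance no longer exists.
_STATUS_MAP: Dict[str, Optional[str]] = {
    'CREATED': 'INIT',
    'RUNNING': 'UP',
    'EXITED': 'STOPPED',
    'TERMINATED': None,
}


def map_status(pod_status: str) -> Optional[str]:
    """Translate a raw pod status into a provisioner-level status.

    Unknown states map to None, i.e. treated as not present; a real
    implementation would likely raise or log instead.
    """
    return _STATUS_MAP.get(pod_status)


if __name__ == '__main__':
    print(map_status('RUNNING'))
```

A clean mapping like this is what makes `sky status -r` reliable: the provisioner queries the cloud API directly rather than trusting the ray autoscaler's view from the head node.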