Ray TPU Webhook Autoscaling Changes #180

Merged Mar 12, 2024 (26 commits)


@ryanaoleary (Collaborator) commented Jan 22, 2024

Update the Ray TPU webhook to be compatible with the KubeRay CRD changes (ray-project/kuberay#1834, ray-project/kuberay#1920, ray-project/kuberay#1913) in order to support TPU pod autoscaling. This PR was manually tested by deploying the webhook, creating a RayCluster with replicas = 2 and numOfHosts = 2, and verifying that 4 pods were scaled up and that the webhook correctly injects all labels, pod affinity constraints, and unique environment variables for each slice.
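The pod count in the manual test follows from multiplying replicas by numOfHosts. A minimal sketch of that arithmetic is below. The `workerGroup` and `podSlot` types and both helper functions are hypothetical names used only for illustration; they are not taken from the webhook's main.go, and the real webhook may track replica and host indices differently.

```go
package main

import "fmt"

// workerGroup is a hypothetical, trimmed-down stand-in for the two
// worker group fields discussed in this PR.
type workerGroup struct {
	Replicas   int32 // number of TPU slices requested
	NumOfHosts int32 // number of hosts (pods) per slice
}

// podSlot identifies one pod by the slice it belongs to and its
// position within that slice (illustrative only).
type podSlot struct {
	ReplicaIndex int32
	HostIndex    int32
}

// expectedPods returns how many worker pods the group should scale to.
func expectedPods(g workerGroup) int32 {
	return g.Replicas * g.NumOfHosts
}

// enumerateSlots lists every (replica, host) pair in the group, which is
// the granularity at which per-slice settings would need to be unique.
func enumerateSlots(g workerGroup) []podSlot {
	slots := make([]podSlot, 0, int(expectedPods(g)))
	for r := int32(0); r < g.Replicas; r++ {
		for h := int32(0); h < g.NumOfHosts; h++ {
			slots = append(slots, podSlot{ReplicaIndex: r, HostIndex: h})
		}
	}
	return slots
}

func main() {
	g := workerGroup{Replicas: 2, NumOfHosts: 2}
	fmt.Println("expected pods:", expectedPods(g)) // expected pods: 4
	for _, s := range enumerateSlots(g) {
		fmt.Printf("replica=%d host=%d\n", s.ReplicaIndex, s.HostIndex)
	}
}
```

With replicas = 2 and numOfHosts = 2 this yields four slots, matching the four pods observed in the manual test.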

@ryanaoleary self-assigned this Jan 22, 2024
@ryanaoleary force-pushed the kuberay-tpu-env-injector branch 2 times, most recently from df06001 to be4a759, February 12, 2024 20:25
@ryanaoleary force-pushed the kuberay-tpu-env-injector branch 6 times, most recently from f3240c2 to 6c0e411, March 5, 2024 22:49
@ryanaoleary marked this pull request as ready for review March 5, 2024 22:53
Resolved review threads:
applications/ray/kuberay-tpu-webhook/main.go: 7 threads (6 outdated)
applications/ray/TPU_guide.md: 2 threads
@andrewsykim (Collaborator) commented:

/gcbrun

@ryanaoleary (Collaborator, Author) commented:

/gcbrun

@ryanaoleary (Collaborator, Author) commented:

/gcbrun

@richardsliu merged commit d389cda into main Mar 12, 2024
6 of 7 checks passed
@ryanaoleary deleted the kuberay-tpu-env-injector branch March 12, 2024 22:19
annapendleton pushed a commit to annapendleton/ai-on-gke that referenced this pull request Mar 26, 2024
* Changed Ray worker template to use numOfHosts

* Use numOfHosts when checking workers match topology

* Generate DNS hostnames using numOfHosts

* Update go pkgs for kuberay CRD changes

* numOfHosts -> NumOfHosts and better var names

* Go version 1.22 -> 1.21

* Allow kuberay tpu webhook address to be configured

* Generate hostnames for multi-host replicas

* Kuberay v1.1 Refactoring

* Add Pod deletion logic to webhook

* go fmt changes

* Inject pod affinity and anti-affinity labels

* Use ray-operator v1.1.0-rc.0

* Update image tags to 1.1 for new changes

* Update documentation for v1.1 changes

* Remove headless service creation from terraform (now done by ray-operator)

* Remove duplicate path var

* Add check for RayCluster name in getReplicaIndex

* Change isRunning bool to isCreated

* Update documentation to specify versions

* Add in check for v5e TPU pods

* Default chipsPerHost to 4

* Update prerequisites in README

* Simplify podAffinity injection

---------

Co-authored-by: Spencer Peterson
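
Two of the commits above concern checking that workers match the TPU topology using NumOfHosts and defaulting chipsPerHost to 4. The sketch below shows one way such a check could work, under stated assumptions: the topology is a string such as "2x2x2", the host count is the chip count divided by chips per host (rounded up), and the function names `hostsForTopology` and `topologyMatches` are hypothetical. It is not the webhook's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultChipsPerHost mirrors the "Default chipsPerHost to 4" commit.
const defaultChipsPerHost = 4

// hostsForTopology parses a topology string like "2x2x2", multiplies the
// dimensions to get the chip count, and divides by chipsPerHost (rounding
// up). A non-positive chipsPerHost falls back to the default.
func hostsForTopology(topology string, chipsPerHost int) (int, error) {
	if chipsPerHost <= 0 {
		chipsPerHost = defaultChipsPerHost
	}
	chips := 1
	for _, dim := range strings.Split(topology, "x") {
		n, err := strconv.Atoi(strings.TrimSpace(dim))
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid topology %q", topology)
		}
		chips *= n
	}
	// Ceiling division so a slice smaller than one full host still
	// counts as a single host.
	return (chips + chipsPerHost - 1) / chipsPerHost, nil
}

// topologyMatches reports whether the declared numOfHosts agrees with the
// host count implied by the topology.
func topologyMatches(topology string, chipsPerHost int, numOfHosts int32) (bool, error) {
	hosts, err := hostsForTopology(topology, chipsPerHost)
	if err != nil {
		return false, err
	}
	return int32(hosts) == numOfHosts, nil
}

func main() {
	ok, err := topologyMatches("2x2x2", 0, 2)
	fmt.Println(ok, err) // true <nil>
}
```

Here a 2x2x2 topology has 8 chips, which at 4 chips per host implies 2 hosts, so a worker group with NumOfHosts = 2 passes the check.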
4 participants