Ray TPU Webhook Autoscaling Changes (#180)
* Changed Ray worker template to use numOfHosts

* Use numOfHosts when checking workers match topology

* Generate DNS hostnames using numOfHosts

* Update go pkgs for kuberay CRD changes

* numOfHosts -> NumOfHosts and better var names

* Go version 1.22 -> 1.21

* Allow kuberay tpu webhook address to be configured

* numOfHosts -> NumOfHosts and better var names

* Allow kuberay tpu webhook address to be configured

* Generate hostnames for multi-host replicas

* Kuberay v1.1 Refactoring

* Add Pod deletion logic to webhook

* go fmt changes

* Inject pod affinity and anti-affinity labels

* Use ray-operator v1.1.0-rc.0

* Update image tags to 1.1 for new changes

* Update documentation for v1.1 changes

* Remove headless service creation from terraform (now done by ray-operator)

* Remove duplicate path var

* Add check for RayCluster name in getReplicaIndex

* Change isRunning bool to isCreated

* Update documentation to specify versions

* Add in check for v5e TPU pods

* Default chipsPerHost to 4

* Update prerequisites in README

* Simplify podAffinity injection

---------

Co-authored-by: Spencer Peterson
ryanaoleary and spencer-p committed Mar 12, 2024
1 parent 8c4b099 commit d389cda
Showing 12 changed files with 490 additions and 271 deletions.
4 changes: 2 additions & 2 deletions applications/ray/TPU_guide.md
@@ -38,7 +38,7 @@ accelerator_type = "nvidia-tesla-t4"

### Installing the TPU Initialization Webhook

-The TPU Initialization Webhook automatically injects the `TPU_WORKER_ID`, `TPU_NAME`, and `TPU_WORKER_HOSTNAMES` environment variables necessary for multi-host TPU clusters. The webhook needs to be installed once per GKE cluster. The instructions can be found [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/ray/kuberay-tpu-webhook).
+The TPU Initialization Webhook automatically injects the `TPU_WORKER_ID`, `TPU_NAME`, and `TPU_WORKER_HOSTNAMES` environment variables necessary for multi-host TPU clusters. The webhook needs to be installed once per GKE cluster and requires a Kuberay Operator running v1.1 and GKE cluster version of 1.28+. The instructions can be found [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/ray/kuberay-tpu-webhook).
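
As a rough illustration of what the injected variables might look like for one host of a multi-host slice, here is a minimal Go sketch. It is not the webhook's actual code: the function `tpuEnv`, its parameters, the `<sliceName>-<hostIndex>.<subdomain>` hostname pattern, and the comma-separated hostname list are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tpuEnv builds the three environment variables for one host of a slice.
// Assumption: hostnames follow "<sliceName>-<hostIndex>.<subdomain>"; the
// real webhook may use a different naming scheme.
func tpuEnv(sliceName, subdomain string, numOfHosts, hostIndex int) (map[string]string, error) {
	if hostIndex < 0 || hostIndex >= numOfHosts {
		return nil, fmt.Errorf("host index %d out of range [0, %d)", hostIndex, numOfHosts)
	}
	// Every host in the slice receives the same full list of hostnames.
	hostnames := make([]string, numOfHosts)
	for i := range hostnames {
		hostnames[i] = fmt.Sprintf("%s-%d.%s", sliceName, i, subdomain)
	}
	return map[string]string{
		"TPU_WORKER_ID":        strconv.Itoa(hostIndex),
		"TPU_NAME":             sliceName,
		"TPU_WORKER_HOSTNAMES": strings.Join(hostnames, ","),
	}, nil
}

func main() {
	// Hypothetical names for a 4-host slice; print the variables for host 2.
	env, err := tpuEnv("workergroup-0", "example-headless-svc", 4, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, key := range []string{"TPU_WORKER_ID", "TPU_NAME", "TPU_WORKER_HOSTNAMES"} {
		fmt.Printf("%s=%s\n", key, env[key])
	}
}
```

Only `TPU_WORKER_ID` differs between hosts of the same slice in this sketch; the name and hostname list are shared.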

### Creating the Kuberay Cluster

@@ -54,7 +54,7 @@ The TPU Initialization Webhook automatically injects the `TPU_WORKER_ID`, `TPU_N

This should deploy a Kuberay cluster with a single TPU worker node (v4 TPU with `2x2x1` topology).

-To deploy a multi-host Ray Cluster, modify the `worker` spec [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-cluster/kuberay-tpu-values.yaml) by changing the `cloud.google.com/gke-tpu-topology` `nodeSelector` to a multi-host topology. Set the `replicas` field in the `worker` spec to the number of hosts specified by your chosen topology. For v4 TPUs, each TPU VM has access to 4 TPU chips. Therefore, you can calculate the number of TPU VM hosts by taking the product of the topology and dividing by 4 (i.e. a 2x2x4 TPU podslice will have 4 TPU VM hosts).
+To deploy a multi-host Ray Cluster, modify the `worker` spec [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-cluster/kuberay-tpu-values.yaml) by changing the `cloud.google.com/gke-tpu-topology` `nodeSelector` to a multi-host topology. Set the `numOfHosts` field in the `worker` spec to the number of hosts specified by your chosen topology. For v4 TPUs, each TPU VM has access to 4 TPU chips. Therefore, you can calculate the number of TPU VM hosts by taking the product of the topology and dividing by 4 (i.e. a 2x2x4 TPU podslice will have 4 TPU VM hosts).

### Running Sample Workloads

2 changes: 1 addition & 1 deletion applications/ray/kuberay-tpu-webhook/Makefile
@@ -1,5 +1,5 @@
# Image URL to use all building/pushing image targets
-IMG ?= us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.0
+IMG ?= us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.1

# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
ifeq (,$(shell go env GOBIN))
7 changes: 6 additions & 1 deletion applications/ray/kuberay-tpu-webhook/README.md
@@ -11,6 +11,11 @@ Preinstall on your computer:
- Helm
- Gcloud

+When installing using terraform ensure that:
+- GKE cluster is created with version 1.28+
+- Kuberay Operator version is set to v1.1+
+- can edit operator version in ai-on-gke/modules/kuberay-operator/kuberay.tf before running `terraform apply`
+
### Installing the GKE Platform

1. If needed, git clone https://github.com/GoogleCloudPlatform/ai-on-gke
@@ -52,4 +57,4 @@ After deploying the webhook, follow the steps in ray/TPU_GUIDE to setup Ray on G

### Limitations

-The webhook stores unique `TPU_WORKER_ID`s in memory, and will fail to initialize the environment variables correctly if the webhook pod dies or restarts before intercepting all pods. Finally, environment vars are not updated or removed after the initial admission request.
+The webhook stores unique `TPU_WORKER_HOSTNAMES` and `TPU_WORKER_ID`s for each slice in memory, and will fail to initialize the environment variables correctly if the webhook pod dies or restarts before intercepting all pods.
Binary file not shown.
@@ -16,7 +16,7 @@ spec:
app: kuberay-tpu-webhook
spec:
containers:
-- image: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.0
+- image: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.1
imagePullPolicy: Always
name: kuberay-tpu-webhook
ports:
@@ -14,10 +14,6 @@ webhooks:
namespace: default
path: /mutate
rules:
-- operations: ["CREATE"]
-apiGroups: ["ray.io"]
-apiVersions: ["*"]
-resources: ["rayclusters"]
- operations: ["CREATE"]
apiGroups: [""]
apiVersions: ["v1"]
@@ -14,6 +14,10 @@ webhooks:
namespace: default
path: /validate
rules:
+- operations: ["DELETE"]
+apiGroups: [""]
+apiVersions: ["v1"]
+resources: ["pods"]
- operations: ["CREATE"]
apiGroups: ["ray.io"]
apiVersions: ["*"]
65 changes: 50 additions & 15 deletions applications/ray/kuberay-tpu-webhook/go.mod
@@ -3,32 +3,67 @@ module github.com/GoogleCloudPlatform/kuberay-tpu-webhook
go 1.21

require (
-github.com/ray-project/kuberay/ray-operator v1.0.0
-k8s.io/api v0.28.3
-k8s.io/klog/v2 v2.100.1
+github.com/ray-project/kuberay/ray-operator v1.1.0-rc.0
+k8s.io/api v0.29.1
+k8s.io/apimachinery v0.29.1
+k8s.io/klog/v2 v2.120.1
)

require (
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/fsnotify/fsnotify v1.6.0 // indirect
github.com/go-logr/logr v1.2.4 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/emicklei/go-restful/v3 v3.11.0 // indirect
github.com/evanphx/json-patch/v5 v5.8.0 // indirect
github.com/fsnotify/fsnotify v1.7.0 // indirect
github.com/go-logr/logr v1.4.1 // indirect
github.com/go-openapi/jsonpointer v0.19.6 // indirect
github.com/go-openapi/jsonreference v0.20.2 // indirect
github.com/go-openapi/swag v0.22.3 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
github.com/golang/protobuf v1.5.3 // indirect
github.com/google/gnostic-models v0.6.8 // indirect
github.com/google/go-cmp v0.6.0 // indirect
github.com/google/gofuzz v1.2.0 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/google/uuid v1.3.1 // indirect
github.com/imdario/mergo v0.3.12 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/matttproud/golang_protobuf_extensions/v2 v2.0.0 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/onsi/gomega v1.27.6 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/stretchr/testify v1.8.4 // indirect
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.15.0 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/client_golang v1.18.0 // indirect
github.com/prometheus/client_model v0.5.0 // indirect
github.com/prometheus/common v0.45.0 // indirect
github.com/prometheus/procfs v0.12.0 // indirect
github.com/rogpeppe/go-internal v1.11.0 // indirect
github.com/spf13/pflag v1.0.5 // indirect
golang.org/x/exp v0.0.0-20220722155223-a9213eeb770e // indirect
golang.org/x/net v0.20.0 // indirect
golang.org/x/oauth2 v0.12.0 // indirect
golang.org/x/sys v0.16.0 // indirect
golang.org/x/term v0.16.0 // indirect
golang.org/x/text v0.14.0 // indirect
golang.org/x/time v0.3.0 // indirect
golang.org/x/tools v0.17.0 // indirect
gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/protobuf v1.32.0 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
k8s.io/apimachinery v0.28.3 // indirect
k8s.io/utils v0.0.0-20230406110748-d93618cff8a2 // indirect
sigs.k8s.io/controller-runtime v0.11.1 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/apiextensions-apiserver v0.29.0 // indirect
k8s.io/client-go v0.29.0 // indirect
k8s.io/component-base v0.29.0 // indirect
k8s.io/kube-openapi v0.0.0-20231010175941-2dd684a91f00 // indirect
k8s.io/utils v0.0.0-20240102154912-e7106e64919e // indirect
sigs.k8s.io/controller-runtime v0.17.0 // indirect
sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.2.3 // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect
sigs.k8s.io/yaml v1.4.0 // indirect
)