Ray TPU Webhook Autoscaling Changes #180

Merged on Mar 12, 2024

26 commits
682a2d0 Changed Ray worker template to use numOfHosts (ryanaoleary, Jan 20, 2024)
071ce46 Use numOfHosts when checking workers match topology (ryanaoleary, Jan 20, 2024)
77cc31c Generate DNS hostnames using numOfHosts (ryanaoleary, Jan 20, 2024)
5b8e4ef Update go pkgs for kuberay CRD changes (ryanaoleary, Jan 22, 2024)
015da8f numOfHosts -> NumOfHosts and better var names (ryanaoleary, Jan 22, 2024)
94a4f3d Go version 1.22 -> 1.21 (ryanaoleary, Jan 22, 2024)
ec9c0d0 Allow kuberay tpu webhook address to be configured (spencer-p, Feb 8, 2024)
06698f0 numOfHosts -> NumOfHosts and better var names (ryanaoleary, Jan 22, 2024)
d6c2eca Allow kuberay tpu webhook address to be configured (spencer-p, Feb 8, 2024)
f452807 Generate hostnames for multi-host replicas (ryanaoleary, Feb 28, 2024)
1dcf6de Kuberay v1.1 Refactoring (ryanaoleary, Mar 4, 2024)
5d5fb40 Add Pod deletion logic to webhook (ryanaoleary, Mar 5, 2024)
7880422 go fmt changes (ryanaoleary, Mar 5, 2024)
72825f4 Inject pod affinity and anti-affinity labels (ryanaoleary, Mar 5, 2024)
aab34c1 Use ray-operator v1.1.0-rc.0 (ryanaoleary, Mar 5, 2024)
2671900 Update image tags to 1.1 for new changes (ryanaoleary, Mar 5, 2024)
f7e4f50 Update documentation for v1.1 changes (ryanaoleary, Mar 5, 2024)
d9ee210 Remove headless service creation from terraform (now done by ray-oper… (ryanaoleary, Mar 6, 2024)
422d98e Remove duplicate path var (ryanaoleary, Mar 7, 2024)
45c83a4 Add check for RayCluster name in getReplicaIndex (ryanaoleary, Mar 7, 2024)
9ca9c9f Change isRunning bool to isCreated (ryanaoleary, Mar 7, 2024)
8070d22 Update documentation to specify versions (ryanaoleary, Mar 7, 2024)
7b0a3ae Add in check for v5e TPU pods (ryanaoleary, Mar 8, 2024)
d96bb0f Default chipsPerHost to 4 (ryanaoleary, Mar 9, 2024)
63b17d9 Update prerequisites in README (ryanaoleary, Mar 11, 2024)
993dffd Simplify podAffinity injection (ryanaoleary, Mar 12, 2024)
4 changes: 2 additions & 2 deletions applications/ray/TPU_guide.md
Original file line number Diff line number Diff line change
@@ -38,7 +38,7 @@ accelerator_type = "nvidia-tesla-t4"

### Installing the TPU Initialization Webhook

The TPU Initialization Webhook automatically injects the `TPU_WORKER_ID`, `TPU_NAME`, and `TPU_WORKER_HOSTNAMES` environment variables necessary for multi-host TPU clusters. The webhook needs to be installed once per GKE cluster. The instructions can be found [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/ray/kuberay-tpu-webhook).
The TPU Initialization Webhook automatically injects the `TPU_WORKER_ID`, `TPU_NAME`, and `TPU_WORKER_HOSTNAMES` environment variables necessary for multi-host TPU clusters. The webhook needs to be installed once per GKE cluster, and it requires Kuberay Operator v1.1 or later and GKE cluster version 1.28 or later. The instructions can be found [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/ray/kuberay-tpu-webhook).
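
As a rough illustration of what the injected values might look like, the sketch below builds the three variables for each worker in one multi-host slice. The function name `tpuEnvForSlice`, the example slice and service names, and the `<slice>-<index>.<service>` hostname pattern are assumptions made for this example and are not taken from the webhook's code. It also assumes that `TPU_NAME` and `TPU_WORKER_HOSTNAMES` are shared by every host in a slice while `TPU_WORKER_ID` is unique per host.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tpuEnvForSlice returns, for each worker in a slice, the three environment
// variables named in the guide. The hostname pattern "<slice>-<i>.<service>"
// is a hypothetical choice for illustration only.
func tpuEnvForSlice(sliceName, service string, numOfHosts int) []map[string]string {
	hostnames := make([]string, numOfHosts)
	for i := range hostnames {
		hostnames[i] = fmt.Sprintf("%s-%d.%s", sliceName, i, service)
	}
	joined := strings.Join(hostnames, ",")

	envs := make([]map[string]string, numOfHosts)
	for i := range envs {
		envs[i] = map[string]string{
			"TPU_WORKER_ID":        strconv.Itoa(i), // unique per host within the slice
			"TPU_NAME":             sliceName,       // assumed identical for all hosts in the slice
			"TPU_WORKER_HOSTNAMES": joined,          // same comma-separated list on every host
		}
	}
	return envs
}

func main() {
	for _, env := range tpuEnvForSlice("tpu-slice-0", "headless-svc", 2) {
		fmt.Println(env["TPU_WORKER_ID"], env["TPU_NAME"], env["TPU_WORKER_HOSTNAMES"])
	}
}
```

The point of the sketch is only that each host needs its own ID plus a consistent view of its peers, which is why the values have to be coordinated per slice rather than set statically in the pod template.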

### Creating the Kuberay Cluster

@@ -54,7 +54,7 @@ The TPU Initialization Webhook automatically injects the `TPU_WORKER_ID`, `TPU_N

This should deploy a Kuberay cluster with a single TPU worker node (v4 TPU with `2x2x1` topology).

To deploy a multi-host Ray Cluster, modify the `worker` spec [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-cluster/kuberay-tpu-values.yaml) by changing the `cloud.google.com/gke-tpu-topology` `nodeSelector` to a multi-host topology. Set the `replicas` field in the `worker` spec to the number of hosts specified by your chosen topology. For v4 TPUs, each TPU VM has access to 4 TPU chips. Therefore, you can calculate the number of TPU VM hosts by taking the product of the topology and dividing by 4 (i.e. a 2x2x4 TPU podslice will have 4 TPU VM hosts).
To deploy a multi-host Ray Cluster, modify the `worker` spec [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-cluster/kuberay-tpu-values.yaml) by changing the `cloud.google.com/gke-tpu-topology` `nodeSelector` to a multi-host topology. Set the `numOfHosts` field in the `worker` spec to the number of hosts specified by your chosen topology. For v4 TPUs, each TPU VM has access to 4 TPU chips. Therefore, you can calculate the number of TPU VM hosts by multiplying the topology dimensions and dividing by 4 (e.g. a 2x2x4 TPU podslice has 16 chips and therefore 4 TPU VM hosts).
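
The host-count arithmetic can be sketched in a few lines of Go. The helper name `hostsForTopology` is hypothetical, and the function assumes the v4 case described above, where the chip count divides evenly by the chips per host; it is not the webhook's own calculation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hostsForTopology multiplies the dimensions of a topology string such as
// "2x2x4" to get the chip count, then divides by chipsPerHost.
// Hypothetical helper for illustration; assumes an even division.
func hostsForTopology(topology string, chipsPerHost int) (int, error) {
	if chipsPerHost <= 0 {
		return 0, fmt.Errorf("chipsPerHost must be positive, got %d", chipsPerHost)
	}
	chips := 1
	for _, dim := range strings.Split(topology, "x") {
		n, err := strconv.Atoi(dim)
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid topology %q", topology)
		}
		chips *= n
	}
	if chips%chipsPerHost != 0 {
		return 0, fmt.Errorf("%d chips do not divide evenly across hosts of %d chips", chips, chipsPerHost)
	}
	return chips / chipsPerHost, nil
}

func main() {
	hosts, err := hostsForTopology("2x2x4", 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(hosts) // 2*2*4 = 16 chips, 16/4 = 4 hosts
}
```

With the single-host `2x2x1` topology the same function returns 1, which matches the single TPU worker node deployed by default.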

### Running Sample Workloads

2 changes: 1 addition & 1 deletion applications/ray/kuberay-tpu-webhook/Makefile
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Image URL to use all building/pushing image targets
IMG ?= us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.0
IMG ?= us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.1

# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
ifeq (,$(shell go env GOBIN))
7 changes: 6 additions & 1 deletion applications/ray/kuberay-tpu-webhook/README.md
Original file line number Diff line number Diff line change
@@ -11,6 +11,11 @@ Preinstall on your computer:
- Helm
- Gcloud

When installing with Terraform, ensure that:
- The GKE cluster is created with version 1.28+
- The Kuberay Operator version is set to v1.1+
- The operator version can be edited in `ai-on-gke/modules/kuberay-operator/kuberay.tf` before running `terraform apply`

### Installing the GKE Platform

1. If needed, git clone https://github.com/GoogleCloudPlatform/ai-on-gke
@@ -52,4 +57,4 @@ After deploying the webhook, follow the steps in ray/TPU_GUIDE to setup Ray on G

### Limitations

The webhook stores unique `TPU_WORKER_ID`s in memory, and will fail to initialize the environment variables correctly if the webhook pod dies or restarts before intercepting all pods. Finally, environment vars are not updated or removed after the initial admission request.
The webhook stores the `TPU_WORKER_HOSTNAMES` and unique `TPU_WORKER_ID`s for each slice in memory, so it will fail to initialize the environment variables correctly if the webhook pod dies or restarts before it has intercepted all pods.
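
This limitation can be seen in a toy model. The sketch below is not the webhook's code: `idAllocator`, `newIDAllocator`, and `assign` are hypothetical names for a per-slice counter held only in process memory. Replacing the allocator with a fresh one, as a pod restart effectively would, causes the next pod in the same slice to receive an ID that has already been handed out.

```go
package main

import "fmt"

// idAllocator is a hypothetical stand-in for state kept only in process memory.
type idAllocator struct {
	next map[string]int // slice name -> next unassigned worker ID
}

func newIDAllocator() *idAllocator {
	return &idAllocator{next: make(map[string]int)}
}

// assign returns the next worker ID for the slice and advances the counter.
func (a *idAllocator) assign(slice string) int {
	id := a.next[slice]
	a.next[slice] = id + 1
	return id
}

func main() {
	alloc := newIDAllocator()
	fmt.Println(alloc.assign("slice-a")) // 0: first pod in the slice
	fmt.Println(alloc.assign("slice-a")) // 1: second pod in the slice

	// Simulate a restart: the in-memory counters are lost.
	alloc = newIDAllocator()
	fmt.Println(alloc.assign("slice-a")) // 0 again: collides with the first pod's ID
}
```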
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@ spec:
app: kuberay-tpu-webhook
spec:
containers:
- image: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.0
- image: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.1
imagePullPolicy: Always
name: kuberay-tpu-webhook
ports:
Original file line number Diff line number Diff line change
@@ -14,10 +14,6 @@ webhooks:
namespace: default
path: /mutate
rules:
- operations: ["CREATE"]
apiGroups: ["ray.io"]
apiVersions: ["*"]
resources: ["rayclusters"]
- operations: ["CREATE"]
apiGroups: [""]
apiVersions: ["v1"]
Original file line number Diff line number Diff line change
@@ -14,6 +14,10 @@ webhooks:
namespace: default
path: /validate
rules:
- operations: ["DELETE"]
apiGroups: [""]
apiVersions: ["v1"]
resources: ["pods"]
- operations: ["CREATE"]
apiGroups: ["ray.io"]
apiVersions: ["*"]
65 changes: 50 additions & 15 deletions applications/ray/kuberay-tpu-webhook/go.mod
Original file line number Diff line number Diff line change
@@ -3,32 +3,67 @@ module github.com/GoogleCloudPlatform/kuberay-tpu-webhook
go 1.21

require (
github.com/ray-project/kuberay/ray-operator v1.0.0
k8s.io/api v0.28.3
k8s.io/klog/v2 v2.100.1
github.com/ray-project/kuberay/ray-operator v1.1.0-rc.0
k8s.io/api v0.29.1
k8s.io/apimachinery v0.29.1
k8s.io/klog/v2 v2.120.1
)

require (
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/fsnotify/fsnotify v1.6.0 // indirect
github.com/go-logr/logr v1.2.4 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/emicklei/go-restful/v3 v3.11.0 // indirect
github.com/evanphx/json-patch/v5 v5.8.0 // indirect
github.com/fsnotify/fsnotify v1.7.0 // indirect
github.com/go-logr/logr v1.4.1 // indirect
github.com/go-openapi/jsonpointer v0.19.6 // indirect
github.com/go-openapi/jsonreference v0.20.2 // indirect
github.com/go-openapi/swag v0.22.3 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
github.com/golang/protobuf v1.5.3 // indirect
github.com/google/gnostic-models v0.6.8 // indirect
github.com/google/go-cmp v0.6.0 // indirect
github.com/google/gofuzz v1.2.0 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/google/uuid v1.3.1 // indirect
github.com/imdario/mergo v0.3.12 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/matttproud/golang_protobuf_extensions/v2 v2.0.0 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/onsi/gomega v1.27.6 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/stretchr/testify v1.8.4 // indirect
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.15.0 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/client_golang v1.18.0 // indirect
github.com/prometheus/client_model v0.5.0 // indirect
github.com/prometheus/common v0.45.0 // indirect
github.com/prometheus/procfs v0.12.0 // indirect
github.com/rogpeppe/go-internal v1.11.0 // indirect
github.com/spf13/pflag v1.0.5 // indirect
golang.org/x/exp v0.0.0-20220722155223-a9213eeb770e // indirect
golang.org/x/net v0.20.0 // indirect
golang.org/x/oauth2 v0.12.0 // indirect
golang.org/x/sys v0.16.0 // indirect
golang.org/x/term v0.16.0 // indirect
golang.org/x/text v0.14.0 // indirect
golang.org/x/time v0.3.0 // indirect
golang.org/x/tools v0.17.0 // indirect
gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/protobuf v1.32.0 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
k8s.io/apimachinery v0.28.3 // indirect
k8s.io/utils v0.0.0-20230406110748-d93618cff8a2 // indirect
sigs.k8s.io/controller-runtime v0.11.1 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/apiextensions-apiserver v0.29.0 // indirect
k8s.io/client-go v0.29.0 // indirect
k8s.io/component-base v0.29.0 // indirect
k8s.io/kube-openapi v0.0.0-20231010175941-2dd684a91f00 // indirect
k8s.io/utils v0.0.0-20240102154912-e7106e64919e // indirect
sigs.k8s.io/controller-runtime v0.17.0 // indirect
sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.2.3 // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect
sigs.k8s.io/yaml v1.4.0 // indirect
)