Jetstream Autoscaling Guide (GoogleCloudPlatform#703)
* first commit

* missing files

* various improvements

* some autoscaling changes for testing

* add targetlabels to podmonitoring

* Revert repo pinning

* more reversions

* more reversions

* cleanup

* more cleanup

* Added to README

* revert topology change

* tweaks to deployment

* HPA terraform fixes

* remove stray comment

* Add more to README

* parameterize metrics scrape port

* Cleaned up readme

* readme tweak

* typo

* remove indentation

* newline

* More updates to readme

* change wording

* Update metrics scrape example

* remove annotation

* terraform format

* missing comma

* maxengine-server in terraform

* wording

* terraform fmt

* parameterize container images

* wording

* remove ksa var

* move deployment to kubectl directory

* App -> app

* pipe from maxengine module to main

* Update tutorials-and-examples/inference-servers/jetstream/maxtext/single-host-inference/README.md

Co-authored-by: RupengLiu <[email protected]>

* remove TODO

* HPA can now scale with HBM

---------

Co-authored-by: RupengLiu <[email protected]>
Bslabe123 and liurupeng committed Jun 17, 2024
1 parent 1d625dd commit c62d2ba
Showing 16 changed files with 848 additions and 10 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -35,3 +35,4 @@ default.tfstate.backup
 terraform.tfstate*
 terraform.tfvars
 tfplan
+.vscode/
@@ -333,7 +333,7 @@ class GrpcBenchmarkUser(GrpcUser):
     def grpc_infer(self):
         prompt = get_random_prompt(self)
         request = jetstream_pb2.DecodeRequest(
-            text_content=jetstream_pb2.DecodeRequest.TextContent(text=request.prompt),
+            text_content=jetstream_pb2.DecodeRequest.TextContent(text=prompt),
             priority=0,
             max_tokens=model_params["max_output_len"],
         )
6 changes: 4 additions & 2 deletions benchmarks/inference-server/jetstream/jetstream.yaml
@@ -18,7 +18,7 @@ spec:
         cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
       containers:
       - name: maxengine-server
-        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
+        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
         args:
         - model_name=gemma-7b
         - tokenizer_path=assets/tokenizer.gemma
@@ -32,6 +32,8 @@ spec:
         - scan_layers=false
         - weight_dtype=bfloat16
         - load_parameters_path=gs://GEMMA_BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
+        - attention=dot_product
+        - prometheus_port=9100
         ports:
         - containerPort: 9000
         resources:
@@ -40,7 +42,7 @@ spec:
           limits:
             google.com/tpu: 4
       - name: jetstream-http
-        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.0
+        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.2
         ports:
         - containerPort: 8000
 ---
tutorials-and-examples/inference-servers/jetstream/maxtext/single-host-inference/README.md
@@ -121,9 +121,11 @@ Completed unscanning checkpoint to gs://BUCKET_NAME/final/unscanned/gemma_7b-it/

## Deploy Maxengine Server and HTTP Server

Next, deploy a Maxengine server hosting the Gemma-7b model. You can use the provided Maxengine server and HTTP server images or [build your own](#build-and-upload-maxengine-server-image). Depending on your needs and constraints, you can deploy either via Terraform or via kubectl.

### Deploy via Kubectl

First, navigate to the `./kubectl` directory. Add any desired overrides to your manifest by editing the `args` in `deployment.yaml`. See the [MaxText base config file](https://github.com/google/maxtext/blob/main/MaxText/configs/base.yml) for the values that can be overridden.
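
For example, a hypothetical set of overrides (the flag names come from the MaxText base config; the values are illustrative, not tuned recommendations):

```
args:
- model_name=gemma-7b
- tokenizer_path=assets/tokenizer.gemma
- per_device_batch_size=4          # illustrative: raise for throughput, lower for memory headroom
- max_prefill_predict_length=1024  # illustrative: match your expected prompt lengths
```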

In the manifest, ensure that the value of BUCKET_NAME is the name of the Cloud Storage bucket that was used when converting your checkpoint.
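
One way to do this is a quick substitution (a sketch; `your-checkpoint-bucket` is a placeholder for your bucket's name):

```
sed -i "s|BUCKET_NAME|your-checkpoint-bucket|g" deployment.yaml
```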

@@ -147,7 +149,55 @@
Deploy the manifest file for the Maxengine server and HTTP server:

```
kubectl apply -f deployment.yaml
```

### Deploy via Terraform

Navigate to the `./terraform` directory and run the standard [`terraform init`](https://developer.hashicorp.com/terraform/cli/commands/init). The deployment requires some inputs; an example `sample-terraform.tfvars` is provided as a starting point. Run `cp sample-terraform.tfvars terraform.tfvars` and modify the resulting `terraform.tfvars` as needed. Finally, run `terraform apply` to apply these resources to your cluster.
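
A typical session looks like the following sketch (run from this guide's directory):

```
cd terraform
terraform init
cp sample-terraform.tfvars terraform.tfvars
# Edit terraform.tfvars with your project, cluster, and model settings.
terraform plan   # optional: review the changes before applying
terraform apply
```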

#### (optional) Enable Horizontal Pod Autoscaling via Terraform

Applying the following resources to your cluster will enable autoscaling with custom metrics (a sketch of the PodMonitoring resource follows this list):
- PodMonitoring: for scraping metrics and exporting them to Google Cloud Monitoring.
- Custom Metrics Stackdriver Adapter (CMSA): for enabling your HPA objects to read metrics from the Google Cloud Monitoring API.
- [Horizontal Pod Autoscaler (HPA)](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/): for reading metrics and setting the `maxengine-server` deployment's replica count accordingly.
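
The Terraform module creates these for you, but for reference a minimal PodMonitoring looks roughly like this (the label selector and port are assumptions; they must match your deployment's pod labels and metrics port):

```
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: jetstream-podmonitoring
spec:
  selector:
    matchLabels:
      app: maxengine-server   # assumption: your deployment's pod label
  endpoints:
  - port: 9100                # assumption: the prometheus_port you configured
    interval: 15s
```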

These components require a few more inputs; rerunning the [prior step](#deploy-via-terraform) with them set will deploy the components. The following input conditions should be satisfied: `custom_metrics_enabled` should be `true`, and `metrics_port`, `hpa_type`, `hpa_averagevalue_target`, `hpa_min_replicas`, and `hpa_max_replicas` should all be set.
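
In `terraform.tfvars` that might look like the following (values are illustrative, and `hpa_type` is assumed here to name the metric to scale on):

```
custom_metrics_enabled  = true
metrics_port            = 9100
hpa_type                = "jetstream_prefill_backlog_size"
hpa_averagevalue_target = 10
hpa_min_replicas        = 1
hpa_max_replicas        = 4
```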

Note that only one HPA resource will be created. If you would like to scale based on multiple metrics, we recommend using the following template to apply additional HPA resources:

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: <YOUR_MIN_REPLICAS>
  maxReplicas: <YOUR_MAX_REPLICAS>
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|<YOUR_METRIC_NAME>|gauge
      target:
        type: AverageValue
        averageValue: <YOUR_VALUE_HERE>
```

If you would like to probe the metrics manually, `curl` your maxengine-server container on whatever metrics port you set. A minimal sketch, assuming the `prometheus_port=9100` used in `deployment.yaml` (the `/metrics` path is the Prometheus default):
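
```
kubectl port-forward deployment/maxengine-server 9100:9100 &
curl http://localhost:9100/metrics
```

You should see something similar to the following: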

```
# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="<SOME-HOSTNAME-HERE>"} 0.0
# HELP jetstream_slots_used_percentage The percentage of decode slots currently being used
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="<SOME-HOSTNAME-HERE>",idx="0"} 0.04166666666666663
```

### Verify the deployment

Wait for the containers to finish creating:
```
[...]
```

@@ -199,7 +249,7 @@
The output should be similar to the following:

```
[...]
}
```

## Other optional steps
### Build and upload Maxengine Server image

Build the Maxengine Server image from [here](../maxengine-server) and upload it to your project.
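
A typical build-and-push flow (a sketch; the image path mirrors the `jetstream-http` example below):

```
docker build -t gcr.io/${PROJECT_ID}/jetstream/maxtext/maxengine-server:latest .
docker push gcr.io/${PROJECT_ID}/jetstream/maxtext/maxengine-server:latest
```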
@@ -223,7 +273,7 @@ docker push gcr.io/${PROJECT_ID}/jetstream/maxtext/jetstream-http:latest
The Jetstream HTTP server is great for initial testing and for validating end-to-end requests and responses. If you would like to interact with the Maxengine server directly, for use cases such as [benchmarking](https://github.com/google/JetStream/tree/main/benchmarks), follow the Jetstream benchmarking setup, apply the `deployment.yaml` manifest file, and interact with the Jetstream gRPC server at port 9000.

```
kubectl apply -f kubectl/deployment.yaml
kubectl port-forward svc/jetstream-svc 9000:9000
```
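
Once the port-forward is in place, you can issue a request over gRPC. A rough Python sketch modeled on the locustfile above (the module path and `OrchestratorStub` service name are assumptions; check the JetStream protos for the exact API):

```
import grpc

# Assumption: proto modules generated from JetStream's jetstream.proto.
from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = jetstream_pb2_grpc.OrchestratorStub(channel)  # assumption: service name

request = jetstream_pb2.DecodeRequest(
    text_content=jetstream_pb2.DecodeRequest.TextContent(text="What is JAX?"),
    priority=0,
    max_tokens=256,
)

# Decode is assumed to be a server-streaming RPC; print each chunk as it arrives.
for response in stub.Decode(request):
    print(response)
```
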
@@ -17,7 +17,7 @@ spec:
         cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
       containers:
       - name: maxengine-server
-        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
+        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
         imagePullPolicy: Always
         securityContext:
           privileged: true
@@ -34,6 +34,7 @@ spec:
         - scan_layers=false
         - weight_dtype=bfloat16
         - load_parameters_path=gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
         - attention=dot_product
+        - prometheus_port=9100
         ports:
         - containerPort: 9000
@@ -64,4 +65,3 @@ spec:
   - name: jetstream-grpc
     port: 9000
     targetPort: 9000
-
@@ -0,0 +1,26 @@
# Custom Metrics Stackdriver Adapter

Adapted from https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

## Usage

To use this module, include it from your main Terraform config:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
}
```

For a workload identity enabled cluster, some additional configuration is
needed:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
workload_identity = {
enabled = true
project_id = "<PROJECT_ID>"
}
}
```