Jetstream Autoscaling Guide (GoogleCloudPlatform#703)
* first commit

* missing files

* various improvements

* some autoscaling changes for testing

* add targetlabels to podmonitoring

* Revert repo pinning

* more reversions

* more reversions

* cleanup

* more cleanup

* Added to README

* revert topology change

* tweaks to deployment

* HPA terraform fixes

* remove stray comment

* Add more to README

* parameterize metrics scrape port

* Cleaned up readme

* readme tweak

* typo

* remove indentation

* newline

* More updates to readme

* change wording

* Update metrics scrape example

* remove annotation

* terraform format

* missing comma

* maxengine-server in terraform

* wording

* terraform fmt

* parameterize container images

* wording

* remove ksa var

* move deployment to kubectl directory

* App -> app

* pipe from maxengine module to main

* Update tutorials-and-examples/inference-servers/jetstream/maxtext/single-host-inference/README.md

Co-authored-by: RupengLiu <[email protected]>

* remove TODO

* HPA can now scale with HBM

---------

Co-authored-by: RupengLiu <[email protected]>
Bslabe123 and liurupeng committed Jun 17, 2024
1 parent 1d625dd commit c62d2ba
Showing 16 changed files with 848 additions and 10 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -35,3 +35,4 @@ default.tfstate.backup
 terraform.tfstate*
 terraform.tfvars
 tfplan
+.vscode/
@@ -333,7 +333,7 @@ class GrpcBenchmarkUser(GrpcUser):
     def grpc_infer(self):
         prompt = get_random_prompt(self)
         request = jetstream_pb2.DecodeRequest(
-            text_content=jetstream_pb2.DecodeRequest.TextContent(text=request.prompt),
+            text_content=jetstream_pb2.DecodeRequest.TextContent(text=prompt),
             priority=0,
             max_tokens=model_params["max_output_len"],
         )
6 changes: 4 additions & 2 deletions benchmarks/inference-server/jetstream/jetstream.yaml
@@ -18,7 +18,7 @@ spec:
         cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
       containers:
       - name: maxengine-server
-        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
+        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
         args:
         - model_name=gemma-7b
         - tokenizer_path=assets/tokenizer.gemma
@@ -32,6 +32,8 @@ spec:
         - scan_layers=false
         - weight_dtype=bfloat16
         - load_parameters_path=gs://GEMMA_BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
+        - attention=dot_product
+        - prometheus_port=9100
         ports:
         - containerPort: 9000
         resources:
@@ -40,7 +42,7 @@ spec:
           limits:
             google.com/tpu: 4
       - name: jetstream-http
-        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.0
+        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.2
         ports:
         - containerPort: 8000
 ---
tutorials-and-examples/inference-servers/jetstream/maxtext/single-host-inference/README.md
@@ -121,9 +121,11 @@ Completed unscanning checkpoint to gs://BUCKET_NAME/final/unscanned/gemma_7b-it/

## Deploy Maxengine Server and HTTP Server

Next, deploy a Maxengine server hosting the Gemma-7b model. You can use the provided Maxengine server and HTTP server images or [build your own](#build-and-upload-maxengine-server-image). Depending on your needs and constraints, you can deploy either via Terraform or via kubectl.

### Deploy via Kubectl

First, navigate to the `./kubectl` directory. Add any desired overrides to your manifest by editing the `args` in `deployment.yaml`. See the [MaxText base config file](https://github.com/google/maxtext/blob/main/MaxText/configs/base.yml) for the values that can be overridden.
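
For example, a hypothetical set of overrides (the flag names come from the MaxText base config; the values are illustrative, not tuned recommendations):

```
args:
- model_name=gemma-7b
- tokenizer_path=assets/tokenizer.gemma
- per_device_batch_size=4          # illustrative: raise for throughput, lower for memory headroom
- max_prefill_predict_length=1024  # illustrative: match your expected prompt lengths
```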

In the manifest, ensure that the value of BUCKET_NAME is the name of the Cloud Storage bucket that was used when converting your checkpoint.
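
One way to do this is a quick substitution (a sketch; `your-checkpoint-bucket` is a placeholder for your bucket's name):

```
sed -i "s|BUCKET_NAME|your-checkpoint-bucket|g" deployment.yaml
```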

@@ -147,7 +149,55 @@
Deploy the manifest file for the Maxengine server and HTTP server:

```
kubectl apply -f deployment.yaml
```

### Deploy via Terraform

Navigate to the `./terraform` directory and run the standard [`terraform init`](https://developer.hashicorp.com/terraform/cli/commands/init). The deployment requires some inputs; an example `sample-terraform.tfvars` is provided as a starting point. Run `cp sample-terraform.tfvars terraform.tfvars` and modify the resulting `terraform.tfvars` as needed. Finally, run `terraform apply` to apply these resources to your cluster.
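
A typical session looks like the following sketch (run from this guide's directory):

```
cd terraform
terraform init
cp sample-terraform.tfvars terraform.tfvars
# Edit terraform.tfvars with your project, cluster, and model settings.
terraform plan   # optional: review the changes before applying
terraform apply
```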

#### (optional) Enable Horizontal Pod Autoscaling via Terraform

Applying the following resources to your cluster will enable autoscaling with custom metrics (a sketch of the PodMonitoring resource follows this list):
- PodMonitoring: for scraping metrics and exporting them to Google Cloud Monitoring.
- Custom Metrics Stackdriver Adapter (CMSA): for enabling your HPA objects to read metrics from the Google Cloud Monitoring API.
- [Horizontal Pod Autoscaler (HPA)](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/): for reading metrics and setting the `maxengine-server` deployment's replica count accordingly.
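
The Terraform module creates these for you, but for reference a minimal PodMonitoring looks roughly like this (the label selector and port are assumptions; they must match your deployment's pod labels and metrics port):

```
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: jetstream-podmonitoring
spec:
  selector:
    matchLabels:
      app: maxengine-server   # assumption: your deployment's pod label
  endpoints:
  - port: 9100                # assumption: the prometheus_port you configured
    interval: 15s
```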

These components require a few more inputs; rerunning the [prior step](#deploy-via-terraform) with them set will deploy the components. The following input conditions should be satisfied: `custom_metrics_enabled` should be `true`, and `metrics_port`, `hpa_type`, `hpa_averagevalue_target`, `hpa_min_replicas`, and `hpa_max_replicas` should all be set.
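
In `terraform.tfvars` that might look like the following (values are illustrative, and `hpa_type` is assumed here to name the metric to scale on):

```
custom_metrics_enabled  = true
metrics_port            = 9100
hpa_type                = "jetstream_prefill_backlog_size"
hpa_averagevalue_target = 10
hpa_min_replicas        = 1
hpa_max_replicas        = 4
```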

Note that only one HPA resource will be created. If you would like to scale based on multiple metrics, we recommend using the following template to apply additional HPA resources:

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: <YOUR_MIN_REPLICAS>
  maxReplicas: <YOUR_MAX_REPLICAS>
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|<YOUR_METRIC_NAME>|gauge
      target:
        type: AverageValue
        averageValue: <YOUR_VALUE_HERE>
```

If you would like to probe the metrics manually, `curl` your maxengine-server container on whatever metrics port you set. A minimal sketch, assuming the `prometheus_port=9100` used in `deployment.yaml` (the `/metrics` path is the Prometheus default):
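
```
kubectl port-forward deployment/maxengine-server 9100:9100 &
curl http://localhost:9100/metrics
```

You should see something similar to the following: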

```
# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="<SOME-HOSTNAME-HERE>"} 0.0
# HELP jetstream_slots_used_percentage The percentage of decode slots currently being used
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="<SOME-HOSTNAME-HERE>",idx="0"} 0.04166666666666663
```

### Verify the deployment

Wait for the containers to finish creating:
```
[...]
```

@@ -199,7 +249,7 @@
The output should be similar to the following:

```
[...]
}
```

## Other optional steps
### Build and upload Maxengine Server image

Build the Maxengine Server image from [here](../maxengine-server) and upload it to your project.
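
A typical build-and-push flow (a sketch; the image path mirrors the `jetstream-http` example below):

```
docker build -t gcr.io/${PROJECT_ID}/jetstream/maxtext/maxengine-server:latest .
docker push gcr.io/${PROJECT_ID}/jetstream/maxtext/maxengine-server:latest
```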
@@ -223,7 +273,7 @@ docker push gcr.io/${PROJECT_ID}/jetstream/maxtext/jetstream-http:latest
The Jetstream HTTP server is great for initial testing and for validating end-to-end requests and responses. If you would like to interact with the Maxengine server directly, for use cases such as [benchmarking](https://github.com/google/JetStream/tree/main/benchmarks), follow the Jetstream benchmarking setup, apply the `deployment.yaml` manifest file, and interact with the Jetstream gRPC server at port 9000.

```
kubectl apply -f kubectl/deployment.yaml
kubectl port-forward svc/jetstream-svc 9000:9000
```
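
Once the port-forward is in place, you can issue a request over gRPC. A rough Python sketch modeled on the locustfile above (the module path and `OrchestratorStub` service name are assumptions; check the JetStream protos for the exact API):

```
import grpc

# Assumption: proto modules generated from JetStream's jetstream.proto.
from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = jetstream_pb2_grpc.OrchestratorStub(channel)  # assumption: service name

request = jetstream_pb2.DecodeRequest(
    text_content=jetstream_pb2.DecodeRequest.TextContent(text="What is JAX?"),
    priority=0,
    max_tokens=256,
)

# Decode is assumed to be a server-streaming RPC; print each chunk as it arrives.
for response in stub.Decode(request):
    print(response)
```
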
@@ -17,7 +17,7 @@ spec:
         cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
       containers:
       - name: maxengine-server
-        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
+        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
         imagePullPolicy: Always
         securityContext:
           privileged: true
@@ -34,6 +34,7 @@ spec:
         - scan_layers=false
         - weight_dtype=bfloat16
         - load_parameters_path=gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
         - attention=dot_product
+        - prometheus_port=9100
         ports:
         - containerPort: 9000
@@ -64,4 +65,3 @@ spec:
   - name: jetstream-grpc
     port: 9000
     targetPort: 9000
-
@@ -0,0 +1,26 @@
# Custom Metrics Stackdriver Adapter

Adapted from https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

## Usage

To use this module, include it from your main Terraform config:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
}
```

For a workload identity enabled cluster, some additional configuration is
needed:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
workload_identity = {
enabled = true
project_id = "<PROJECT_ID>"
}
}
```