Jetstream support (GoogleCloudPlatform#677)
kfswain committed May 24, 2024
1 parent 06b325d commit 62f8d3c
Showing 23 changed files with 386 additions and 19 deletions.
8 changes: 4 additions & 4 deletions benchmarks/README.md
@@ -34,7 +34,7 @@ cd infra/stage-1
# Copy the sample variables and update the project ID, cluster name and other
# parameters as needed in the `terraform.tfvars` file.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/gpu-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
@@ -67,7 +67,7 @@ cd infra/stage-2
# and the project name and bucket name parameters as needed in the
# `terraform.tfvars` file. You can specify a new bucket name in which case it
# will be created.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/gpu-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
@@ -88,7 +88,7 @@ cd inference-server/text-generation-inference
# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# in the `terraform.tfvars` file.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/gpu-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
@@ -120,7 +120,7 @@ cd benchmark/tools/locust-load-inference
# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# in the `terraform.tfvars` file.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/tgi-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
10 changes: 6 additions & 4 deletions benchmarks/benchmark/tools/locust-load-inference/README.md
@@ -37,6 +37,7 @@ The Locust benchmarking tool currently supports these frameworks:
- tensorrt_llm_triton
- text generation inference (tgi)
- vllm
- jetstream

## Instructions

@@ -49,7 +50,7 @@ This is my first prompt.\n
This is my second prompt.\n
```

Example prompt datasets are available in the "../../dataset" folder with python scripts and instructions on how to make the dataset available for consumption by this benchmark. The dataset used in the `sample-terraform.tfvars` is the "ShareGPT_v3_unflitered_cleaned_split".
Example prompt datasets are available in the "../../dataset" folder with python scripts and instructions on how to make the dataset available for consumption by this benchmark. The dataset used in the `./sample-tfvars/tgi-sample.tfvars` is the "ShareGPT_v3_unflitered_cleaned_split".

You will set the `gcs_path` in your `terraform.tfvars` to this gcs path containing your prompts.
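For example (the bucket and file names below are placeholders, not values from this repository), you might upload a generated prompt file and point `gcs_path` at it like this:

```bash
# Upload the generated prompt file to a GCS bucket you control (names are illustrative).
gsutil cp prompts.txt gs://MY_BENCHMARK_BUCKET/prompts.txt
# Then, in terraform.tfvars:
#   gcs_path = "gs://MY_BENCHMARK_BUCKET/prompts.txt"
```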

@@ -100,10 +101,10 @@ gcloud artifacts repositories create ai-benchmark --location=us-central1 --repos

### Step 6: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `sample-terraform.tfvars` is provided as an example file. You can copy the file as a starting point. Note that at a minimum you will have to change the existing `credentials_config`, `project_id`, and `artifact_registry`.
Create a `terraform.tfvars` file. `./sample-tfvars/tgi-sample.tfvars` is provided as an example file. You can copy the file as a starting point. Note that at a minimum you will have to change the existing `credentials_config`, `project_id`, and `artifact_registry`.

```bash
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/tgi-sample.tfvars terraform.tfvars
```

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The following variables are required:
@@ -265,5 +266,6 @@ To change the benchmark configuration, you will have to rerun terraform destroy
| <a name="input_sax_model"></a> [sax\_model](#input\_sax\_model) | Benchmark server configuration for sax model. Only required if framework is sax. | `string` | `""` | no |
| <a name="input_tokenizer"></a> [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | yes |
| <a name="input_use_beam_search"></a> [use\_beam\_search](#input\_use\_beam\_search) | Benchmark server configuration for use beam search. | `bool` | `false` | no |
| <a name="huggingface_secret"></a> [huggingface_secret](#input\_huggingface_secret) | Name of the kubectl huggingface secret token | `string` | `huggingface-secret` | no |
| <a name="huggingface_secret"></a> [huggingface_secret](#input\_huggingface_secret) | Name of the secret holding the huggingface token. Stored in GCP Secret Manager. | `string` | `huggingface-secret` | no |
| <a name="k8s_hf_secret"></a> [k8s_hf_secret](#input\_k8s\_hf\_secret) | Name of the secret holding the huggingface token. Stored in K8s. Key is expected to be named: `HF_TOKEN`. See [here](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kubectl/#use-raw-data) for more. | `string` | `huggingface-secret` | no |
<!-- END_TF_DOCS -->
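As a rough sketch of what `k8s_hf_secret` refers to: a Kubernetes secret whose key is `HF_TOKEN` can be created with `kubectl`. The secret name `hf-token` and the token value below are placeholders; use whatever name you set `k8s_hf_secret` to.

```bash
# Create a Kubernetes secret with the key HF_TOKEN (name and token value are placeholders).
kubectl create secret generic hf-token \
  --from-literal=HF_TOKEN=<your-huggingface-access-token>
```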
1 change: 1 addition & 0 deletions benchmarks/benchmark/tools/locust-load-inference/main.tf
@@ -47,6 +47,7 @@ locals {
tokenizer = var.tokenizer
use_beam_search = var.use_beam_search
hugging_face_token_secret_list = local.hugging_face_token_secret == null ? [] : [local.hugging_face_token_secret]
k8s_hf_secret_list = var.k8s_hf_secret == null ? [] : [var.k8s_hf_secret]
stop_timeout = var.stop_timeout
request_type = var.request_type
})) : data]
@@ -48,6 +48,13 @@ spec:
        - name: USE_BEAM_SEARCH
          value: ${use_beam_search}
%{ for hugging_face_token_secret in hugging_face_token_secret_list ~}
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-key
              key: HF_TOKEN
%{ endfor ~}
%{ for hf_token in k8s_hf_secret_list ~}
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
@@ -0,0 +1,29 @@
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/PROJECT_NUMBER/locations/global/gkeMemberships/ai-benchmark"
}

project_id = "PROJECT_ID"

namespace = "default"
ksa = "benchmark-sa"
request_type = "grpc"

k8s_hf_secret = "hf-token"


# Locust service configuration
artifact_registry = "REGISTRY_LOCATION"
inference_server_service = "jetstream-svc:9000"
locust_runner_kubernetes_service_account = "sample-runner-sa"
output_bucket = "${PROJECT_ID}-benchmark-output-bucket-01"
gcs_path = "PATH_TO_PROMPT_BUCKET"

# Benchmark configuration for Locust Docker accessing inference server
inference_server_framework = "jetstream"
tokenizer = "google/gemma-7b"

# Benchmark configuration for triggering single test via Locust Runner
test_duration = 60
# Increase test_users to allow more parallelism (especially when testing HPA)
test_users = 1
test_rate = 5
10 changes: 9 additions & 1 deletion benchmarks/benchmark/tools/locust-load-inference/variables.tf
@@ -197,8 +197,16 @@ variable "run_test_automatically" {
default = false
}

// TODO: add validation to make k8s_hf_secret & hugging_face_secret mutually exclusive once terraform is updated with: https://discuss.hashicorp.com/t/experiment-feedback-input-variable-validation-can-cross-reference-other-objects/66644
variable "k8s_hf_secret" {
description = "Name of secret for huggingface token; stored in k8s "
type = string
nullable = true
default = null
}

variable "hugging_face_secret" {
description = "name of the kubectl huggingface secret token"
description = "name of the kubectl huggingface secret token; stored in Secret Manager. Security considerations: https://kubernetes.io/docs/concepts/security/secrets-good-practices/"
type = string
nullable = true
default = null
151 changes: 151 additions & 0 deletions benchmarks/inference-server/jetstream/README.md
@@ -0,0 +1,151 @@
# AI on GKE Benchmarking for JetStream

Deploying and benchmarking JetStream on TPU has many similarities with the standard GPU path, but the differences are distinct enough to warrant a separate README. If you are familiar with deploying on GPU, much of this should be familiar. For a more detailed explanation of each step, refer to our primary benchmarking [README](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/benchmarks).

## Pre-requisites
- [kaggle user/token](https://www.kaggle.com/docs/api)
- [huggingface user/token](https://huggingface.co/docs/hub/en/security-tokens)

### Creating K8s infra

To create our TPU cluster, run:

```
# Stage 1 creates the cluster.
cd infra/stage-1
# Copy the sample variables and update the project ID, cluster name and other
# parameters as needed in the `terraform.tfvars` file.
cp sample-tfvars/jetstream-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
# Run plan to see the changes that will be made.
terraform plan
# Run apply if the changes look good by confirming the prompt.
terraform apply
```
To verify that the cluster has been set up correctly, run
```
# Get credentials using fleet membership
gcloud container fleet memberships get-credentials <cluster-name>
# Run a kubectl command to verify
kubectl get nodes
```

## Configure the cluster

To configure the cluster to run inference workloads we need to set up workload identity and GCS Fuse.
```
# Stage 2 configures the cluster for running inference workloads.
cd infra/stage-2
# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# and the project name and bucket name parameters as needed in the
# `terraform.tfvars` file. You can specify a new bucket name in which case it
# will be created.
cp sample-tfvars/jetstream-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
# Run plan to see the changes that will be made.
terraform plan
# Run apply if the changes look good by confirming the prompt.
terraform apply
```

### Convert Gemma model weights to maxtext weights

JetStream has [two engine implementations](https://github.com/google/JetStream?tab=readme-ov-file#jetstream-engine-implementation): a Jax variant (via MaxText) and a Pytorch variant. This guide uses the Jax backend.

JetStream currently requires that models be converted to MaxText weights. This example deploys a Gemma-7b model. Much of this information is also covered in the guide [here](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-tpu-jetstream#convert-checkpoints).

*SKIP IF ALREADY COMPLETED*

Create kaggle secret
```
kubectl create secret generic kaggle-secret \
--from-file=kaggle.json
```

In `model-conversion/kaggle_converter.yaml`, replace `GEMMA_BUCKET_NAME` with the name of the bucket where you would like the model to be stored.
***NOTE***: If you are using a different bucket than the one you created, give the service account Storage Admin permissions on that bucket. This can be done in the UI or by running:
```
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
--role roles/storage.admin
```

Run:
```
kubectl apply -f model-conversion/kaggle_converter.yaml
```

This should take ~10 minutes to complete.
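If you want to check on the conversion, the job created by `kaggle_converter.yaml` is named `data-loader-7b`, so one way to watch its progress is:

```bash
# Check the job status and follow its logs until the checkpoint conversion finishes.
kubectl get job data-loader-7b
kubectl logs -f job/data-loader-7b
```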

### Deploy JetStream

In `jetstream.yaml`, replace `GEMMA_BUCKET_NAME` with the same bucket name as above.
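For example, one way to script that substitution (the bucket name `my-gemma-bucket` below is only a placeholder):

```bash
# Substitute the bucket-name placeholder in place (GNU sed syntax; on macOS use `sed -i ''`).
sed -i "s/GEMMA_BUCKET_NAME/my-gemma-bucket/g" jetstream.yaml
```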

Run:
```
kubectl apply -f jetstream.yaml
```

Verify the pod is running with
```
kubectl get pods
```

Get the external IP with:

```
kubectl get services
```
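If you prefer to capture the address in a shell variable (assuming the `jetstream-svc` Service above has been assigned an external IP), something like this works:

```bash
# Store the LoadBalancer IP of jetstream-svc for use in the request below.
JETSTREAM_EXTERNAL_IP=$(kubectl get service jetstream-svc \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "${JETSTREAM_EXTERNAL_IP}"
```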

You can then send a prompt request with:
```
curl --request POST \
--header "Content-type: application/json" \
-s \
JETSTREAM_EXTERNAL_IP:8000/generate \
--data \
'{
"prompt": "What is a TPU?",
"max_tokens": 200
}'
```

### Deploy the benchmark

To prepare the dataset for the Locust inference benchmark, view the README.md file in:
```
cd benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split
```

To deploy the Locust inference benchmark with the above model, run
```
cd benchmark/tools/locust-load-inference
# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# in the `terraform.tfvars` file.
cp sample-tfvars/jetstream-sample.tfvars terraform.tfvars
# Initialize the Terraform modules.
terraform init
# Run plan to see the changes that will be made.
terraform plan
# Run apply if the changes look good by confirming the prompt.
terraform apply
```

To further interact with the Locust inference benchmark, view the README.md file in `benchmark/tools/locust-load-inference`
63 changes: 63 additions & 0 deletions benchmarks/inference-server/jetstream/jetstream.yaml
@@ -0,0 +1,63 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: maxengine-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: maxengine-server
  template:
    metadata:
      labels:
        app: maxengine-server
    spec:
      serviceAccountName: benchmark-sa
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x2
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: maxengine-server
        image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
        args:
        - model_name=gemma-7b
        - tokenizer_path=assets/tokenizer.gemma
        - per_device_batch_size=4
        - max_prefill_predict_length=1024
        - max_target_length=2048
        - async_checkpointing=false
        - ici_fsdp_parallelism=1
        - ici_autoregressive_parallelism=-1
        - ici_tensor_parallelism=1
        - scan_layers=false
        - weight_dtype=bfloat16
        - load_parameters_path=gs://GEMMA_BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
        ports:
        - containerPort: 9000
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4
      - name: jetstream-http
        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.0
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: jetstream-svc
spec:
  selector:
    app: maxengine-server
  ports:
  - protocol: TCP
    name: http
    port: 8000
    targetPort: 8000
  - protocol: TCP
    name: grpc
    port: 9000
    targetPort: 9000
  type: LoadBalancer
@@ -0,0 +1,33 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: data-loader-7b
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      serviceAccountName: benchmark-sa
      restartPolicy: Never
      containers:
      - name: inference-checkpoint
        image: us-docker.pkg.dev/cloud-tpu-images/inference/inference-checkpoint:v0.2.0
        args:
        - -b=GEMMA_BUCKET_NAME
        - -m=google/gemma/maxtext/7b-it/2
        volumeMounts:
        - mountPath: "/kaggle/"
          name: kaggle-credentials
          readOnly: true
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x2
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      volumes:
      - name: kaggle-credentials
        secret:
          defaultMode: 0400
          secretName: kaggle-secret