Skip to content

Commit

Permalink
Add latency profile generator (#775)
Browse files Browse the repository at this point in the history
* Add latency profile generator

This change adds a new benchmarking suite called
latency-profile-generator which runs serving benchmarks at different
request rates to produce latency and throughput numbers at different
QPS. This can be used to identify how different models and model servers
perform depending on incoming traffic.

* Update readme and GCS push steps

* first commit

* remove profiling

* correct steps

* correct steps

* jetstream option for backend arg

* extra parameters

* configurable pipeline starting point, request rates configurable

* WIP changes/reversions

* setting for building latency profiler image

* onoly build once for profile-generator

* fmt

* fmt

* Move deploy model server from profile-generator to latency-profile

* fix kubectl wait

* dmt

* intermediate changes

* Update table of contents

* Fix lint issues

* remove specific project id

* remove artifact_registry

* Stripped back LPG automation for separate PR

* typo

* nit

* remove bad depends_on

* nits

* nits

* Update main.tf

* Update README.md

* more cleanup

* move latency-profile module to subdirectory

* fmt

* supports jetstream

* Added comment

* more accurate comment

* readd container folder

---------

Co-authored-by: Brendan Slabe <[email protected]>
Co-authored-by: Brendan Slabe <[email protected]>
  • Loading branch information
3 people authored Aug 23, 2024
1 parent 5796d4b commit 2c2fec4
Show file tree
Hide file tree
Showing 13 changed files with 1,327 additions and 0 deletions.
180 changes: 180 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,180 @@
# AI on GKE Benchmark Latency Profile Generator

<!-- TOC -->
* [AI on GKE Benchmark Latency Profile Generator](#ai-on-gke-benchmark-latency-profile-generator)
* [Overview](#overview)
* [Instructions](#instructions)
* [Step 1: create output bucket](#step-1--create-output-bucket)
* [Step 2: create and give service account access to write to output gcs bucket](#step-2--create-and-give-service-account-access-to-write-to-output-gcs-bucket)
* [Step 3: create artifact repository for automated Latency Profile Generator docker build](#step-3--create-artifact-repository-for-automated-latency-profile-generator-docker-build)
* [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
* [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
* [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
* [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
* [Inputs](#inputs)
<!-- TOC -->

## Overview

This deploys the latency profile generator which measures the throuhghput and
latency at various request rates for the model and model server of your choice.

It currently supports the following frameworks:
- tensorrt_llm_triton
- text generation inference (tgi)
- vllm
- sax
-jetstream

## Instructions

### Step 1: create output bucket

If you followed steps from `../../infra/` for creating your cluster and extra
resources, you will already have an output bucket created for you.
If not, you will have to create and manage your own gcs bucket for storing
benchmarking results.

Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.

### Step 2: create and give service account access to write to output gcs bucket

The Latency profile generator requires storage.admin access to write output to
the given output gcs bucket. If you followed steps in `../../infra`, then you
already be logged into gcloud have a kubernetes and gcloud service account
created that has the proper access to the created output bucket. If you are
not logged into gcloud, run the following:

```bash
gcloud auth application-default login
```

To give viewer permissions on the gcs bucket to the gcloud service account,
run the following:

```
gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/
--member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
```

Your kubernetes service account will inherit the reader permissions.

You will set the `latency_profile_kubernetes_service_account` in your
`terraform.tfvars` to the kubernetes service account name.

### Step 3: create artifact repository for automated Latency Profile Generator docker build

The latency profile generator rebuilds the docker file on each terraform apply
if `build_latency_profile_generator_image` is set to true (default is true).
The containers will be pushed to the given `artifact_registry`. This artifact
repository is expected to already exist. If you created your cluster via
`../../infra/`, then an artifact repository was created for you with the same
name as the prefix in the same location as the cluster. You can also create your
own via this command:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```

### Step 4: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
file. You can copy the file as a starting point.
Note that at a minimum you will have to change the existing
`credentials_config`, `project_id`, and `artifact_registry`.

```bash
cp ./sample-tfvars terraform.tfvars
```

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The following variables are required:
- `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
- `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
- `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
- `build_latency_profile_generator_image` - Whether latency profile generator image will be built or not
- `targets` - Which model servers are we targeting for benchmarking? Set the fields on `manual` if intending to benchmark a model server already in the cluster.
- `output_bucket` - gcs bucket to write benchmarking metrics to.
- `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
- `k8s_hf_secret` - Name of secret for huggingface token stored in k8s

#### [optional] set-up credentials config with kubeconfig

If your cluster has fleet management enabled, the existing `credentials_config`
can use the fleet host credentials like this:

```bash
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```

If your cluster does not have fleet management enabled, you can use your
cluster's kubeconfig in the `credentials_config`. You must isolate your
cluster's kubeconfig from other clusters in the default kube.config file.
To do this, run the following command:

```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

Then update your `terraform.tfvars` `credentials_config` to the following:

```bash
credentials_config = {
kubeconfig = {
path = "~/.kube/${CLUSTER_NAME}-kube.config"
}
}
```

#### [optional] set up secret token in Secret Manager

A model may require a security token to access it. For example, Llama2 from
HuggingFace is a gated model that requires a
[user access token](https://huggingface.co/docs/hub/en/security-tokens). If the
model you want to run does not require this, skip this step.

If you followed steps from `.../../infra/`, Secret Manager and the user access
token should already be set up. If not, it is strongly recommended that you use
Workload Identity and Secret Manager to access the user access tokens to avoid
adding a plain text token into the terraform state. To do so, follow the
instructions for
[setting up a secret in Secret Manager here](https://cloud.google.com/kubernetes-engine/docs/tutorials/workload-identity-secrets).

Once complete, you should add these related secret values to your
`terraform.tfvars`:

```bash
# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
# ex. 1
hugging_face_secret_version = $SECRET_VERSION
```

### Step 5: login to gcloud

Run the following gcloud command for authorization:

```bash
gcloud auth application-default login
```

### Step 6: terraform initialize, plan and apply

Run the following terraform commands:

```bash
# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
```

The results can be viewed via running the following:
```
kubectl logs job/latency-profile-generator
```
8 changes: 8 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/build.tf
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
resource "null_resource" "build_and_push_image" {
count = var.build_latency_profile_generator_image ? 1 : 0
depends_on = [resource.google_project_service.cloudbuild]
provisioner "local-exec" {
working_dir = path.module
command = "gcloud builds submit --tag ${var.artifact_registry}/latency-profile:latest container"
}
}
21 changes: 21 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/container/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev

RUN apt-get update -y \
&& apt-get install -y python3-pip git vim curl wget
RUN pip3 install --upgrade pip
RUN pip install packaging torch transformers
WORKDIR /workspace

# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

RUN pip install -U "huggingface_hub[cli]"

RUN wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

COPY benchmark_serving.py benchmark_serving.py
COPY latency_throughput_curve.sh latency_throughput_curve.sh

RUN chmod +x latency_throughput_curve.sh
RUN chmod +x benchmark_serving.py
Loading

0 comments on commit 2c2fec4

Please sign in to comment.