Add latency profile generator #775

Merged
merged 37 commits into from
Aug 23, 2024
a6677e5
Add latency profile generator
achandrasekar Aug 12, 2024
022ccc8
Update readme and GCS push steps
achandrasekar Aug 12, 2024
d614688
first commit
Bslabe123 Aug 13, 2024
c7f36ab
remove profiling
Bslabe123 Aug 13, 2024
d5d983e
correct steps
Bslabe123 Aug 13, 2024
312fba6
correct steps
Bslabe123 Aug 13, 2024
8973004
jetstream option for backend arg
Bslabe123 Aug 13, 2024
3e62239
extra parameters
Bslabe123 Aug 14, 2024
44143e7
configurable pipeline starting point, request rates configurable
Bslabe123 Aug 15, 2024
91fe581
WIP changes/reversions
Bslabe123 Aug 16, 2024
545531d
setting for building latency profiler image
Bslabe123 Aug 16, 2024
8f2ea1c
onoly build once for profile-generator
Bslabe123 Aug 16, 2024
29556cb
fmt
Bslabe123 Aug 16, 2024
6a77639
fmt
Bslabe123 Aug 16, 2024
6b15b61
Move deploy model server from profile-generator to latency-profile
Bslabe123 Aug 19, 2024
a295aa9
fix kubectl wait
Bslabe123 Aug 19, 2024
f4e76b2
dmt
Bslabe123 Aug 19, 2024
ccc512f
intermediate changes
Bslabe123 Aug 20, 2024
b720fb3
Update table of contents
achandrasekar Aug 21, 2024
49f099b
Fix lint issues
achandrasekar Aug 21, 2024
d5062bc
remove specific project id
Bslabe123 Aug 21, 2024
6b329bf
remove artifact_registry
Bslabe123 Aug 21, 2024
051b310
Stripped back LPG automation for separate PR
Bslabe123 Aug 22, 2024
3353eab
typo
Bslabe123 Aug 22, 2024
6f7e364
nit
Bslabe123 Aug 22, 2024
f73d4a6
remove bad depends_on
Bslabe123 Aug 22, 2024
e17f4c1
nits
Bslabe123 Aug 22, 2024
6a59803
nits
Bslabe123 Aug 22, 2024
99907c6
Update main.tf
Bslabe123 Aug 23, 2024
b54c7a9
Update README.md
Bslabe123 Aug 23, 2024
2618042
more cleanup
Bslabe123 Aug 23, 2024
7267d30
move latency-profile module to subdirectory
Bslabe123 Aug 23, 2024
569ef6d
fmt
Bslabe123 Aug 23, 2024
93682d3
supports jetstream
Bslabe123 Aug 23, 2024
b1a6b7b
Added comment
Bslabe123 Aug 23, 2024
2ab541f
more accurate comment
Bslabe123 Aug 23, 2024
f1d56d6
readd container folder
Bslabe123 Aug 23, 2024
180 changes: 180 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/README.md
@@ -0,0 +1,180 @@
# AI on GKE Benchmark Latency Profile Generator

<!-- TOC -->
* [AI on GKE Benchmark Latency Profile Generator](#ai-on-gke-benchmark-latency-profile-generator)
  * [Overview](#overview)
  * [Instructions](#instructions)
    * [Step 1: create output bucket](#step-1--create-output-bucket)
    * [Step 2: create and give service account access to write to output gcs bucket](#step-2--create-and-give-service-account-access-to-write-to-output-gcs-bucket)
    * [Step 3: create artifact repository for automated Latency Profile Generator docker build](#step-3--create-artifact-repository-for-automated-latency-profile-generator-docker-build)
    * [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
      * [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
      * [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
    * [Step 5: login to gcloud](#step-5--login-to-gcloud)
    * [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
  * [Inputs](#inputs)
<!-- TOC -->

## Overview

This deploys the latency profile generator, which measures the throughput and
latency at various request rates for the model and model server of your choice.

It currently supports the following frameworks:
- tensorrt_llm_triton
- text generation inference (tgi)
- vllm
- sax
- jetstream

## Instructions

### Step 1: create output bucket

If you followed steps from `../../infra/` for creating your cluster and extra
resources, you will already have an output bucket created for you.
If not, you will have to create and manage your own gcs bucket for storing
benchmarking results.

Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.
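
If you need to create the bucket yourself, the following is a minimal sketch. The bucket name and location are placeholder values, and `bucket_create_cmd` is a hypothetical helper that is not part of this repository. It only prints the `gcloud storage buckets create` command so that you can review it before running it.

```bash
# Hypothetical helper: print the command that creates the output bucket.
# $1 = bucket name (placeholder), $2 = bucket location (placeholder).
bucket_create_cmd() {
  printf 'gcloud storage buckets create gs://%s --location=%s\n' "$1" "$2"
}

# Example values are assumptions; substitute your own.
OUTPUT_BUCKET="my-benchmark-output"
BUCKET_LOCATION="us-central1"

# Review the printed command, then run it yourself.
bucket_create_cmd "$OUTPUT_BUCKET" "$BUCKET_LOCATION"
```

The same bucket name, without the `gs://` prefix handling shown here being required by the tool, is what you would then reference from `output_bucket`.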

### Step 2: create and give service account access to write to output gcs bucket

The latency profile generator requires `roles/storage.admin` access to write
output to the given output gcs bucket. If you followed the steps in
`../../infra`, you should already be logged into gcloud and have a kubernetes
and gcloud service account with the proper access to the created output
bucket. If you are not logged into gcloud, run the following:

```bash
gcloud auth application-default login
```

To grant the `roles/storage.admin` role on the gcs bucket to the gcloud service
account, run the following:

```bash
gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/ \
  --member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
```

Your kubernetes service account will inherit these permissions.

You will set the `latency_profile_kubernetes_service_account` in your
`terraform.tfvars` to the kubernetes service account name.
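
To confirm the binding took effect, you can inspect the bucket's IAM policy. The sketch below uses two hypothetical helpers (`sa_member` and `iam_check_cmd`) and placeholder account, project, and bucket names. It prints the member string and a `gcloud storage buckets get-iam-policy` command rather than running anything.

```bash
# Hypothetical helper: build the IAM member string used in the binding.
# $1 = Google service account name, $2 = project ID (both placeholders).
sa_member() {
  printf 'serviceAccount:%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Hypothetical helper: print a command that lists the bucket's IAM policy,
# so you can check that the member holds roles/storage.admin.
# $1 = bucket name (placeholder).
iam_check_cmd() {
  printf 'gcloud storage buckets get-iam-policy gs://%s\n' "$1"
}

sa_member "benchmark-sa" "my-project"
iam_check_cmd "my-benchmark-output"
```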

### Step 3: create artifact repository for automated Latency Profile Generator docker build

The latency profile generator rebuilds the Docker image on each terraform apply
if `build_latency_profile_generator_image` is set to true (default is true).
The containers will be pushed to the given `artifact_registry`. This artifact
repository is expected to already exist. If you created your cluster via
`../../infra/`, then an artifact repository was created for you with the same
name as the prefix in the same location as the cluster. You can also create your
own via this command:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```
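
Because `build.tf` tags the image as `${var.artifact_registry}/latency-profile:latest`, the `artifact_registry` value presumably needs to be the full Docker repository path rather than just the repository name. The sketch below assumes the standard Artifact Registry form `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY`. The helper names and the project ID are hypothetical.

```bash
# Hypothetical helper: build the Docker repository path for artifact_registry.
# $1 = location, $2 = project ID, $3 = repository name (all placeholders).
registry_path() {
  printf '%s-docker.pkg.dev/%s/%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: the image tag build.tf would push for a registry path.
image_tag() {
  printf '%s/latency-profile:latest\n' "$1"
}

# Example values are assumptions; substitute your own.
REGISTRY="$(registry_path us-central1 my-project ai-benchmark)"
echo "artifact_registry = \"$REGISTRY\""
image_tag "$REGISTRY"
```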

### Step 4: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
file. You can copy the file as a starting point.
Note that at a minimum you will have to change the existing
`credentials_config`, `project_id`, and `artifact_registry`.

```bash
cp ./sample-tfvars terraform.tfvars
```

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The following variables are required:
- `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
- `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
- `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
- `build_latency_profile_generator_image` - Whether latency profile generator image will be built or not
- `targets` - The model servers to benchmark. Set the fields under `manual` to benchmark a model server already running in the cluster.
- `output_bucket` - gcs bucket to write benchmarking metrics to.
- `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
- `k8s_hf_secret` - Name of secret for huggingface token stored in k8s
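
Before running terraform, it can help to check that every required variable listed above is assigned. This is a rough sketch: `missing_tfvars` is a hypothetical helper, and it only looks for lines of the form `name = ...`, so it does not validate the values themselves.

```bash
# Hypothetical helper: report required variables missing from a tfvars file.
# $1 = path to the tfvars file. Prints one missing name per line.
missing_tfvars() {
  for name in credentials_config project_id artifact_registry \
    build_latency_profile_generator_image targets output_bucket \
    latency_profile_kubernetes_service_account k8s_hf_secret; do
    # A variable counts as set if a line starts with "<name> =".
    grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$1" || printf '%s\n' "$name"
  done
}

# Prints nothing when every required variable is assigned.
if [ -f terraform.tfvars ]; then
  missing_tfvars terraform.tfvars
fi
```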

#### [optional] set-up credentials config with kubeconfig

If your cluster has fleet management enabled, the existing `credentials_config`
can use the fleet host credentials like this:

```bash
credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```
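
Note that `fleet_host` takes the project number, not the project ID. As a small sketch, the hypothetical helper `fleet_host` below assembles the URL from placeholder values so that you can paste the result into `terraform.tfvars`.

```bash
# Hypothetical helper: build the Connect gateway URL for credentials_config.
# $1 = project number (not project ID), $2 = cluster name (placeholders).
fleet_host() {
  printf 'https://connectgateway.googleapis.com/v1/projects/%s/locations/global/gkeMemberships/%s\n' "$1" "$2"
}

# The project number can be looked up with:
#   gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)'
fleet_host "123456789012" "my-cluster"
```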

If your cluster does not have fleet management enabled, you can use your
cluster's kubeconfig in the `credentials_config`. You must isolate your
cluster's kubeconfig from other clusters in the default kube.config file.
To do this, run the following command:

```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

Then update your `terraform.tfvars` `credentials_config` to the following:

```bash
credentials_config = {
  kubeconfig = {
    path = "~/.kube/${CLUSTER_NAME}-kube.config"
  }
}
```
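
A tfvars file does not expand shell variables, so `${CLUSTER_NAME}` in the snippet above has to be replaced with the actual cluster name. One way to do that, sketched with a hypothetical helper and a placeholder cluster name, is to render the block from the shell:

```bash
# Hypothetical helper: print a credentials_config block with the cluster name
# already substituted into the kubeconfig path.
# $1 = cluster name (placeholder).
kubeconfig_credentials() {
  cat <<EOF
credentials_config = {
  kubeconfig = {
    path = "~/.kube/$1-kube.config"
  }
}
EOF
}

kubeconfig_credentials "my-cluster"
```

You can check that the isolated kubeconfig points at the intended cluster with `KUBECONFIG=~/.kube/my-cluster-kube.config kubectl config current-context`.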

#### [optional] set up secret token in Secret Manager

A model may require a security token to access it. For example, Llama2 from
HuggingFace is a gated model that requires a
[user access token](https://huggingface.co/docs/hub/en/security-tokens). If the
model you want to run does not require this, skip this step.

If you followed steps from `../../infra/`, Secret Manager and the user access
token should already be set up. If not, it is strongly recommended that you use
Workload Identity and Secret Manager to access the user access tokens to avoid
adding a plain text token into the terraform state. To do so, follow the
instructions for
[setting up a secret in Secret Manager here](https://cloud.google.com/kubernetes-engine/docs/tutorials/workload-identity-secrets).

Once complete, you should add these related secret values to your
`terraform.tfvars`:

```bash
# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
# ex. 1
hugging_face_secret_version = $SECRET_VERSION
```
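
The secret ID follows the `projects/<project>/secrets/<name>` form shown in the example comment above. As a sketch, the hypothetical helper `secret_id` below builds that string from placeholder values and prints it as a quoted tfvars assignment. The version line is left out because its expected type is not stated here.

```bash
# Hypothetical helper: build a Secret Manager secret ID.
# $1 = project ID, $2 = secret name (both placeholders).
secret_id() {
  printf 'projects/%s/secrets/%s\n' "$1" "$2"
}

# Example values are assumptions; substitute your own.
SECRET_ID="$(secret_id sample-project hugging_face_secret)"
echo "hugging_face_secret = \"$SECRET_ID\""
```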

### Step 5: login to gcloud

Run the following gcloud command for authorization:

```bash
gcloud auth application-default login
```

### Step 6: terraform initialize, plan and apply

Run the following terraform commands:

```bash
# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
```

The results can be viewed via running the following:
```
kubectl logs job/latency-profile-generator
```
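
The job can take a while, so you may want to wait for it to complete before reading the logs, and then list what was written to the output bucket. The sketch below uses a hypothetical helper, `results_cmds`, that prints the commands rather than running them. The namespace, timeout, and bucket name are assumptions, and the layout of the objects in the bucket is not described here.

```bash
# Hypothetical helper: print commands for waiting on the job, reading its
# logs, and listing the output bucket.
# $1 = namespace, $2 = timeout (e.g. 30m), $3 = bucket name (placeholders).
results_cmds() {
  printf 'kubectl wait --for=condition=complete job/latency-profile-generator --namespace=%s --timeout=%s\n' "$1" "$2"
  printf 'kubectl logs job/latency-profile-generator --namespace=%s\n' "$1"
  printf 'gcloud storage ls gs://%s\n' "$3"
}

results_cmds "default" "30m" "my-benchmark-output"
```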
8 changes: 8 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/build.tf
@@ -0,0 +1,8 @@
resource "null_resource" "build_and_push_image" {
  count      = var.build_latency_profile_generator_image ? 1 : 0
  depends_on = [resource.google_project_service.cloudbuild]
  provisioner "local-exec" {
    working_dir = path.module
    command     = "gcloud builds submit --tag ${var.artifact_registry}/latency-profile:latest container"
  }
}
@@ -0,0 +1,21 @@
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev

RUN apt-get update -y \
&& apt-get install -y python3-pip git vim curl wget
RUN pip3 install --upgrade pip
RUN pip install packaging torch transformers
WORKDIR /workspace

# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

RUN pip install -U "huggingface_hub[cli]"

RUN wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

COPY benchmark_serving.py benchmark_serving.py
COPY latency_throughput_curve.sh latency_throughput_curve.sh

RUN chmod +x latency_throughput_curve.sh
RUN chmod +x benchmark_serving.py