
Commit

updating branch with main
german-grandas committed Aug 30, 2024
2 parents c8e5d35 + 203ddbc commit c437736
Showing 27 changed files with 1,675 additions and 26 deletions.
14 changes: 14 additions & 0 deletions applications/rag/main.tf
@@ -325,3 +325,17 @@ module "frontend" {
members_allowlist = var.frontend_members_allowlist != "" ? split(",", var.frontend_members_allowlist) : []
depends_on = [module.namespace]
}

resource "helm_release" "gmp-apps" {
  name      = "gmp-apps"
  provider  = helm.rag
  chart     = "../../charts/gmp-engine/"
  namespace = local.kubernetes_namespace
  # Timeout is increased to guarantee sufficient scale-up time for Autopilot nodes.
  timeout    = 1200
  depends_on = [module.inference-server, module.frontend]
  values = [
    "${file("${path.module}/podmonitoring.yaml")}"
  ]
}
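
To confirm the release applied its scrape configs, you can list the Google Managed Prometheus `PodMonitoring` resources after `terraform apply`; a sketch, assuming the application namespace resolves to `rag` (substitute whatever `local.kubernetes_namespace` evaluates to):

```sh
# List the PodMonitoring scrape configs installed by the gmp-apps release.
kubectl get podmonitorings.monitoring.googleapis.com -n rag
```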

40 changes: 26 additions & 14 deletions benchmarks/README.md
@@ -15,13 +15,13 @@ via these terraform scripts on a Standard cluster that you've created yourself.
This tutorial assumes you have access to the Google Cloud Storage APIs via Application Default Credentials (ADC).
To log in, you can run the following:

```sh
gcloud auth application-default login
```
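
If you are unsure whether ADC is already set up, a quick optional check is to request a token:

```sh
# Prints a token if ADC is configured; errors out otherwise.
gcloud auth application-default print-access-token > /dev/null && echo "ADC is configured"
```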

### Terraform

Install Terraform by following the documentation at <https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli>.
This requires a minimum Terraform version of 1.7.4.
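
To confirm your installed version meets that minimum:

```sh
# Should report v1.7.4 or newer.
terraform version
```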

### python
@@ -34,10 +34,11 @@ You may need to run pip install -r benchmark/dataset/ShareGPT_v3_unflitered_clea

This section goes over an end-to-end example of deploying and benchmarking the Falcon 7b model using [TGI](https://huggingface.co/docs/text-generation-inference/en/index) on a Standard GKE Cluster with GPUs.

Each step below has more details in its respective directory's README.md. It is recommended
that you read through the available options at least once when testing your own models.

At a high level, running an inference benchmark in GKE involves these five steps:

1. Create the cluster
2. Configure the cluster
3. Deploy the inference server
@@ -51,7 +52,8 @@ Set up the infrastructure by creating a GKE cluster with appropriate accelerator
configuration.

To create a GPU cluster, in the ai-on-gke/benchmarks folder run:

```sh
# Stage 1 creates the cluster.
cd infra/stage-1

@@ -68,8 +70,10 @@ terraform plan
# Run apply if the changes look good by confirming the prompt.
terraform apply
```

To verify that the cluster has been set up correctly, run

```sh
# Get credentials using fleet membership
gcloud container fleet memberships get-credentials <cluster-name> --project <project-id>

@@ -80,7 +84,8 @@ kubectl get nodes
### 2. Configure the cluster

To configure the cluster to run inference workloads we need to set up workload identity, GCS Fuse and DCGM for GPU metrics. In the ai-on-gke/benchmarks folder run:

```sh
# Stage 2 configures the cluster for running inference workloads.
cd infra/stage-2

@@ -104,7 +109,8 @@ terraform apply
### 3. Deploy the inference server

To deploy TGI with a sample model, in the ai-on-gke/benchmarks folder run:

```sh
# text-generation-inference is the inference workload we'll deploy.
cd inference-server/text-generation-inference

@@ -125,17 +131,20 @@ terraform apply
```

It may take a minute or two for the inference server to be ready to serve. To verify that the model is running, you can run:

```sh
kubectl get deployment -n benchmark
```

This will show the status of the running TGI server.
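
You can also send a single request to the model server to confirm it responds. The sketch below assumes the TGI Service in the `benchmark` namespace is named `tgi` and serves on port 80; check `kubectl get service -n benchmark` for the actual name and port:

```sh
# Forward a local port to the TGI service, then hit TGI's /generate endpoint once.
kubectl port-forward -n benchmark service/tgi 8080:80 &
curl -s http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is GKE?", "parameters": {"max_new_tokens": 50}}'
```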

### 4. Deploy the benchmark

#### Prepare the benchmark dataset

To prepare the dataset for the Locust inference benchmark, in the ai-on-gke/benchmarks folder run:

```sh
# This folder contains a script that prepares the prompts for ShareGPT_v3_unflitered_cleaned_split dataset
# that works out of the box with the locust benchmarking expected format.
cd benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split
@@ -147,7 +156,8 @@ python3 upload_sharegpt.py --gcs_path="gs://${PROJECT_ID}-ai-gke-benchmark-fuse/
#### Deploy the benchmarking tool

To deploy the Locust inference benchmark with the above model, in the ai-on-gke/benchmarks folder run:

```sh
# This folder contains the benchmark tool that generates requests for your workload
cd benchmark/tools/locust-load-inference

@@ -173,7 +183,7 @@ To further interact with the Locust inference benchmark, view the README.md file

An end-to-end Locust benchmark that runs for a given amount of time can be triggered via a curl command to the Locust Runner service:

```sh
# get the locust runner endpoint
kubectl get service -n benchmark locust-runner-api

@@ -182,7 +192,8 @@ curl -XGET http://$RUNNER_ENDPOINT_IP:8000/run
```
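
If the Locust Runner service is exposed through a LoadBalancer, its IP can be captured directly into the variable used above; a sketch, assuming an external IP has been assigned:

```sh
# Grab the external IP of the locust-runner-api service and trigger a run.
RUNNER_ENDPOINT_IP=$(kubectl get service -n benchmark locust-runner-api \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -XGET "http://$RUNNER_ENDPOINT_IP:8000/run"
```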

A results file will appear in the GCS bucket specified as `output_bucket` in the input variables once the benchmark is completed. Metrics and Locust statistics are visible under the [Cloud Monitoring metrics explorer](http://pantheon.corp.google.com/monitoring/metrics-explorer). In the ai-on-gke/benchmarks/benchmark/tools/locust-load-inference folder, run the following command to create a sample custom dashboard for the example above:

```sh
# apply the sample dashboard to easily view and explore metrics
gcloud monitoring dashboards create --config-from-file ./sample-dashboards/tgi-dashboard.yaml
```
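
The raw results file can also be pulled straight from the output bucket; a sketch, assuming `$OUTPUT_BUCKET` holds the same value as the `output_bucket` input variable (the exact file name depends on the run):

```sh
# List benchmark outputs and copy one locally for inspection.
gcloud storage ls gs://$OUTPUT_BUCKET/
gcloud storage cp gs://$OUTPUT_BUCKET/<results-file> .
```
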
@@ -192,9 +203,10 @@ View the results in the [Cloud Monitoring Dashboards](https://pantheon.corp.goog
For more ways to interact with the locust benchmarking tooling, see the instructions in the [locust-load-inference README.md here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/benchmarks/benchmark/tools/locust-load-inference/README.md#step-9-start-an-end-to-end-benchmark).

### 6. Clean Up

To clean up the above setup, in the ai-on-gke/benchmarks folder run:

```sh
# Run destroy on locust load generator
cd benchmark/tools/locust-load-inference
terraform destroy
@@ -215,4 +227,4 @@ terraform destroy
# Run destroy on infra/stage-1 resources
cd ../stage-1
terraform destroy
```
@@ -95,7 +95,7 @@ def main(gcs_path: str, overwrite: bool):
    parser.add_argument('--overwrite', default=False,
                        action=argparse.BooleanOptionalAction)
    args = parser.parse_args()
    gcs_uri_pattern = "^gs:\\/\\/[a-z0-9.\\-_]{3,63}\\/(.+\\/)*(.+)$"
    if not re.match(gcs_uri_pattern, args.gcs_path):
        raise ValueError(
            f"Invalid GCS path: {args.gcs_path}, expecting format \"gs://$BUCKET/$FILENAME\"")
@@ -7,7 +7,9 @@ project_id = "$PROJECT_ID"
namespace = "benchmark"
ksa = "benchmark-ksa"

# This is needed for loading the gated models on HF
# k8s_hf_secret = "hf-token"

# Locust service configuration
artifact_registry = "us-central1-docker.pkg.dev/$PROJECT_ID/ai-benchmark"
180 changes: 180 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/README.md
@@ -0,0 +1,180 @@
# AI on GKE Benchmark Latency Profile Generator

<!-- TOC -->
* [AI on GKE Benchmark Latency Profile Generator](#ai-on-gke-benchmark-latency-profile-generator)
  * [Overview](#overview)
  * [Instructions](#instructions)
    * [Step 1: create output bucket](#step-1--create-output-bucket)
    * [Step 2: create and give service account access to write to output gcs bucket](#step-2--create-and-give-service-account-access-to-write-to-output-gcs-bucket)
    * [Step 3: create artifact repository for automated Latency Profile Generator docker build](#step-3--create-artifact-repository-for-automated-latency-profile-generator-docker-build)
    * [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
      * [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
      * [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
    * [Step 5: login to gcloud](#step-5--login-to-gcloud)
    * [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
  * [Inputs](#inputs)
<!-- TOC -->

## Overview

This deploys the latency profile generator, which measures the throughput and
latency at various request rates for the model and model server of your choice.

It currently supports the following frameworks:
- tensorrt_llm_triton
- text generation inference (tgi)
- vllm
- sax
- jetstream

## Instructions

### Step 1: create output bucket

If you followed steps from `../../infra/` for creating your cluster and extra
resources, you will already have an output bucket created for you.
If not, you will have to create and manage your own gcs bucket for storing
benchmarking results.

Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.
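
If you do need to create the bucket yourself, a minimal sketch (bucket name and location are placeholders; pick whatever matches your project):

```bash
# Create a regional bucket to hold benchmark output.
gcloud storage buckets create gs://$OUTPUT_BUCKET --project=$PROJECT_ID --location=us-central1
```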

### Step 2: create and give service account access to write to output gcs bucket

The Latency Profile Generator requires storage.admin access to write output to
the given output gcs bucket. If you followed the steps in `../../infra`, then you
should already be logged into gcloud and have a Kubernetes and gcloud service
account created with the proper access to the created output bucket. If you are
not logged into gcloud, run the following:

```bash
gcloud auth application-default login
```

To give storage.admin permissions on the gcs bucket to the gcloud service account,
run the following:

```bash
gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/ \
  --member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
```

Your Kubernetes service account will inherit these permissions.
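
To double-check that the binding took effect, you can inspect the bucket's IAM policy; a sketch:

```bash
# The service account should appear under roles/storage.admin.
gcloud storage buckets get-iam-policy gs://$OUTPUT_BUCKET/ --format=json
```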

You will set the `latency_profile_kubernetes_service_account` in your
`terraform.tfvars` to the kubernetes service account name.

### Step 3: create artifact repository for automated Latency Profile Generator docker build

The latency profile generator rebuilds the docker image on each terraform apply
if `build_latency_profile_generator_image` is set to true (the default).
The containers will be pushed to the given `artifact_registry`. This artifact
repository is expected to already exist. If you created your cluster via
`../../infra/`, then an artifact repository was created for you with the same
name as the prefix in the same location as the cluster. You can also create your
own via this command:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```
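
After the first `terraform apply` has run the Cloud Build job, you can confirm the image landed in the repository; a sketch, assuming `$ARTIFACT_REGISTRY` holds the same value as the `artifact_registry` variable:

```bash
# Lists images (and their tags) in the benchmark repository.
gcloud artifacts docker images list $ARTIFACT_REGISTRY --include-tags
```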

### Step 4: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
file. You can copy the file as a starting point.
Note that at a minimum you will have to change the existing
`credentials_config`, `project_id`, and `artifact_registry`.

```bash
cp ./sample-tfvars terraform.tfvars
```

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The following variables are required:
- `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
- `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
- `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
- `build_latency_profile_generator_image` - whether the latency profile generator image will be built
- `targets` - the model servers to target for benchmarking; set the fields under `manual` if intending to benchmark a model server already running in the cluster
- `output_bucket` - gcs bucket to write benchmarking metrics to.
- `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
- `k8s_hf_secret` - Name of secret for huggingface token stored in k8s

#### [optional] set-up credentials config with kubeconfig

If your cluster has fleet management enabled, the existing `credentials_config`
can use the fleet host credentials like this:

```bash
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```

If your cluster does not have fleet management enabled, you can use your
cluster's kubeconfig in the `credentials_config`. You must isolate your
cluster's kubeconfig from other clusters in the default kube.config file.
To do this, run the following command:

```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

Then update your `terraform.tfvars` `credentials_config` to the following:

```bash
credentials_config = {
kubeconfig = {
path = "~/.kube/${CLUSTER_NAME}-kube.config"
}
}
```

#### [optional] set up secret token in Secret Manager

A model may require a security token to access it. For example, Llama2 from
HuggingFace is a gated model that requires a
[user access token](https://huggingface.co/docs/hub/en/security-tokens). If the
model you want to run does not require this, skip this step.

If you followed steps from `../../infra/`, Secret Manager and the user access
token should already be set up. If not, it is strongly recommended that you use
Workload Identity and Secret Manager to access the user access tokens to avoid
adding a plain text token into the terraform state. To do so, follow the
instructions for
[setting up a secret in Secret Manager here](https://cloud.google.com/kubernetes-engine/docs/tutorials/workload-identity-secrets).

Once complete, you should add these related secret values to your
`terraform.tfvars`:

```bash
# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
# ex. 1
hugging_face_secret_version = $SECRET_VERSION
```
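
The linked tutorial covers the Workload Identity side in full. Purely as a sketch of the Secret Manager side (the secret name mirrors the example above and `$HF_TOKEN` is a placeholder for your HuggingFace token):

```bash
# Store the token as a secret; each added value becomes a numbered version.
printf '%s' "$HF_TOKEN" | gcloud secrets create hugging_face_secret --data-file=-
gcloud secrets versions list hugging_face_secret
```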

### Step 5: login to gcloud

Run the following gcloud command for authorization:

```bash
gcloud auth application-default login
```

### Step 6: terraform initialize, plan and apply

Run the following terraform commands:

```bash
# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
```

The results can be viewed by running the following:

```bash
kubectl logs job/latency-profile-generator
```
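
If the job has not finished yet, you can block until it completes and then stream its logs; a sketch (add `-n <namespace>` if the job was not deployed to your default namespace):

```bash
# Wait for the benchmark job to finish, then follow its logs.
kubectl wait --for=condition=complete job/latency-profile-generator --timeout=60m
kubectl logs job/latency-profile-generator --follow
```
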
8 changes: 8 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/build.tf
@@ -0,0 +1,8 @@
resource "null_resource" "build_and_push_image" {
  count      = var.build_latency_profile_generator_image ? 1 : 0
  depends_on = [resource.google_project_service.cloudbuild]
  provisioner "local-exec" {
    working_dir = path.module
    command     = "gcloud builds submit --tag ${var.artifact_registry}/latency-profile:latest container"
  }
}
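
Because the image build is modeled as a `null_resource`, Terraform will not rebuild it unless something forces a replacement. A sketch for forcing a rebuild on the next apply (the `[0]` index comes from the `count` meta-argument above):

```sh
terraform apply -replace='null_resource.build_and_push_image[0]'
```
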
21 changes: 21 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/container/Dockerfile
@@ -0,0 +1,21 @@
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev

RUN apt-get update -y \
    && apt-get install -y python3-pip git vim curl wget
RUN pip3 install --upgrade pip
RUN pip install packaging torch transformers
WORKDIR /workspace

# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

RUN pip install -U "huggingface_hub[cli]"

RUN wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

COPY benchmark_serving.py benchmark_serving.py
COPY latency_throughput_curve.sh latency_throughput_curve.sh

RUN chmod +x latency_throughput_curve.sh
RUN chmod +x benchmark_serving.py
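
For local iteration on this image without going through Cloud Build, a sketch (run from the profile-generator directory; the tag is a placeholder and should match your `artifact_registry`):

```sh
# May first require: gcloud auth configure-docker us-central1-docker.pkg.dev
docker build -t $ARTIFACT_REGISTRY/latency-profile:latest container
docker push $ARTIFACT_REGISTRY/latency-profile:latest
```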