
Commit

updating branch with main
german-grandas committed Aug 30, 2024
2 parents c8e5d35 + 203ddbc commit c437736
Showing 27 changed files with 1,675 additions and 26 deletions.
14 changes: 14 additions & 0 deletions applications/rag/main.tf
@@ -325,3 +325,17 @@ module "frontend" {
members_allowlist = var.frontend_members_allowlist != "" ? split(",", var.frontend_members_allowlist) : []
depends_on = [module.namespace]
}

resource "helm_release" "gmp-apps" {
  name      = "gmp-apps"
  provider  = helm.rag
  chart     = "../../charts/gmp-engine/"
  namespace = local.kubernetes_namespace
  # Timeout is increased to guarantee sufficient scale-up time for Autopilot nodes.
  timeout    = 1200
  depends_on = [module.inference-server, module.frontend]
  values = [
    "${file("${path.module}/podmonitoring.yaml")}"
  ]
}
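
To confirm the release applied its scrape configs, you can list the Google Managed Prometheus `PodMonitoring` resources after `terraform apply`; a sketch, assuming the application namespace resolves to `rag` (substitute whatever `local.kubernetes_namespace` evaluates to):

```sh
# List the PodMonitoring scrape configs installed by the gmp-apps release.
kubectl get podmonitorings.monitoring.googleapis.com -n rag
```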

40 changes: 26 additions & 14 deletions benchmarks/README.md
@@ -15,13 +15,13 @@ via these terraform scripts on a Standard cluster that you've created yourself.
This tutorial assumes you have access to the Google Cloud Storage APIs via Application Default Credentials (ADC).
To log in, you can run the following:

```sh
gcloud auth application-default login
```
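
If you are unsure whether ADC is already set up, a quick optional check is to request a token:

```sh
# Prints a token if ADC is configured; errors out otherwise.
gcloud auth application-default print-access-token > /dev/null && echo "ADC is configured"
```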

### Terraform

Install Terraform by following the documentation at <https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli>.
This requires a minimum Terraform version of 1.7.4.
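
To confirm your installed version meets that minimum:

```sh
# Should report v1.7.4 or newer.
terraform version
```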

### python
@@ -34,10 +34,11 @@ You may need to run pip install -r benchmark/dataset/ShareGPT_v3_unflitered_clea

This section goes over an end-to-end example of deploying and benchmarking the Falcon 7b model using [TGI](https://huggingface.co/docs/text-generation-inference/en/index) on a Standard GKE Cluster with GPUs.

Each step below has more details in its respective directory's README.md. It is recommended
that you read through the available options at least once when testing your own models.

At a high level, running an inference benchmark in GKE involves these five steps:

1. Create the cluster
2. Configure the cluster
3. Deploy the inference server
@@ -51,7 +52,8 @@ Set up the infrastructure by creating a GKE cluster with appropriate accelerator
configuration.

To create a GPU cluster, in the ai-on-gke/benchmarks folder run:

```sh
# Stage 1 creates the cluster.
cd infra/stage-1

@@ -68,8 +70,10 @@ terraform plan
# Run apply if the changes look good by confirming the prompt.
terraform apply
```

To verify that the cluster has been set up correctly, run

```sh
# Get credentials using fleet membership
gcloud container fleet memberships get-credentials <cluster-name> --project <project-id>

@@ -80,7 +84,8 @@ kubectl get nodes
### 2. Configure the cluster

To configure the cluster to run inference workloads we need to set up workload identity, GCS Fuse and DCGM for GPU metrics. In the ai-on-gke/benchmarks folder run:

```sh
# Stage 2 configures the cluster for running inference workloads.
cd infra/stage-2

@@ -104,7 +109,8 @@ terraform apply
### 3. Deploy the inference server

To deploy TGI with a sample model, in the ai-on-gke/benchmarks folder run:

```sh
# text-generation-inference is the inference workload we'll deploy.
cd inference-server/text-generation-inference

@@ -125,17 +131,20 @@ terraform apply
```

It may take a minute or two for the inference server to be ready to serve. To verify that the model is running, you can run:

```sh
kubectl get deployment -n benchmark
```

This will show the status of the running TGI server.
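
You can also send a single request to the model server to confirm it responds. The sketch below assumes the TGI Service in the `benchmark` namespace is named `tgi` and serves on port 80; check `kubectl get service -n benchmark` for the actual name and port:

```sh
# Forward a local port to the TGI service, then hit TGI's /generate endpoint once.
kubectl port-forward -n benchmark service/tgi 8080:80 &
curl -s http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is GKE?", "parameters": {"max_new_tokens": 50}}'
```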

### 4. Deploy the benchmark

#### Prepare the benchmark dataset

To prepare the dataset for the Locust inference benchmark, in the ai-on-gke/benchmarks folder run:

```sh
# This folder contains a script that prepares the prompts for ShareGPT_v3_unflitered_cleaned_split dataset
# that works out of the box with the locust benchmarking expected format.
cd benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split
@@ -147,7 +156,8 @@ python3 upload_sharegpt.py --gcs_path="gs://${PROJECT_ID}-ai-gke-benchmark-fuse/
#### Deploy the benchmarking tool

To deploy the Locust inference benchmark with the above model, in the ai-on-gke/benchmarks folder run:

```sh
# This folder contains the benchmark tool that generates requests for your workload
cd benchmark/tools/locust-load-inference

@@ -173,7 +183,7 @@ To further interact with the Locust inference benchmark, view the README.md file

An end-to-end Locust benchmark that runs for a given amount of time can be triggered via a curl command to the Locust Runner service:

```sh
# get the locust runner endpoint
kubectl get service -n benchmark locust-runner-api

@@ -182,7 +192,8 @@ curl -XGET http://$RUNNER_ENDPOINT_IP:8000/run
```
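
If the Locust Runner service is exposed through a LoadBalancer, its IP can be captured directly into the variable used above; a sketch, assuming an external IP has been assigned:

```sh
# Grab the external IP of the locust-runner-api service and trigger a run.
RUNNER_ENDPOINT_IP=$(kubectl get service -n benchmark locust-runner-api \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -XGET "http://$RUNNER_ENDPOINT_IP:8000/run"
```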

A results file will appear in the GCS bucket specified as `output_bucket` in the input variables once the benchmark is completed. Metrics and Locust statistics are visible under the [Cloud Monitoring metrics explorer](http://pantheon.corp.google.com/monitoring/metrics-explorer). In the ai-on-gke/benchmarks/benchmark/tools/locust-load-inference folder, run the following command to create a sample custom dashboard for the example above:

```sh
# apply the sample dashboard to easily view and explore metrics
gcloud monitoring dashboards create --config-from-file ./sample-dashboards/tgi-dashboard.yaml
```
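
The raw results file can also be pulled straight from the output bucket; a sketch, assuming `$OUTPUT_BUCKET` holds the same value as the `output_bucket` input variable (the exact file name depends on the run):

```sh
# List benchmark outputs and copy one locally for inspection.
gcloud storage ls gs://$OUTPUT_BUCKET/
gcloud storage cp gs://$OUTPUT_BUCKET/<results-file> .
```
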
@@ -192,9 +203,10 @@ View the results in the [Cloud Monitoring Dashboards](https://pantheon.corp.goog
For more ways to interact with the locust benchmarking tooling, see the instructions in the [locust-load-inference README.md here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/benchmarks/benchmark/tools/locust-load-inference/README.md#step-9-start-an-end-to-end-benchmark).

### 6. Clean Up

To clean up the above setup, in the ai-on-gke/benchmarks folder run:

```sh
# Run destroy on locust load generator
cd benchmark/tools/locust-load-inference
terraform destroy
@@ -215,4 +227,4 @@ terraform destroy
# Run destroy on infra/stage-1 resources
cd ../stage-1
terraform destroy
```
@@ -95,7 +95,7 @@ def main(gcs_path: str, overwrite: bool):
    parser.add_argument('--overwrite', default=False,
                        action=argparse.BooleanOptionalAction)
    args = parser.parse_args()
    gcs_uri_pattern = "^gs:\\/\\/[a-z0-9.\\-_]{3,63}\\/(.+\\/)*(.+)$"
    if not re.match(gcs_uri_pattern, args.gcs_path):
        raise ValueError(
            f"Invalid GCS path: {args.gcs_path}, expecting format \"gs://$BUCKET/$FILENAME\"")
@@ -7,7 +7,9 @@ project_id = "$PROJECT_ID"
namespace = "benchmark"
ksa = "benchmark-ksa"

# This is needed for loading the gated models on HF
# k8s_hf_secret = "hf-token"

# Locust service configuration
artifact_registry = "us-central1-docker.pkg.dev/$PROJECT_ID/ai-benchmark"
180 changes: 180 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/README.md
@@ -0,0 +1,180 @@
# AI on GKE Benchmark Latency Profile Generator

<!-- TOC -->
* [AI on GKE Benchmark Latency Profile Generator](#ai-on-gke-benchmark-latency-profile-generator)
  * [Overview](#overview)
  * [Instructions](#instructions)
    * [Step 1: create output bucket](#step-1--create-output-bucket)
    * [Step 2: create and give service account access to write to output gcs bucket](#step-2--create-and-give-service-account-access-to-write-to-output-gcs-bucket)
    * [Step 3: create artifact repository for automated Latency Profile Generator docker build](#step-3--create-artifact-repository-for-automated-latency-profile-generator-docker-build)
    * [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
      * [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
      * [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
    * [Step 5: login to gcloud](#step-5--login-to-gcloud)
    * [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
  * [Inputs](#inputs)
<!-- TOC -->

## Overview

This deploys the latency profile generator, which measures the throughput and
latency at various request rates for the model and model server of your choice.

It currently supports the following frameworks:
- tensorrt_llm_triton
- text generation inference (tgi)
- vllm
- sax
- jetstream

## Instructions

### Step 1: create output bucket

If you followed steps from `../../infra/` for creating your cluster and extra
resources, you will already have an output bucket created for you.
If not, you will have to create and manage your own gcs bucket for storing
benchmarking results.

Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.
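
If you do need to create the bucket yourself, a minimal sketch (bucket name and location are placeholders; pick whatever matches your project):

```bash
# Create a regional bucket to hold benchmark output.
gcloud storage buckets create gs://$OUTPUT_BUCKET --project=$PROJECT_ID --location=us-central1
```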

### Step 2: create and give service account access to write to output gcs bucket

The Latency Profile Generator requires storage.admin access to write output to
the given output gcs bucket. If you followed the steps in `../../infra`, then you
should already be logged into gcloud and have a Kubernetes and gcloud service
account created with the proper access to the created output bucket. If you are
not logged into gcloud, run the following:

```bash
gcloud auth application-default login
```

To give storage.admin permissions on the gcs bucket to the gcloud service account,
run the following:

```bash
gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/ \
  --member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
```

Your Kubernetes service account will inherit these permissions.
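
To double-check that the binding took effect, you can inspect the bucket's IAM policy; a sketch:

```bash
# The service account should appear under roles/storage.admin.
gcloud storage buckets get-iam-policy gs://$OUTPUT_BUCKET/ --format=json
```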

You will set the `latency_profile_kubernetes_service_account` in your
`terraform.tfvars` to the kubernetes service account name.

### Step 3: create artifact repository for automated Latency Profile Generator docker build

The latency profile generator rebuilds the docker image on each terraform apply
if `build_latency_profile_generator_image` is set to true (the default).
The containers will be pushed to the given `artifact_registry`. This artifact
repository is expected to already exist. If you created your cluster via
`../../infra/`, then an artifact repository was created for you with the same
name as the prefix in the same location as the cluster. You can also create your
own via this command:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```
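
After the first `terraform apply` has run the Cloud Build job, you can confirm the image landed in the repository; a sketch, assuming `$ARTIFACT_REGISTRY` holds the same value as the `artifact_registry` variable:

```bash
# Lists images (and their tags) in the benchmark repository.
gcloud artifacts docker images list $ARTIFACT_REGISTRY --include-tags
```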

### Step 4: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
file. You can copy the file as a starting point.
Note that at a minimum you will have to change the existing
`credentials_config`, `project_id`, and `artifact_registry`.

```bash
cp ./sample-tfvars terraform.tfvars
```

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The following variables are required:
- `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
- `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
- `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
- `build_latency_profile_generator_image` - whether the latency profile generator image will be built
- `targets` - the model servers to target for benchmarking; set the fields under `manual` if intending to benchmark a model server already running in the cluster
- `output_bucket` - gcs bucket to write benchmarking metrics to.
- `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
- `k8s_hf_secret` - Name of secret for huggingface token stored in k8s

#### [optional] set-up credentials config with kubeconfig

If your cluster has fleet management enabled, the existing `credentials_config`
can use the fleet host credentials like this:

```bash
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```

If your cluster does not have fleet management enabled, you can use your
cluster's kubeconfig in the `credentials_config`. You must isolate your
cluster's kubeconfig from other clusters in the default kube.config file.
To do this, run the following command:

```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

Then update your `terraform.tfvars` `credentials_config` to the following:

```bash
credentials_config = {
kubeconfig = {
path = "~/.kube/${CLUSTER_NAME}-kube.config"
}
}
```

#### [optional] set up secret token in Secret Manager

A model may require a security token to access it. For example, Llama2 from
HuggingFace is a gated model that requires a
[user access token](https://huggingface.co/docs/hub/en/security-tokens). If the
model you want to run does not require this, skip this step.

If you followed steps from `../../infra/`, Secret Manager and the user access
token should already be set up. If not, it is strongly recommended that you use
Workload Identity and Secret Manager to access the user access tokens to avoid
adding a plain text token into the terraform state. To do so, follow the
instructions for
[setting up a secret in Secret Manager here](https://cloud.google.com/kubernetes-engine/docs/tutorials/workload-identity-secrets).

Once complete, you should add these related secret values to your
`terraform.tfvars`:

```bash
# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
# ex. 1
hugging_face_secret_version = $SECRET_VERSION
```
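
The linked tutorial covers the Workload Identity side in full. Purely as a sketch of the Secret Manager side (the secret name mirrors the example above and `$HF_TOKEN` is a placeholder for your HuggingFace token):

```bash
# Store the token as a secret; each added value becomes a numbered version.
printf '%s' "$HF_TOKEN" | gcloud secrets create hugging_face_secret --data-file=-
gcloud secrets versions list hugging_face_secret
```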

### Step 5: login to gcloud

Run the following gcloud command for authorization:

```bash
gcloud auth application-default login
```

### Step 6: terraform initialize, plan and apply

Run the following terraform commands:

```bash
# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
```

The results can be viewed by running the following:

```bash
kubectl logs job/latency-profile-generator
```
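
If the job has not finished yet, you can block until it completes and then stream its logs; a sketch (add `-n <namespace>` if the job was not deployed to your default namespace):

```bash
# Wait for the benchmark job to finish, then follow its logs.
kubectl wait --for=condition=complete job/latency-profile-generator --timeout=60m
kubectl logs job/latency-profile-generator --follow
```
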
8 changes: 8 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/build.tf
@@ -0,0 +1,8 @@
resource "null_resource" "build_and_push_image" {
  count      = var.build_latency_profile_generator_image ? 1 : 0
  depends_on = [resource.google_project_service.cloudbuild]
  provisioner "local-exec" {
    working_dir = path.module
    command     = "gcloud builds submit --tag ${var.artifact_registry}/latency-profile:latest container"
  }
}
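
Because the image build is modeled as a `null_resource`, Terraform will not rebuild it unless something forces a replacement. A sketch for forcing a rebuild on the next apply (the `[0]` index comes from the `count` meta-argument above):

```sh
terraform apply -replace='null_resource.build_and_push_image[0]'
```
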
21 changes: 21 additions & 0 deletions benchmarks/benchmark/tools/profile-generator/container/Dockerfile
@@ -0,0 +1,21 @@
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev

RUN apt-get update -y \
    && apt-get install -y python3-pip git vim curl wget
RUN pip3 install --upgrade pip
RUN pip install packaging torch transformers
WORKDIR /workspace

# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

RUN pip install -U "huggingface_hub[cli]"

RUN wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

COPY benchmark_serving.py benchmark_serving.py
COPY latency_throughput_curve.sh latency_throughput_curve.sh

RUN chmod +x latency_throughput_curve.sh
RUN chmod +x benchmark_serving.py
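
For local iteration on this image without going through Cloud Build, a sketch (run from the profile-generator directory; the tag is a placeholder and should match your `artifact_registry`):

```sh
# May first require: gcloud auth configure-docker us-central1-docker.pkg.dev
docker build -t $ARTIFACT_REGISTRY/latency-profile:latest container
docker push $ARTIFACT_REGISTRY/latency-profile:latest
```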