Add ability to scale on tgi custom metrics #263

Merged · 9 commits · Mar 6, 2024
2 changes: 1 addition & 1 deletion benchmarks/README.md
@@ -132,4 +132,4 @@ terraform plan
terraform apply
```

To further interact with the Locust inference benchmark, view the README.md file in `benchmark/tools/locust-load-inference`
@@ -1,10 +1,13 @@
This directory contains the script for uploading a filtered and formatted file of prompts based on the "anon8231489123/ShareGPT_Vicuna_unfiltered" dataset to a given GCS path.

Example usage:
```
python3 upload_sharegpt.py --gcs_path="gs://$BUCKET_NAME/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"
```

pre-work:
- upload_sharegpt.py may require additional python libraries; see below.
- upload_sharegpt.py assumes that the bucket already exists. If you've created your cluster via the terraform scripts in `./infra/stage-2`, then the bucket was created for you. (See `terraform.tfvars` in that directory for the name.) If it does not exist, make sure that you create your bucket $BUCKET_NAME in your project prior to running the script. You can do that with the following command:
```
gcloud storage buckets create gs://$BUCKET_NAME --location=BUCKET_LOCATION
```
@@ -20,7 +23,7 @@ Assumes in your environment you:
- have access to use google storage APIs via Application Default Credentials (ADC)

You may need to do the following:
- run "pip install google-cloud-storage" to install storage client library dependencies
- run "pip install wget google-cloud-storage" to install storage client library dependencies. (Optionally, you can run this within a venv, i.e. `python3 -m venv ./venv && source ./venv/bin/activate && pip install ...`)
- run "gcloud auth application-default login" to enable ADC

For more information on using the Google Cloud Storage API from Python, see https://cloud.google.com/python/docs/reference/storage
4 changes: 2 additions & 2 deletions benchmarks/benchmark/tools/locust-load-inference/README.md
@@ -58,7 +58,7 @@ The Locust workload requires storage.admin access to view the dataset in the given
To grant the required access on the GCS bucket to the gcloud service account, run the following:

```
gcloud storage buckets add-iam-policy-binding gs://$BUCKET \
  --member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
```
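
If you want to confirm the binding took effect before kicking off a benchmark (optional; assumes the `gcloud storage` command surface is available in your gcloud version), you can inspect the bucket's IAM policy:

```
# The service account should now be listed with roles/storage.admin
gcloud storage buckets get-iam-policy gs://$BUCKET
```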

@@ -237,4 +237,4 @@ To change the benchmark configuration, you will have to rerun terraform destroy
| <a name="input_sax_model"></a> [sax\_model](#input\_sax\_model) | Benchmark server configuration for sax model. Only required if framework is sax. | `string` | `""` | no |
| <a name="input_tokenizer"></a> [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | yes |
| <a name="input_use_beam_search"></a> [use\_beam\_search](#input\_use\_beam\_search) | Benchmark server configuration for use beam search. | `bool` | `false` | no |
<!-- END_TF_DOCS -->
@@ -12,13 +12,14 @@ artifact_registry = "us-central1-docker.pkg.dev/$PROJECT_ID/ai-benchmark"
inference_server_service = "tgi" # inference server service name
locust_runner_kubernetes_service_account = "sample-runner-ksa"
output_bucket = "benchmark-output"
gcs_path = "gs://ai-on-gke-benchmark/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"
gcs_path = "gs://${PROJECT_ID}-ai-gke-benchmark-fuse/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"

# Benchmark configuration for Locust Docker accessing inference server
inference_server_framework = "tgi"
tokenizer = "tiiuae/falcon-7b"

# Benchmark configuration for triggering single test via Locust Runner
test_duration = 60
# Increase test_users to allow more parallelism (especially when testing HPA)
test_users = 1
test_rate = 5
@@ -30,6 +30,8 @@ cp sample-terraform.tfvars terraform.tfvars

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The `credentials_config` variable is required.

Optionally configure HPA (Horizontal Pod Autoscaling) by setting `hpa_type`. Note: GMP (Google Managed Prometheus) must be enabled on this cluster (which is the default) to scale based on custom metrics. See `autoscaling.md` for more details.
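
For example, the corresponding `terraform.tfvars` entry could look like the sketch below. Only `hpa_type` is named above; the value shown (and any related threshold or replica-count variables) is an assumption to be checked against the module's `variables.tf` and `autoscaling.md`:

```
# Illustrative only: scale on a TGI custom metric surfaced via GMP.
# Confirm the accepted values for hpa_type in variables.tf / autoscaling.md.
hpa_type = "tgi_batch_current_size"
```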

#### Determine number of gpus

`gpu_count` should be configured according to the size of the model, with some overhead for the KV cache. Here's an example of figuring out how many GPUs you need to run a model:
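
As a rough, illustrative calculation (hypothetical numbers, not tied to a specific entry in this repo): a 7B-parameter model served in 16-bit precision needs about 7B × 2 bytes ≈ 14 GB for the weights alone; with the remaining ~10 GB of a 24 GB GPU left for the KV cache and activations, `gpu_count = 1` is typically sufficient, while the same model in 32-bit precision (~28 GB of weights) would already need `gpu_count = 2` on that hardware.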
@@ -0,0 +1,31 @@
# Autoscaling TGI

## tl;dr

Recommendation: TODO

## Autoscaling Options

### CPU

CPU scaling is a poor choice for this workload: on startup, TGI pulls the
model weights and then spends a minute or two of CPU time crunching numbers.
That startup burst causes the HPA to add a replica, which goes through the
same CPU-heavy startup, which causes the HPA to add yet another replica, and
so on. Eventually things settle and the HPA scales the replicas back down,
but the whole process can take up to an hour.

### Custom Metrics

Workload/custom metrics can be viewed in
https://console.cloud.google.com/monitoring/metrics-explorer. (Just search for
the metric name, e.g. "tgi_batch_current_size". The full name should be
"prometheus/tgi_batch_current_size/gauge")

#### `tgi_batch_current_size`

TODO

### External Metrics

TODO
@@ -0,0 +1,31 @@
# Custom Metrics Stackdriver Adapter

Adapted from https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

## Usage

To use this module, include it from your main terraform config, e.g.:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
}
```

For a workload identity enabled cluster, some additional configuration is
needed:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
workload_identity = {
enabled = true
project_id = "<PROJECT_ID>"
}
}
```

# TODO

This module should be moved out of the text-generation-inference subdirectory,
as it should be more broadly applicable.