A few minor fixes for benchmarks #784

Merged (1 commit) on Aug 27, 2024
benchmarks/README.md (40 changes: 26 additions & 14 deletions)
@@ -15,13 +15,13 @@
via these terraform scripts on a Standard cluster that you've created yourself.
This tutorial assumes you have access to use google storage APIs via Application Default Credentials (ADC).
To login, you can run the following:

-```
+```sh
gcloud auth application-default login
```
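
As a quick check that Application Default Credentials are working, you can request an access token:

```sh
# Exits non-zero if ADC is not configured; prints a confirmation otherwise.
gcloud auth application-default print-access-token > /dev/null && echo "ADC is configured"
```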

### Terraform

-Install Terraform by following the documentation at https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli.
+Install Terraform by following the documentation at <https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli>.
This requires a minimum Terraform version of 1.7.4.
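
You can confirm the installed version with:

```sh
terraform version
```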

### python
@@ -34,10 +34,11 @@
You may need to run pip install -r benchmark/dataset/ShareGPT_v3_unflitered_clea

This section goes over an end to end example to deploy and benchmark the Falcon 7b model using [TGI](https://huggingface.co/docs/text-generation-inference/en/index) on a Standard GKE Cluster with GPUs.

-Each step below has more details in their respective directoy README.md. It is recommended
+Each step below has more details in their respective directory README.md. It is recommended
that you read through the available options at least once when testing your own models.

At a high level, running an inference benchmark in GKE involves these five steps:

1. Create the cluster
2. Configure the cluster
3. Deploy the inference server
@@ -51,7 +52,8 @@
Set up the infrastructure by creating a GKE cluster with appropriate accelerator
configuration.

To create a GPU cluster, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
# Stage 1 creates the cluster.
cd infra/stage-1

@@ -68,8 +70,10 @@
terraform plan
# Run apply if the changes look good by confirming the prompt.
terraform apply
```

To verify that the cluster has been set up correctly, run
-```
+
+```sh
# Get credentials using fleet membership
gcloud container fleet memberships get-credentials <cluster-name> --project <project-id>

@@ -80,7 +84,8 @@
kubectl get nodes
### 2. Configure the cluster

To configure the cluster to run inference workloads we need to set up workload identity, GCS Fuse and DCGM for GPU metrics. In the ai-on-gke/benchmarks folder run:
-```
+
+```sh
# Stage 2 configures the cluster for running inference workloads.
cd infra/stage-2

@@ -104,7 +109,8 @@
terraform apply
### 3. Deploy the inference server

To deploy TGI with a sample model, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
# text-generation-inference is the inference workload we'll deploy.
cd inference-server/text-generation-inference

@@ -125,17 +131,20 @@
terraform apply
```

It may take a minute or two for the inference server to be ready to serve. To verify that the model is running, you can run:
-```
+
+```sh
kubectl get deployment -n benchmark
```

This will show the status of the running TGI server.
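
For a deeper readiness check, you can port-forward to the server and send a small request. This is only a sketch: the deployment name, namespace, and port are assumptions based on a typical TGI setup, not values taken from these modules.

```sh
# Forward a local port to the TGI deployment (name and port assumed).
kubectl port-forward -n benchmark deployment/tgi 8080:80 &

# Ask the model for a few tokens; a JSON response means it is serving.
curl -s http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello, my name is", "parameters": {"max_new_tokens": 10}}'
```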

### 4. Deploy the benchmark

#### Prepare the benchmark dataset

To prepare the dataset for the Locust inference benchmark, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
# This folder contains a script that prepares the prompts for ShareGPT_v3_unflitered_cleaned_split dataset
# that works out of the box with the locust benchmarking expected format.
cd benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split
@@ -147,7 +156,8 @@
python3 upload_sharegpt.py --gcs_path="gs://${PROJECT_ID}-ai-gke-benchmark-fuse/
#### Deploy the benchmarking tool

To deploy the Locust inference benchmark with the above model, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
# This folder contains the benchmark tool that generates requests for your workload
cd benchmark/tools/locust-load-inference

@@ -173,7 +183,7 @@
To further interact with the Locust inference benchmark, view the README.md file

An end to end Locust benchmark that runs for a given amount of time can be triggered via a curl command to the Locust Runner service:

-```
+```sh
# get the locust runner endpoint
kubectl get service -n benchmark locust-runner-api

@@ -182,7 +192,8 @@
curl -XGET http://$RUNNER_ENDPOINT_IP:8000/run
```
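
If the runner service is exposed through a LoadBalancer, the RUNNER_ENDPOINT_IP used above can be captured like this (a sketch that assumes an external LoadBalancer IP has been assigned):

```sh
# Store the external IP of the Locust runner service for the curl call above.
RUNNER_ENDPOINT_IP=$(kubectl get service -n benchmark locust-runner-api \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```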

A results file will appear in the GCS bucket specified as output_bucket in the input variables once the benchmark completes. Metrics and Locust statistics are visible under the [Cloud Monitoring metrics explorer](http://pantheon.corp.google.com/monitoring/metrics-explorer). In the ai-on-gke/benchmarks/benchmark/tools/locust-load-inference folder, run the following command to create a sample custom dashboard for the example above:
-```
+
+```sh
# apply the sample dashboard to easily view and explore metrics
gcloud monitoring dashboards create --config-from-file ./sample-dashboards/tgi-dashboard.yaml
```
@@ -192,9 +203,10 @@
View the results in the [Cloud Monitoring Dashboards](https://pantheon.corp.goog
For more ways to interact with the locust benchmarking tooling, see the instructions in the [locust-load-inference README.md here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/benchmarks/benchmark/tools/locust-load-inference/README.md#step-9-start-an-end-to-end-benchmark).

### 6. Clean Up

To clean up the above setup, in the ai-on-gke/benchmarks folder run:

-```
+```sh
# Run destroy on locust load generator
cd benchmark/tools/locust-load-inference
terraform destroy
@@ -215,4 +227,4 @@
terraform destroy
# Run destroy on infra/stage-1 resources
cd ../stage-1
terraform destroy
-```
\ No newline at end of file
+```
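
The four destroy steps can also be scripted. A sketch, assuming you start from the ai-on-gke/benchmarks folder; note that -auto-approve skips Terraform's confirmation prompts:

```sh
# Tear down in reverse order of creation, using the paths from the steps above.
for dir in benchmark/tools/locust-load-inference \
           inference-server/text-generation-inference \
           infra/stage-2 \
           infra/stage-1; do
  (cd "$dir" && terraform destroy -auto-approve)
done
```
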
benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split/upload_sharegpt.py
@@ -95,7 +95,7 @@ def main(gcs_path: str, overwrite: bool):
parser.add_argument('--overwrite', default=False,
action=argparse.BooleanOptionalAction)
args = parser.parse_args()
-gcs_uri_pattern = "^gs:\/\/[a-z0-9.\-_]{3,63}\/(.+\/)*(.+)$"
+gcs_uri_pattern = "^gs:\\/\\/[a-z0-9.\\-_]{3,63}\\/(.+\\/)*(.+)$"
if not re.match(gcs_uri_pattern, args.gcs_path):
raise ValueError(
f"Invalid GCS path: {args.gcs_path}, expecting format \"gs://$BUCKET/$FILENAME\"")
benchmark/tools/locust-load-inference (sample Terraform variables file)
@@ -7,7 +7,9 @@
project_id = "$PROJECT_ID"
namespace = "benchmark"
ksa = "benchmark-ksa"

-k8s_hf_secret = "hf-token"
+
+# This is needed for loading the gated models on HF
+# k8s_hf_secret = "hf-token"

# Locust service configuration
artifact_registry = "us-central1-docker.pkg.dev/$PROJECT_ID/ai-benchmark"
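
If you do enable k8s_hf_secret for a gated model, the referenced secret must already exist in the cluster. A minimal sketch (the namespace and the key name HF_TOKEN are assumptions, not values taken from these modules):

```sh
# Create the Hugging Face token secret that the tfvars entry above references.
kubectl create secret generic hf-token \
  --namespace=benchmark \
  --from-literal=HF_TOKEN="<your-hf-access-token>"
```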