diff --git a/benchmarks/README.md b/benchmarks/README.md
index ae324f0c8..e9157887f 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -15,13 +15,13 @@ via these terraform scripts on a Standard cluster that you've created yourself.
 This tutorial assumes you have access to use google storage APIs via Application Default Credentials (ADC). To login, you can run the following:
 
-```
+```sh
 gcloud auth application-default login
 ```
 
 ### Terraform
 
-Install Terraform by following the documentation at https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli.
+Install Terraform by following the documentation at <https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli>.
 This requires a minimum Terraform version of 1.7.4
 
 ### python
@@ -34,10 +34,11 @@ You may need to run pip install -r benchmark/dataset/ShareGPT_v3_unflitered_clea
 This section goes over an end to end example to deploy and benchmark the Falcon 7b model using [TGI](https://huggingface.co/docs/text-generation-inference/en/index) on a Standard GKE Cluster with GPUs.
 
-Each step below has more details in their respective directoy README.md. It is recommended
+Each step below has more details in their respective directory README.md. It is recommended
 that you read through the available options at least once when testing your own models.
 
 At a high level, running an inference benchmark in GKE involves these five steps:
+
 1. Create the cluster
 2. Configure the cluster
 3. Deploy the inference server
@@ -51,7 +52,8 @@ Set up the infrastructure by creating a GKE cluster with appropriate accelerator
 configuration. To create a GPU cluster, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
 # Stage 1 creates the cluster.
 cd infra/stage-1
@@ -68,8 +70,10 @@ terraform plan
 # Run apply if the changes look good by confirming the prompt.
 terraform apply
 ```
+
 To verify that the cluster has been set up correctly, run
-```
+
+```sh
 # Get credentials using fleet membership
 gcloud container fleet memberships get-credentials --project
 # Run a kubectl command to verify
 kubectl get nodes
 ```
@@ -80,7 +84,8 @@
 ### 2. Configure the cluster
 To configure the cluster to run inference workloads we need to set up workload identity, GCS Fuse and DCGM for GPU metrics. In the ai-on-gke/benchmarks folder run:
-```
+
+```sh
 # Stage 2 configures the cluster for running inference workloads.
 cd infra/stage-2
@@ -104,7 +109,8 @@ terraform apply
 ```
 
 ### 3. Deploy the inference server
 To deploy TGI with a sample model, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
 # text-generation-inference is the inference workload we'll deploy.
 cd inference-server/text-generation-inference
@@ -125,9 +131,11 @@ terraform apply
 ```
 
 It may take a minute or two for the inference server to be ready to serve. To verify that the model is running, you can run:
-```
+
+```sh
 kubectl get deployment -n benchmark
 ```
+
 This will show the status of the TGI server running.
 
 ### 4. Deploy the benchmark
@@ -135,7 +143,8 @@ This will show the status of the TGI server running.
 #### Prepare the benchmark dataset
 To prepare the dataset for the Locust inference benchmark, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
 # This folder contains a script that prepares the prompts for ShareGPT_v3_unflitered_cleaned_split dataset
 # that works out of the box with the locust benchmarking expected format.
 cd benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split
@@ -147,7 +156,8 @@ python3 upload_sharegpt.py --gcs_path="gs://${PROJECT_ID}-ai-gke-benchmark-fuse/
 #### Deploy the benchmarking tool
 To deploy the Locust inference benchmark with the above model, in the ai-on-gke/benchmarks folder run:
-```
+
+```sh
 # This folder contains the benchmark tool that generates requests for your workload
 cd benchmark/tools/locust-load-inference
@@ -173,7 +183,7 @@ To further interact with the Locust inference benchmark, view the README.md file
 An end to end Locust benchmark that runs for a given amount of time can be triggered via a curl command to the Locust Runner service:
-```
+```sh
 # get the locust runner endpoint
 kubectl get service -n benchmark locust-runner-api
@@ -182,7 +192,8 @@ curl -XGET http://$RUNNER_ENDPOINT_IP:8000/run
 ```
 
 A results file will appear in the GCS bucket specified as output_bucket in input variables once the benchmark is completed. Metrics and Locust statistics are visible under the [Cloud Monitoring metrics explorer](http://pantheon.corp.google.com/monitoring/metrics-explorer). In the ai-on-gke/benchmarks/benchmark/tools/locust-load-inference, run the following command to create a sample custom dashboard for the above related example:
-```
+
+```sh
 # apply the sample dashboard to easily view and explore metrics
 gcloud monitoring dashboards create --config-from-file ./sample-dashboards/tgi-dashboard.yaml
 ```
@@ -192,9 +203,10 @@ View the results in the [Cloud Monitoring Dashboards](https://pantheon.corp.goog
 For more ways to interact with the locust benchmarking tooling, see the instructions in the [locust-load-inference README.md here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/benchmarks/benchmark/tools/locust-load-inference/README.md#step-9-start-an-end-to-end-benchmark).
 
 ### 6. Clean Up
+
 To clean up the above setup, in the ai-on-gke/benchmarks folder run:
 
-```
+```sh
 # Run destroy on locust load generator
 cd benchmark/tools/locust-load-inference
 terraform destroy
@@ -215,4 +227,4 @@
 # Run destroy on infra/stage-1 resources
 cd ../stage-1
 terraform destroy
-```
\ No newline at end of file
+```
diff --git a/benchmarks/benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split/upload_sharegpt.py b/benchmarks/benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split/upload_sharegpt.py
index 36435e354..1c72c6c26 100644
--- a/benchmarks/benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split/upload_sharegpt.py
+++ b/benchmarks/benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split/upload_sharegpt.py
@@ -95,7 +95,7 @@ def main(gcs_path: str, overwrite: bool):
     parser.add_argument('--overwrite', default=False,
                         action=argparse.BooleanOptionalAction)
     args = parser.parse_args()
-    gcs_uri_pattern = "^gs:\/\/[a-z0-9.\-_]{3,63}\/(.+\/)*(.+)$"
+    gcs_uri_pattern = "^gs:\\/\\/[a-z0-9.\\-_]{3,63}\\/(.+\\/)*(.+)$"
     if not re.match(gcs_uri_pattern, args.gcs_path):
         raise ValueError(
             f"Invalid GCS path: {args.gcs_path}, expecting format \"gs://$BUCKET/$FILENAME\"")
diff --git a/benchmarks/benchmark/tools/locust-load-inference/sample-tfvars/tgi-sample.tfvars b/benchmarks/benchmark/tools/locust-load-inference/sample-tfvars/tgi-sample.tfvars
index 9296b20b0..6dc8bb4ad 100644
--- a/benchmarks/benchmark/tools/locust-load-inference/sample-tfvars/tgi-sample.tfvars
+++ b/benchmarks/benchmark/tools/locust-load-inference/sample-tfvars/tgi-sample.tfvars
@@ -7,7 +7,9 @@
 project_id = "$PROJECT_ID"
 namespace = "benchmark"
 ksa = "benchmark-ksa"
-k8s_hf_secret = "hf-token"
+
+# This is needed for loading the gated models on HF
+# k8s_hf_secret = "hf-token"
 
 # Locust service configuration
 artifact_registry = "us-central1-docker.pkg.dev/$PROJECT_ID/ai-benchmark"
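As a quick sanity check of the `upload_sharegpt.py` change above, the following minimal sketch (not part of the diff; the bucket and object names are made up) exercises the escaped `gcs_uri_pattern` the same way the script does, accepting `gs://bucket/object` paths while rejecting a bare bucket:

```python
import re

# Same string literal as the "+" line in the diff; doubling the backslashes
# silences Python's invalid-escape-sequence warning without changing the
# regex that is compiled at runtime.
gcs_uri_pattern = "^gs:\\/\\/[a-z0-9.\\-_]{3,63}\\/(.+\\/)*(.+)$"

# A bucket plus an object path matches (hypothetical names)...
assert re.match(gcs_uri_pattern, "gs://my-bucket/prompts/filtered_prompts.txt")
# ...while a bare bucket with no object name is rejected, as before the change.
assert not re.match(gcs_uri_pattern, "gs://my-bucket")
```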