From b54c7a96102b7b247d1936fbc831d7e9837703f1 Mon Sep 17 00:00:00 2001
From: Brendan Slabe
Date: Fri, 23 Aug 2024 23:01:52 +0800
Subject: [PATCH] Update README.md

---
 .../tools/profile-generator/README.md | 48 +++++--------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/benchmarks/benchmark/tools/profile-generator/README.md b/benchmarks/benchmark/tools/profile-generator/README.md
index 7145cd23b..f7df0c738 100644
--- a/benchmarks/benchmark/tools/profile-generator/README.md
+++ b/benchmarks/benchmark/tools/profile-generator/README.md
@@ -10,7 +10,6 @@
   * [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
     * [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
     * [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
-  * [Step 5: login to gcloud](#step-5--login-to-gcloud)
   * [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
 * [Inputs](#inputs)
@@ -41,8 +40,13 @@
 Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.
 
 The Latency profile generator requires storage.admin access to write output to
 the given output gcs bucket. If you followed steps in `../../infra`, then you
-already have a kubernetes and gcloud service account created that has the proper
-access to the created output bucket.
+are likely already logged into gcloud and have a kubernetes and gcloud service
+account created with the proper access to the created output bucket. If you
+are not logged into gcloud, run the following:
+
+```bash
+gcloud auth application-default login
+```
 
 To give viewer permissions on the gcs bucket to the gcloud service account,
 run the following:
@@ -54,12 +58,13 @@
 gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/
 
 Your kubernetes service account will inherit the reader permissions.
-You will set the `lantency_profile_kubernetes_service_account` in your
+You will set the `latency_profile_kubernetes_service_account` in your
 `terraform.tfvars` to the kubernetes service account name.
 
 ### Step 3: create artifact repository for automated Latency Profile Generator docker build
 
-The latency profile generator rebuilds the docker file on each terraform apply.
+The latency profile generator rebuilds the docker image on each terraform apply
+if `build_latency_profile_generator_image` is set to true (default is true).
 The containers will be pushed to the given `artifact_registry`. This artifact
 repository is expected to already exist. If you created your cluster via
 `../../infra/`, then an artifact repository was created for you with the same
@@ -70,7 +75,6 @@ own via this command:
 
 ```
 gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
 ```
 
-
 ### Step 4: create and configure terraform.tfvars
 
 Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
@@ -86,11 +90,11 @@ Fill out your `terraform.tfvars` with the desired model and server configuration
 
 - `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
 - `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
 - `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
-- `inference_server_service` - an accessible service name for inference workload to be benchmarked **(Note: If you are using a non-80 port for your model server service, it should be specified here. Example: `my-service-name:9000`)**
-- `tokenizer` - must match the model running on the inference workload to be benchmarked
-- `inference_server_framework` - the inference workload framework
+- `build_latency_profile_generator_image` - whether to build the latency profile generator image (default is true)
+- `targets` - which model servers to target for benchmarking; set to `manual` to benchmark a model server already running in the cluster
 - `output_bucket` - gcs bucket to write benchmarking metrics to.
 - `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
+- `k8s_hf_secret` - name of the k8s secret that stores the huggingface token
 
 #### [optional] set-up credentials config with kubeconfig
 
@@ -171,29 +175,3 @@ terraform apply
 
 A results file will appear in GCS bucket specified as `output_bucket` in input
 variables.
-
-
-
-## Inputs
-
-| Name | Description | Type | Default | Required |
-|------|-------------|------|---------|:--------:|
-| [artifact\_registry](#input\_artifact\_registry) | Artifact registry for storing Latency Profile Generator container. | `string` | `null` | no |
-| [credentials\_config](#input\_credentials\_config) | Configure how Terraform authenticates to the cluster. | <pre>object({<br>  fleet_host = optional(string)<br>  kubeconfig = optional(object({<br>    context = optional(string)<br>    path = optional(string, "~/.kube/config")<br>  }))<br>})</pre> | n/a | yes |
-| [hugging\_face\_secret](#input\_hugging\_face\_secret) | name of the kubectl huggingface secret token; stored in Secret Manager. Security considerations: https://kubernetes.io/docs/concepts/security/secrets-good-practices/ | `string` | `null` | no |
-| [hugging\_face\_secret\_version](#input\_hugging\_face\_secret\_version) | Secret version in Secret Manager | `string` | `null` | no |
-| [inference\_server\_framework](#input\_inference\_server\_framework) | Benchmark server configuration for inference server framework. Can be one of: vllm, tgi, tensorrt\_llm\_triton, sax | `string` | `"tgi"` | no |
-| [inference\_server\_service](#input\_inference\_server\_service) | Inference server service | `string` | n/a | yes |
-| [k8s\_hf\_secret](#input\_k8s\_hf\_secret) | Name of secret for huggingface token; stored in k8s | `string` | `null` | no |
-| [ksa](#input\_ksa) | Kubernetes Service Account used for workload. | `string` | `"default"` | no |
-| [latency\_profile\_kubernetes\_service\_account](#input\_latency\_profile\_kubernetes\_service\_account) | Kubernetes Service Account to be used for the latency profile generator tool | `string` | `"sample-runner-ksa"` | no |
-| [max\_num\_prompts](#input\_max\_num\_prompts) | Benchmark server configuration for max number of prompts. | `number` | `1000` | no |
-| [max\_output\_len](#input\_max\_output\_len) | Benchmark server configuration for max output length. | `number` | `256` | no |
-| [max\_prompt\_len](#input\_max\_prompt\_len) | Benchmark server configuration for max prompt length. | `number` | `256` | no |
-| [namespace](#input\_namespace) | Namespace used for model and benchmarking deployments. | `string` | `"default"` | no |
-| [output\_bucket](#input\_output\_bucket) | Bucket name for storing results | `string` | n/a | yes |
-| [project\_id](#input\_project\_id) | Project id of existing or created project. | `string` | n/a | yes |
-| [templates\_path](#input\_templates\_path) | Path where manifest templates will be read from. Set to null to use the default manifests | `string` | `null` | no |
-| [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | no |
-
-
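
For reviewers: the variables this patch documents (`build_latency_profile_generator_image`, `targets`, `k8s_hf_secret`) might be set in `terraform.tfvars` roughly as sketched below. This is an illustration only: every value (project, registry, bucket, service-account, and secret names) is a hypothetical placeholder, and the exact type of `targets` should be checked against the module's `variables.tf`.

```hcl
# Sketch of a terraform.tfvars using the fields this patch documents.
# All values are hypothetical placeholders, not repository defaults.
credentials_config = {
  kubeconfig = {
    path = "~/.kube/config"
  }
}

project_id        = "my-project"
artifact_registry = "us-central1-docker.pkg.dev/my-project/ai-benchmark"

# Set to false to skip the per-apply docker build if an image already exists.
build_latency_profile_generator_image = true

# Benchmark a model server that is already running in the cluster.
targets = "manual"

output_bucket                               = "my-benchmark-results-bucket"
latency_profile_kubernetes_service_account = "sample-runner-ksa"

# Name of the k8s secret holding the huggingface token.
k8s_hf_secret = "hf-token"
```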