From b54c7a96102b7b247d1936fbc831d7e9837703f1 Mon Sep 17 00:00:00 2001
From: Brendan Slabe
Date: Fri, 23 Aug 2024 23:01:52 +0800
Subject: [PATCH] Update README.md

---
 .../tools/profile-generator/README.md | 48 +++++--------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/benchmarks/benchmark/tools/profile-generator/README.md b/benchmarks/benchmark/tools/profile-generator/README.md
index 7145cd23b..f7df0c738 100644
--- a/benchmarks/benchmark/tools/profile-generator/README.md
+++ b/benchmarks/benchmark/tools/profile-generator/README.md
@@ -10,7 +10,6 @@
   * [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
     * [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
     * [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
-  * [Step 5: login to gcloud](#step-5--login-to-gcloud)
   * [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
 * [Inputs](#inputs)
@@ -41,8 +40,13 @@
 Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.
 
 The Latency profile generator requires storage.admin access to write output to
 the given output gcs bucket. If you followed steps in `../../infra`, then you
-already have a kubernetes and gcloud service account created that has the proper
-access to the created output bucket.
+are likely already logged into gcloud and have a kubernetes and gcloud service
+account created with the proper access to the created output bucket. If you
+are not logged into gcloud, run the following:
+
+```bash
+gcloud auth application-default login
+```
 
 To give viewer permissions on the gcs bucket to the gcloud service account,
 run the following:
@@ -54,12 +58,13 @@
 gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/
 
 Your kubernetes service account will inherit the reader permissions.
-You will set the `lantency_profile_kubernetes_service_account` in your
+You will set the `latency_profile_kubernetes_service_account` in your
 `terraform.tfvars` to the kubernetes service account name.
 
 ### Step 3: create artifact repository for automated Latency Profile Generator docker build
 
-The latency profile generator rebuilds the docker file on each terraform apply.
+The latency profile generator rebuilds the docker image on each terraform apply
+if `build_latency_profile_generator_image` is set to true (default is true).
 The containers will be pushed to the given `artifact_registry`. This artifact
 repository is expected to already exist. If you created your cluster via
 `../../infra/`, then an artifact repository was created for you with the same
@@ -70,7 +75,6 @@ own via this command:
 
 ```
 gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
 ```
 
-
 ### Step 4: create and configure terraform.tfvars
 
 Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
@@ -86,11 +90,11 @@ Fill out your `terraform.tfvars` with the desired model and server configuration
 
 - `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
 - `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
 - `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
-- `inference_server_service` - an accessible service name for inference workload to be benchmarked **(Note: If you are using a non-80 port for your model server service, it should be specified here. Example: `my-service-name:9000`)**
-- `tokenizer` - must match the model running on the inference workload to be benchmarked
-- `inference_server_framework` - the inference workload framework
+- `build_latency_profile_generator_image` - whether to build the latency profile generator image (default is true)
+- `targets` - which model servers to target for benchmarking; set to `manual` to benchmark a model server already running in the cluster
 - `output_bucket` - gcs bucket to write benchmarking metrics to.
 - `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
+- `k8s_hf_secret` - name of the k8s secret that stores the huggingface token
 
 #### [optional] set-up credentials config with kubeconfig
 
@@ -171,29 +175,3 @@ terraform apply
 
 A results file will appear in GCS bucket specified as `output_bucket` in input
 variables.
-
-
-
-## Inputs
-
-| Name | Description | Type | Default | Required |
-|------|-------------|------|---------|:--------:|
-| [artifact\_registry](#input\_artifact\_registry) | Artifact registry for storing Latency Profile Generator container. | `string` | `null` | no |
-| [credentials\_config](#input\_credentials\_config) | Configure how Terraform authenticates to the cluster. | <pre>object({<br>  fleet_host = optional(string)<br>  kubeconfig = optional(object({<br>    context = optional(string)<br>    path = optional(string, "~/.kube/config")<br>  }))<br>})</pre> | n/a | yes |
-| [hugging\_face\_secret](#input\_hugging\_face\_secret) | name of the kubectl huggingface secret token; stored in Secret Manager. Security considerations: https://kubernetes.io/docs/concepts/security/secrets-good-practices/ | `string` | `null` | no |
-| [hugging\_face\_secret\_version](#input\_hugging\_face\_secret\_version) | Secret version in Secret Manager | `string` | `null` | no |
-| [inference\_server\_framework](#input\_inference\_server\_framework) | Benchmark server configuration for inference server framework. Can be one of: vllm, tgi, tensorrt\_llm\_triton, sax | `string` | `"tgi"` | no |
-| [inference\_server\_service](#input\_inference\_server\_service) | Inference server service | `string` | n/a | yes |
-| [k8s\_hf\_secret](#input\_k8s\_hf\_secret) | Name of secret for huggingface token; stored in k8s | `string` | `null` | no |
-| [ksa](#input\_ksa) | Kubernetes Service Account used for workload. | `string` | `"default"` | no |
-| [latency\_profile\_kubernetes\_service\_account](#input\_latency\_profile\_kubernetes\_service\_account) | Kubernetes Service Account to be used for the latency profile generator tool | `string` | `"sample-runner-ksa"` | no |
-| [max\_num\_prompts](#input\_max\_num\_prompts) | Benchmark server configuration for max number of prompts. | `number` | `1000` | no |
-| [max\_output\_len](#input\_max\_output\_len) | Benchmark server configuration for max output length. | `number` | `256` | no |
-| [max\_prompt\_len](#input\_max\_prompt\_len) | Benchmark server configuration for max prompt length. | `number` | `256` | no |
-| [namespace](#input\_namespace) | Namespace used for model and benchmarking deployments. | `string` | `"default"` | no |
-| [output\_bucket](#input\_output\_bucket) | Bucket name for storing results | `string` | n/a | yes |
-| [project\_id](#input\_project\_id) | Project id of existing or created project. | `string` | n/a | yes |
-| [templates\_path](#input\_templates\_path) | Path where manifest templates will be read from. Set to null to use the default manifests | `string` | `null` | no |
-| [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | no |
-
-
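
For reviewers: the variables this patch documents (`build_latency_profile_generator_image`, `targets`, `k8s_hf_secret`) might be set in `terraform.tfvars` roughly as sketched below. This is an illustration only: every value (project, registry, bucket, service-account, and secret names) is a hypothetical placeholder, and the exact type of `targets` should be checked against the module's `variables.tf`.

```hcl
# Sketch of a terraform.tfvars using the fields this patch documents.
# All values are hypothetical placeholders, not repository defaults.
credentials_config = {
  kubeconfig = {
    path = "~/.kube/config"
  }
}

project_id        = "my-project"
artifact_registry = "us-central1-docker.pkg.dev/my-project/ai-benchmark"

# Set to false to skip the per-apply docker build if an image already exists.
build_latency_profile_generator_image = true

# Benchmark a model server that is already running in the cluster.
targets = "manual"

output_bucket                               = "my-benchmark-results-bucket"
latency_profile_kubernetes_service_account = "sample-runner-ksa"

# Name of the k8s secret holding the huggingface token.
k8s_hf_secret = "hf-token"
```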