Commit b54c7a9: Update README.md

Bslabe123 committed Aug 23, 2024 · 1 parent 99907c6

Showing 1 changed file with 13 additions and 35 deletions: benchmarks/benchmark/tools/profile-generator/README.md
* [Step 4: create and configure terraform.tfvars](#step-4--create-and-configure-terraformtfvars)
* [[optional] set-up credentials config with kubeconfig](#optional-set-up-credentials-config-with-kubeconfig)
* [[optional] set up secret token in Secret Manager](#optional-set-up-secret-token-in-secret-manager)
* [Step 6: terraform initialize, plan and apply](#step-6--terraform-initialize-plan-and-apply)
* [Inputs](#inputs)
<!-- TOC -->
Set the `output_bucket` in your `terraform.tfvars` to this gcs bucket.

The Latency profile generator requires `storage.admin` access to write output to
the given output gcs bucket. If you followed the steps in `../../infra`, then you
are already logged into gcloud and have a kubernetes and gcloud service account
created with the proper access to the created output bucket. If you are not
logged into gcloud, run the following:

```bash
gcloud auth application-default login
```

To give viewer permissions on the gcs bucket to the gcloud service account,
run the following:
```bash
gcloud storage buckets add-iam-policy-binding gs://$OUTPUT_BUCKET/
```

Your kubernetes service account will inherit the reader permissions.
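For reference, a complete invocation might look like the following sketch; the bucket name and service account address are hypothetical, and `roles/storage.objectViewer` is the standard role granting read access to bucket objects:

```bash
OUTPUT_BUCKET=my-benchmark-bucket                    # hypothetical bucket name
GSA=benchmark-sa@my-project.iam.gserviceaccount.com  # hypothetical service account

# Grant read access on the output bucket to the gcloud service account
gcloud storage buckets add-iam-policy-binding "gs://${OUTPUT_BUCKET}/" \
  --member="serviceAccount:${GSA}" \
  --role=roles/storage.objectViewer
```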

You will set the `latency_profile_kubernetes_service_account` in your
`terraform.tfvars` to the kubernetes service account name.

### Step 3: create artifact repository for automated Latency Profile Generator docker build

The latency profile generator rebuilds the docker image on each terraform apply
if `build_latency_profile_generator_image` is set to `true` (the default).
The containers will be pushed to the given `artifact_registry`. This artifact
repository is expected to already exist. If you created your cluster via
`../../infra/`, then an artifact repository was created for you with the same
own via this command:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```
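To confirm the repository exists before running terraform, you can describe it; this sketch assumes the same repository name and location as the command above:

```bash
# Prints repository metadata, or errors if the repository does not exist
gcloud artifacts repositories describe ai-benchmark --location=us-central1
```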


### Step 4: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `./sample-tfvars` is provided as an example
Fill out your `terraform.tfvars` with the desired model and server configuration:
- `credentials_config` - credentials for cluster to deploy Latency Profile Generator benchmark tool on
- `project_id` - project id for enabling dependent services for building Latency Profile Generator artifacts
- `artifact_registry` - artifact registry to upload Latency Profile Generator artifacts to
- `inference_server_service` - an accessible service name for inference workload to be benchmarked **(Note: If you are using a non-80 port for your model server service, it should be specified here. Example: `my-service-name:9000`)**
- `tokenizer` - must match the model running on the inference workload to be benchmarked
- `inference_server_framework` - the inference workload framework
- `build_latency_profile_generator_image` - whether the latency profile generator image will be built (defaults to `true`)
- `targets` - the model servers to target for benchmarking; set `manual` to benchmark a model server already running in the cluster
- `output_bucket` - gcs bucket to write benchmarking metrics to.
- `latency_profile_kubernetes_service_account` - service account giving access to latency profile generator to write to `output_bucket`
- `k8s_hf_secret` - Name of secret for huggingface token stored in k8s
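As an illustrative sketch only (every name and value below is hypothetical; `./sample-tfvars` remains the authoritative example), a filled-out `terraform.tfvars` might look like:

```hcl
credentials_config = {
  kubeconfig = {
    context = "my-gke-context" # hypothetical kubeconfig context
  }
}

project_id        = "my-project"
artifact_registry = "us-central1-docker.pkg.dev/my-project/ai-benchmark"

# Benchmark target and model configuration
inference_server_service   = "my-service-name:9000" # include the port if not 80
inference_server_framework = "vllm"
tokenizer                  = "tiiuae/falcon-7b"

build_latency_profile_generator_image = true

# Output and credentials
output_bucket                              = "my-benchmark-bucket"
latency_profile_kubernetes_service_account = "sample-runner-ksa"
k8s_hf_secret                              = "hf-token"
```

`targets` is omitted here because its exact shape depends on your setup; see its description above.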

#### [optional] set-up credentials config with kubeconfig

```bash
terraform apply
```

A results file will appear in the GCS bucket specified as `output_bucket` in the
input variables.
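Once the apply completes, one way to retrieve the results (assuming an authenticated `gcloud` and the same `$OUTPUT_BUCKET` value as in your tfvars) is:

```bash
# List result files written by the latency profile generator
gcloud storage ls "gs://${OUTPUT_BUCKET}/"

# Copy everything down for local inspection
gcloud storage cp -r "gs://${OUTPUT_BUCKET}/" ./results/
```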

<!-- BEGIN_TF_DOCS -->

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_artifact_registry"></a> [artifact\_registry](#input\_artifact\_registry) | Artifact registry for storing Latency Profile Generator container. | `string` | `null` | no |
| <a name="input_credentials_config"></a> [credentials\_config](#input\_credentials\_config) | Configure how Terraform authenticates to the cluster. | <pre>object({<br> fleet_host = optional(string)<br> kubeconfig = optional(object({<br> context = optional(string)<br> path = optional(string, "~/.kube/config")<br> }))<br> })</pre> | n/a | yes |
| <a name="input_hugging_face_secret"></a> [hugging\_face\_secret](#input\_hugging\_face\_secret) | name of the kubectl huggingface secret token; stored in Secret Manager. Security considerations: https://kubernetes.io/docs/concepts/security/secrets-good-practices/ | `string` | `null` | no |
| <a name="input_hugging_face_secret_version"></a> [hugging\_face\_secret\_version](#input\_hugging\_face\_secret\_version) | Secret version in Secret Manager | `string` | `null` | no |
| <a name="input_inference_server_framework"></a> [inference\_server\_framework](#input\_inference\_server\_framework) | Benchmark server configuration for inference server framework. Can be one of: vllm, tgi, tensorrt\_llm\_triton, sax | `string` | `"tgi"` | no |
| <a name="input_inference_server_service"></a> [inference\_server\_service](#input\_inference\_server\_service) | Inference server service | `string` | n/a | yes |
| <a name="input_k8s_hf_secret"></a> [k8s\_hf\_secret](#input\_k8s\_hf\_secret) | Name of secret for huggingface token; stored in k8s | `string` | `null` | no |
| <a name="input_ksa"></a> [ksa](#input\_ksa) | Kubernetes Service Account used for workload. | `string` | `"default"` | no |
| <a name="input_latency_profile_kubernetes_service_account"></a> [latency\_profile\_kubernetes\_service\_account](#input\_latency\_profile\_kubernetes\_service\_account) | Kubernetes Service Account to be used for the latency profile generator tool | `string` | `"sample-runner-ksa"` | no |
| <a name="input_max_num_prompts"></a> [max\_num\_prompts](#input\_max\_num\_prompts) | Benchmark server configuration for max number of prompts. | `number` | `1000` | no |
| <a name="input_max_output_len"></a> [max\_output\_len](#input\_max\_output\_len) | Benchmark server configuration for max output length. | `number` | `256` | no |
| <a name="input_max_prompt_len"></a> [max\_prompt\_len](#input\_max\_prompt\_len) | Benchmark server configuration for max prompt length. | `number` | `256` | no |
| <a name="input_namespace"></a> [namespace](#input\_namespace) | Namespace used for model and benchmarking deployments. | `string` | `"default"` | no |
| <a name="input_output_bucket"></a> [output\_bucket](#input\_output\_bucket) | Bucket name for storing results | `string` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | Project id of existing or created project. | `string` | n/a | yes |
| <a name="input_templates_path"></a> [templates\_path](#input\_templates\_path) | Path where manifest templates will be read from. Set to null to use the default manifests | `string` | `null` | no |
| <a name="input_tokenizer"></a> [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | no |

<!-- END_TF_DOCS -->
