Add ability to scale on tgi custom metrics #263

Merged
merged 9 commits on Mar 6, 2024
benchmarks/README.md (2 changes: 1 addition & 1 deletion)
@@ -132,4 +132,4 @@ terraform plan
terraform apply
```

To further interact with the Locust inference benchmark, view the README.md file in `benchmark/tools/locust-load-inference`
@@ -1,10 +1,13 @@
This directory contains the script for uploading a filtered and formatted file of prompts based on the "anon8231489123/ShareGPT_Vicuna_unfiltered" dataset to a given GCS path.

Example usage:
-python3 upload_sharegpt.py --gcs_path="gs://$BUCKET_NAME/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"
+```
+python3 upload_sharegpt.py --gcs_path="gs://$BUCKET_NAME/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"
+```

pre-work:
-- upload_sharegpt.py assumes that the bucket already exists. If it does not exist, make sure that you create your bucket $BUCKET_NAME in your project prior to running the script. You can do that with the following command:
+- upload_sharegpt.py may require additional python libraries; see below.
+- upload_sharegpt.py assumes that the bucket already exists. If you've created your cluster via the terraform scripts in `./infra/stage-2`, then the bucket was created for you. (See `terraform.tfvars` in that directory for the name.) If it does not exist, make sure that you create your bucket $BUCKET_NAME in your project prior to running the script. You can do that with the following command:
```
gcloud storage buckets create gs://$BUCKET_NAME --location=BUCKET_LOCATION
```
@@ -20,7 +23,7 @@ Assumes in your environment you:
- have access to use google storage APIs via Application Default Credentials (ADC)

You may need to do the following:
- run "pip install google-cloud-storage" to install storage client library dependencies
- run "pip install wget google-cloud-storage" to install storage client library dependencies. (Optionally, you can run this within a venv, i.e. `python3 -m venv ./venv && source ./venv/bin/activate && pip install ...`)
- run "gcloud auth application-default login" to enable ADC

For more information on running the google cloud storage API, see https://cloud.google.com/python/docs/reference/storage
benchmarks/benchmark/tools/locust-load-inference/README.md (4 changes: 2 additions & 2 deletions)
@@ -58,7 +58,7 @@ The Locust workload requires storage.admin access to view the dataset in the giv
To give viewer permissions on the gcs bucket to the gcloud service account, run the following:

```
-gcloud storage buckets add-iam-policy-binding gs://$BUCKET/$DATASET_FILENAME
+gcloud storage buckets add-iam-policy-binding gs://$BUCKET
--member=serviceAccount:$GOOGLE_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
```

@@ -237,4 +237,4 @@ To change the benchmark configuration, you will have to rerun terraform destroy
| <a name="input_sax_model"></a> [sax\_model](#input\_sax\_model) | Benchmark server configuration for sax model. Only required if framework is sax. | `string` | `""` | no |
| <a name="input_tokenizer"></a> [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | yes |
| <a name="input_use_beam_search"></a> [use\_beam\_search](#input\_use\_beam\_search) | Benchmark server configuration for use beam search. | `bool` | `false` | no |
<!-- END_TF_DOCS -->
@@ -2,23 +2,24 @@ credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUM/locations/global/gkeMemberships/ai-benchmark"
}

project_id = "change-me"

namespace = "benchmark"
ksa = "benchmark-ksa"

# Locust service configuration
artifact_registry = "us-central1-docker.pkg.dev/$PROJECT_ID/ai-benchmark"
inference_server_service = "tgi" # inference server service name
locust_runner_kubernetes_service_account = "sample-runner-ksa"
output_bucket = "benchmark-output"
-gcs_path = "gs://ai-on-gke-benchmark/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"
+gcs_path = "gs://${PROJECT_ID}-ai-gke-benchmark-fuse/ShareGPT_V3_unfiltered_cleaned_split_filtered_prompts.txt"

# Benchmark configuration for Locust Docker accessing inference server
inference_server_framework = "tgi"
tokenizer = "tiiuae/falcon-7b"

# Benchmark configuration for triggering single test via Locust Runner
test_duration = 60
# Increase test_users to allow more parallelism (especially when testing HPA)
test_users = 1
test_rate = 5
@@ -30,6 +30,8 @@ cp sample-terraform.tfvars terraform.tfvars

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The `credentials_config` variable is required.

Optionally configure HPA (Horizontal Pod Autoscaling) by setting `hpa_type`. Note: GMP (Google Managed Prometheus) must be enabled on this cluster (which is the default) to scale based on custom metrics. See `autoscaling.md` for more details.

#### Determine number of gpus

`gpu_count` should be configured relative to the size of the model, with some overhead for the kv cache. Here's an example of figuring out how many GPUs you need to run a model:
@@ -0,0 +1,31 @@
# Autoscaling TGI

## tl;dr

Recommendation: TODO

## Autoscaling Options

### CPU

CPU scaling is a poor choice for this workload: the TGI workload starts up,
pulls the model weights, and then spends a minute or two of CPU time crunching
numbers. That startup burst causes the HPA to add a replica, whose own startup
burns more CPU, which in turn causes the HPA to add another replica, and so on.
Eventually things settle and the HPA scales the replicas back down, but the
whole process can take up to an hour.
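For reference, "scaling on CPU" means an HPA along the lines of the sketch
below. This is purely illustrative: the Deployment name, replica bounds, and
utilization target are placeholders, not values defined in this repo.

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tgi                      # placeholder; match your TGI Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi                    # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # illustrative threshold
```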

### Custom Metrics

Workload/custom metrics can be viewed in
https://console.cloud.google.com/monitoring/metrics-explorer. (Just search for
the metric name, e.g. "tgi_batch_current_size". The full name should be
"prometheus/tgi_batch_current_size/gauge".)

#### `tgi_batch_current_size`

TODO

### External Metrics

TODO
@@ -0,0 +1,31 @@
# Custom Metrics Stackdriver Adapter

Adapted from https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

## Usage

To use this module, include it from your main terraform config, e.g.:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
}
```

For a workload identity enabled cluster, some additional configuration is
needed:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
workload_identity = {
enabled = true
project_id = "<PROJECT_ID>"
}
}
```

# TODO

This module should be moved out of the text-generation-inference subdirectory,
as it should be more broadly applicable.