Gke triton benchmark #272

Merged · 28 commits · Mar 12, 2024
1 change: 1 addition & 0 deletions benchmarks/inference-server/README.md
@@ -3,6 +3,7 @@ Terraform templates associated with them.

The current supported options are:
- Text Generation Inference (aka TGI)
- TensorRT-LLM on Triton Inference Server

You may also choose to manually deploy your own inference server.

185 changes: 185 additions & 0 deletions benchmarks/inference-server/triton/README.md
@@ -0,0 +1,185 @@
# AI on GKE: Benchmark TensorRT LLM on Triton Server

This guide outlines the steps for deploying and benchmarking a TensorRT Large Language Model (LLM) on the Triton Inference Server within Google Kubernetes Engine (GKE). It includes the process of building a Docker container equipped with the TensorRT LLM engine and deploying this container to a GKE cluster.

## Prerequisites

Ensure you have the following prerequisites installed and set up (a quick verification sketch follows this list):
- Docker for containerization
- Google Cloud SDK for interacting with Google Cloud services
- Kubernetes CLI (kubectl) for managing Kubernetes clusters
- A Hugging Face account to access models
- NVIDIA GPU drivers and CUDA toolkit for building the TensorRT LLM Engine
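
A quick way to confirm these tools are available locally is sketched below; the exact versions you need depend on your TensorRT-LLM build, so treat this as a minimal sanity check rather than a requirement list:

```bash
# Quick sanity check for the local tooling (versions are illustrative, not required minimums)
docker --version
gcloud --version
kubectl version --client
nvidia-smi              # confirms the GPU driver is visible for engine builds
huggingface-cli whoami  # requires the Hugging Face CLI (pip install -U huggingface_hub) and a prior login
```
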
## Step 1: Build the TensorRT LLM Engine and Docker Image

1. **Build the TensorRT LLM Engine:** Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine.

2. **Upload the Model to Google Cloud Storage (GCS)**

Transfer your model engine to GCS using:
```bash
gsutil cp -r your_model_folder gs://your_model_repo/all_models/
```
*Replace `your_model_repo` with your actual GCS repository path. You can verify the upload as shown below.*
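
To confirm the engine and model repository landed where the Terraform module expects them, you can list the bucket contents (a minimal check; the bucket path is a placeholder):

```bash
# List the uploaded model repository recursively (placeholder bucket path)
gsutil ls -r gs://your_model_repo/all_models/ | head -n 20
```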

**Alternate method: Add the Model repository and the relevant scripts to the image**

Alternatively, construct a new image from NVIDIA's base image and copy the model repository and the necessary scripts directly into it, so GCS is not needed at runtime. In the `tritonllm_backend` directory, create a Dockerfile with the following content:
```Dockerfile
FROM nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

# Copy the necessary files into the container
COPY all_models /all_models
COPY scripts /opt/scripts

# Install Python dependencies
RUN pip install sentencepiece protobuf

RUN chmod +x /opt/scripts/start_triton_servers.sh

CMD ["/opt/scripts/start_triton_servers.sh"]
```

For the initialization script located at `/opt/scripts/start_triton_servers.sh`, follow the structure below:

```bash
#!/bin/bash

# Use the environment variable to log in to the Hugging Face CLI
huggingface-cli login --token $HUGGINGFACE_TOKEN

# Launch the servers (adjust this depending on the number of GPUs used and the exact path to the model repo)
mpirun --allow-run-as-root \
  -n 1 /opt/tritonserver/bin/tritonserver \
     --model-repository=/all_models/inflight_batcher_llm \
     --disable-auto-complete-config \
     --backend-config=python,shm-region-prefix-name=prefix0_ : \
  -n 1 /opt/tritonserver/bin/tritonserver \
     --model-repository=/all_models/inflight_batcher_llm \
     --disable-auto-complete-config \
     --backend-config=python,shm-region-prefix-name=prefix1_
```
*Build and Push the Docker Image:*

Build the Docker image and push it to your container registry:

```bash
docker build -t your_registry/tritonserver_llm:latest .
docker push your_registry/tritonserver_llm:latest
```
Substitute `your_registry` with your Docker registry's path.
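
If you push to Artifact Registry, you typically need to configure Docker credentials first, and you can optionally smoke-test the image on a GPU machine before deploying. This is a hedged sketch; the registry host, region, and token are placeholders:

```bash
# One-time: let Docker authenticate to your Artifact Registry host (placeholder region/host)
gcloud auth configure-docker us-central1-docker.pkg.dev

# Optional local smoke test on a machine with NVIDIA GPUs (placeholder registry path and token)
docker run --rm --gpus all \
  -e HUGGINGFACE_TOKEN="hf_xxx" \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  your_registry/tritonserver_llm:latest
```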




## Step 2: Create and Configure `terraform.tfvars`

1. **Initialize Terraform Variables:**

Start by creating a `terraform.tfvars` file from the provided template:

```bash
cp sample-terraform.tfvars terraform.tfvars
```

2. **Specify Essential Variables**

At a minimum, configure the `gcs_model_path` in `terraform.tfvars` to point to your Google Cloud Storage (GCS) repository. You may also need to adjust `image_path`, `gpu_count`, and `server_launch_command_string` according to your specific requirements.
```bash
gcs_model_path = "gs://path_to_model_repo/all_models"
```

If `gcs_model_path` is not set, the module assumes you are using the direct image-integration method described above. In that case, update `image_path` (and any other relevant variables) accordingly; a fuller `terraform.tfvars` sketch follows this step:

```bash
image_path = "path_to_your_registry/tritonserver_llm:latest"
```
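
Putting the pieces together, a complete `terraform.tfvars` for the GCS-based flow might look like the sketch below. All values are placeholders (project number, cluster name, bucket path), and `server_launch_command_string` is left at the module default:

```bash
# Write a terraform.tfvars sketch for the GCS-based deployment (all values are placeholders)
cat > terraform.tfvars <<'EOF'
credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/123456789/locations/global/gkeMemberships/my-cluster"
}
namespace          = "default"
ksa                = "default"
gcs_model_path     = "gs://your_model_repo/all_models"
gpu_count          = 1
huggingface_secret = "huggingface-secret"
EOF
```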

3. **Configure Your Deployment:**

Edit `terraform.tfvars` to tailor your deployment's configuration, particularly the `credentials_config`. Depending on your cluster's setup, this might involve specifying fleet host credentials or isolating your cluster's kubeconfig:

***For Fleet Management Enabled Clusters:***
```bash
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```
***For Clusters without Fleet Management:***
Write your cluster's kubeconfig to a dedicated file:
```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

And then update your `terraform.tfvars` accordingly:

```bash
credentials_config = {
kubeconfig = {
path = "~/.kube/${CLUSTER_NAME}-kube.config"
}
}
```
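
To fill in the placeholders above, you can look up the project number used in the `fleet_host` URL and confirm that the dedicated kubeconfig actually reaches the cluster. This is a hedged sketch; `PROJECT_ID` and `CLUSTER_NAME` are placeholders you set in your shell:

```bash
# Resolve the project number used in the fleet_host URL (PROJECT_ID is a placeholder)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
echo "https://connectgateway.googleapis.com/v1/projects/${PROJECT_NUMBER}/locations/global/gkeMemberships/${CLUSTER_NAME}"

# Confirm the dedicated kubeconfig can reach the cluster (non-fleet path)
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config kubectl get nodes
```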

#### [optional] Setting Up Secret Tokens

A model may require a security token to access it. For example, Llama2 from HuggingFace is a gated model that requires a [user access token](https://huggingface.co/docs/hub/en/security-tokens). If the model you want to run does not require this, skip this step.

If you followed the steps in `../../infra/stage-2`, Secret Manager and the user access token should already be set up. Alternatively, you can create a Kubernetes Secret to store your Hugging Face CLI token. You can do this from the command line with kubectl:
```bash
kubectl create secret generic huggingface-secret --from-literal=token='************'
```

This command creates a Secret named `huggingface-secret` with a key named `token` that stores your Hugging Face CLI token. Note that for production or shared environments, storing user access tokens directly as literals is not advisable.
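
To confirm the Secret was created with the expected key without revealing the token value, a minimal check:

```bash
# Lists the Secret's keys and sizes, not the token itself
kubectl describe secret huggingface-secret
```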


## Step 3: Log in to gcloud

Run the following gcloud command for authorization:

```bash
gcloud auth application-default login
```
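
If you want to confirm that application default credentials are in place before running Terraform, an optional, hedged check:

```bash
# Succeeds only if application default credentials can mint a token
gcloud auth application-default print-access-token > /dev/null && echo "ADC is configured"
```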

## Step 4: Terraform initialize, plan, and apply

Run the following terraform commands:

```bash
# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
```
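
After `terraform apply` completes, you can check that the Deployment created by this module is running and that Triton reports ready. The Deployment name comes from the manifest templates in this module (`triton-infer` for the direct-image flow, `triton-infer-gcs` for the GCS flow); the namespace below assumes the default:

```bash
# Watch the pods come up (use triton-infer instead of triton-infer-gcs for the direct-image flow)
kubectl get pods -n default -l app=triton-infer-gcs -w

# Port-forward and hit Triton's readiness endpoint
kubectl port-forward -n default deployment/triton-infer-gcs 8000:8000 &
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```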

# Variables

| Variable | Description | Type | Default | Nullable |
|----------------------|-----------------------------------------------------------------------------------------------|---------|-------------------------------------------|----------|
| `credentials_config` | Configure how Terraform authenticates to the cluster. | Object | | No |
| `namespace` | Namespace used for the Triton deployment and related resources. | String | `"default"` | No |
| `image_path` | Image path stored in Artifact Registry. | String | `"nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"` | No |
| `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No |
| `gpu_count` | Parallelism based on the number of GPUs. | Number | `1` | No |
| `ksa` | Kubernetes Service Account used for the workload. | String | `"default"` | No |
| `huggingface_secret` | Name of the Kubernetes Secret that holds the Hugging Face token. | String | `"huggingface-secret"` | Yes |
| `gcs_model_path` | GCS path from which the model engine is read. | String | `null` | Yes |
| `server_launch_command_string` | Command used to launch the Triton Inference Server. | String | `"pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"` | No |




## Notes

- The `credentials_config` variable is an object that may contain either a `fleet_host` or a `kubeconfig` (not both), each with optional fields, so exactly one authentication method is configured.
- The manifest template is selected automatically: the GCS-download template is used when `gcs_model_path` is set, and the direct-image template is used when it is `null`.
- Default values are provided for several variables (`namespace`, `model_id`, `gpu_count`, `ksa`, and `huggingface_secret`), giving sensible defaults for most scenarios.
- `nullable = false` on most variables means they must either be provided explicitly or have a default value, ensuring crucial configuration is not omitted.
37 changes: 37 additions & 0 deletions benchmarks/inference-server/triton/main.tf
@@ -0,0 +1,37 @@
/**
* Copyright 2024 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

locals {
  template_path = (
    var.gcs_model_path == null
    ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl"
    : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
  )
}

resource "kubernetes_manifest" "default" {
  manifest = yamldecode(templatefile(local.template_path, {
    namespace                    = var.namespace
    ksa                          = var.ksa
    image_path                   = var.image_path
    huggingface_secret           = var.huggingface_secret
    gpu_count                    = var.gpu_count
    model_id                     = var.model_id
    gcs_model_path               = var.gcs_model_path
    server_launch_command_string = var.server_launch_command_string
  }))
}
62 changes: 62 additions & 0 deletions benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl
@@ -0,0 +1,62 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-infer
  namespace: ${namespace}
  labels:
    app: triton-infer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-infer
  template:
    metadata:
      labels:
        app: triton-infer
    spec:
      serviceAccountName: ${ksa}
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 1Gi
      containers:
      - name: triton-inference
        image: "${image_path}"
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        env:
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
              name: ${huggingface_secret}
              key: token
        resources:
          limits:
            nvidia.com/gpu: ${gpu_count} # number of GPUs allocated to the workload
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm

77 changes: 77 additions & 0 deletions benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
@@ -0,0 +1,77 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-infer-gcs
  namespace: ${namespace}
  labels:
    app: triton-infer-gcs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-infer-gcs
  template:
    metadata:
      labels:
        app: triton-infer-gcs
    spec:
      serviceAccountName: ${ksa}
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 1Gi
      - name: all-models-volume
        emptyDir: {}
      initContainers:
      - name: init-gcs-download
        image: google/cloud-sdk
        command: ["/bin/sh", "-c"]
        args:
        - gsutil cp -r ${gcs_model_path} ./;
        volumeMounts:
        - name: all-models-volume
          mountPath: /all_models
      containers:
      - name: triton-inference
        image: "${image_path}"
        workingDir: /opt/tritonserver
        command: ["/bin/bash", "-c"]
        args: ["${server_launch_command_string}"]
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        env:
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
              name: ${huggingface_secret}
              key: token
        resources:
          limits:
            nvidia.com/gpu: ${gpu_count} # number of GPUs allocated to the workload
        volumeMounts:
        - name: all-models-volume
          mountPath: /all_models
        - mountPath: /dev/shm
          name: dshm