diff --git a/benchmarks/inference-server/README.md b/benchmarks/inference-server/README.md
index 217608420..8dc144fe6 100644
--- a/benchmarks/inference-server/README.md
+++ b/benchmarks/inference-server/README.md
@@ -3,6 +3,7 @@
 Terraform templates associated with them. The current supported options are:
 - Text Generation Inference (aka TGI)
+- TensorRT-LLM on Triton Inference Server
 
 You may also choose to manually deploy your own inference server.
diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
new file mode 100644
index 000000000..06f067f33
--- /dev/null
+++ b/benchmarks/inference-server/triton/README.md
@@ -0,0 +1,185 @@
# AI on GKE: Benchmark TensorRT LLM on Triton Server

This guide outlines the steps for deploying and benchmarking a TensorRT Large Language Model (LLM) on the Triton Inference Server within Google Kubernetes Engine (GKE). It covers building a Docker container that includes the TensorRT LLM engine and deploying that container to a GKE cluster.

## Prerequisites

Ensure you have the following prerequisites installed and set up:
- Docker for containerization
- Google Cloud SDK for interacting with Google Cloud services
- Kubernetes CLI (kubectl) for managing Kubernetes clusters
- A Hugging Face account to access models
- NVIDIA GPU drivers and CUDA toolkit for building the TensorRT LLM engine

## Step 1: Build the TensorRT LLM Engine and Docker Image

1. **Build the TensorRT LLM Engine:** Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine.

2. **Upload the Model to Google Cloud Storage (GCS)**

   Transfer your model engine to GCS using:
   ```
   gsutil cp -r your_model_folder gs://your_model_repo/all_models/
   ```
   *Replace `your_model_repo` with your actual GCS repository path.*

   **Alternate method: Add the model repository and the relevant scripts to the image**

   Construct a new image from Nvidia's base image and integrate the model repository and necessary scripts directly into it, bypassing the need for GCS at runtime.
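   As a sketch, the build context for this approach might be laid out as follows (the `all_models/inflight_batcher_llm` model repository comes from the engine-build step above, and `scripts/start_triton_servers.sh` is the launch script shown below; your exact engine and tokenizer files will differ):

   ```
   tritonllm_backend/
     Dockerfile
     all_models/
       inflight_batcher_llm/     # Triton model repository containing the built engine
     scripts/
       start_triton_servers.sh
   ```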
   In the `tritonllm_backend` directory, create a Dockerfile with the following content:

   ```Dockerfile
   FROM nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

   # Copy the necessary files into the container
   COPY all_models /all_models
   COPY scripts /opt/scripts

   # Install Python dependencies
   RUN pip install sentencepiece protobuf

   RUN chmod +x /opt/scripts/start_triton_servers.sh

   CMD ["/opt/scripts/start_triton_servers.sh"]
   ```

   For the initialization script located at `/opt/scripts/start_triton_servers.sh`, follow the structure below:

   ```bash
   #!/bin/bash

   # Use the environment variable to log in to Hugging Face CLI
   huggingface-cli login --token $HUGGINGFACE_TOKEN

   # Launch the servers (modify this depending on the number of GPUs used and the exact path to the model repo)
   mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver \
     --model-repository=/all_models/inflight_batcher_llm \
     --disable-auto-complete-config \
     --backend-config=python,shm-region-prefix-name=prefix0_ : \
     -n 1 /opt/tritonserver/bin/tritonserver \
     --model-repository=/all_models/inflight_batcher_llm \
     --disable-auto-complete-config \
     --backend-config=python,shm-region-prefix-name=prefix1_ :
   ```

   **Build and Push the Docker Image:**

   Build the Docker image and push it to your container registry:

   ```
   docker build -t your_registry/tritonserver_llm:latest .
   docker push your_registry/tritonserver_llm:latest
   ```
   Substitute `your_registry` with your Docker registry's path.

## Step 2: Create and Configure `terraform.tfvars`

1. **Initialize Terraform Variables:**

   Start by creating a `terraform.tfvars` file from the provided template:

   ```bash
   cp sample-terraform.tfvars terraform.tfvars
   ```

2. **Specify Essential Variables**

   At a minimum, set `gcs_model_path` in `terraform.tfvars` to point to your Google Cloud Storage (GCS) repository. You may also need to adjust `image_path`, `gpu_count`, and `server_launch_command_string` to match your specific requirements.

   ```bash
   gcs_model_path = "gs://path_to_model_repo/all_models"
   ```

   If `gcs_model_path` is not set, the deployment assumes you are using the direct image-integration method described above. In that case, update `image_path` (and any other relevant variables) accordingly:

   ```bash
   image_path = "path_to_your_registry/tritonserver_llm:latest"
   ```

3. **Configure Your Deployment:**

   Edit `terraform.tfvars` to tailor your deployment's configuration, particularly `credentials_config`. Depending on your cluster's setup, this involves specifying either fleet host credentials or a separate kubeconfig for your cluster:

   ***For Fleet Management Enabled Clusters:***
```bash
credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```
   ***For Clusters without Fleet Management:***
Write your cluster's kubeconfig to a separate file:
```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

And then update your `terraform.tfvars` accordingly:

```bash
credentials_config = {
  kubeconfig = {
    path = "~/.kube/${CLUSTER_NAME}-kube.config"
  }
}
```

#### [optional] Setting Up Secret Tokens

A model may require a security token to access it.
For example, Llama2 from HuggingFace is a gated model that requires a [user access token](https://huggingface.co/docs/hub/en/security-tokens). If the model you want to run does not require this, skip this step.

If you followed the steps from `../../infra/stage-2`, Secret Manager and the user access token should already be set up. Alternatively, you can create a Kubernetes Secret to store your Hugging Face CLI token. You can do this from the command line with kubectl:
```bash
kubectl create secret generic huggingface-secret --from-literal=token='************'
```

This command creates a new Secret named `huggingface-secret` with a key named `token` that holds your Hugging Face CLI token. Note that for production or shared environments, storing user access tokens directly as literals is not advisable.


## Step 3: Log in to gcloud

Run the following gcloud command for authorization:

```bash
gcloud auth application-default login
```

## Step 4: Terraform initialize, plan and apply

Run the following terraform commands:

```bash
# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
```

## Variables

| Variable | Description | Type | Default | Nullable |
|----------------------|-----------------------------------------------------------------------------------------------|---------|-------------------------------------------|----------|
| `credentials_config` | Configure how Terraform authenticates to the cluster. | Object | | No |
| `namespace` | Namespace used for Nvidia Triton resources. | String | `"default"` | No |
| `image_path` | Image path stored in Artifact Registry. | String | `"nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"` | No |
| `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No |
| `gpu_count` | Parallelism based on number of GPUs. | Number | `1` | No |
| `ksa` | Kubernetes Service Account used for the workload. | String | `"default"` | No |
| `huggingface_secret` | Name of the Kubernetes Secret holding the Hugging Face token. | String | `"huggingface-secret"` | Yes |
| `gcs_model_path` | GCS path from which the model engine is read. | String | `null` | Yes |
| `server_launch_command_string` | Command to launch the Triton Inference Server. | String | `"pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"` | No |

## Notes

- The `credentials_config` variable is an object that may contain either a `fleet_host` or a `kubeconfig` (not both), each with optional fields. This ensures that exactly one method of authentication is configured.
- The `gcs_model_path` variable can be set to `null` (its default) when the model repository is baked into the image rather than read from GCS.
- Default values are provided for several variables (`namespace`, `model_id`, `gpu_count`, `ksa`, and `huggingface_secret`), ensuring ease of use and sensible defaults for most scenarios.
- The `nullable = false` on most variables indicates they must either be explicitly provided or have a default value, ensuring crucial configuration is not omitted.
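## Verify the Deployment (optional)

Once `terraform apply` finishes, you can sanity-check the server before running benchmarks. The sketch below assumes the sample `benchmark` namespace and the `triton-infer` Deployment created by the bundled manifest template (`triton-infer-gcs` when the model is loaded from GCS), and a request shape matching the TensorRT-LLM backend's `inflight_batcher_llm` ensemble model; adjust names and fields for your own model repository.

```bash
# Confirm the Triton pod is running, then forward its HTTP port locally.
kubectl get pods -n benchmark -l app=triton-infer
kubectl port-forward -n benchmark deployment/triton-infer 8000:8000

# In a second terminal, send a minimal generate request to the ensemble model.
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```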
diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf new file mode 100644 index 000000000..5832ad609 --- /dev/null +++ b/benchmarks/inference-server/triton/main.tf @@ -0,0 +1,37 @@ +/** + * Copyright 2024 Google LLC + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +locals { + + template_path = ( + var.gcs_model_path == null + ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl" + : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl" + ) +} + +resource "kubernetes_manifest" "default" { + manifest = yamldecode(templatefile(local.template_path, { + namespace = var.namespace + ksa = var.ksa + image_path = var.image_path + huggingface_secret = var.huggingface_secret + gpu_count = var.gpu_count + model_id = var.model_id + gcs_model_path = var.gcs_model_path + server_launch_command_string = var.server_launch_command_string + })) +} diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl new file mode 100644 index 000000000..fd45a5c46 --- /dev/null +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl @@ -0,0 +1,62 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
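
# Note: placeholders in this template (namespace, ksa, image_path,
# huggingface_secret, gpu_count) are substituted by Terraform's templatefile()
# call in main.tf using the values defined in variables.tf; they are not meant
# to be edited by hand.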

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-infer
  namespace: ${namespace}
  labels:
    app: triton-infer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-infer
  template:
    metadata:
      labels:
        app: triton-infer
    spec:
      serviceAccountName: ${ksa}
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 1Gi
      containers:
      - name: triton-inference
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        image: "${image_path}"
        env:
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
              name: ${huggingface_secret}
              key: token
        resources:
          limits:
            nvidia.com/gpu: ${gpu_count} # number of GPUs allocated to the workload
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm

diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
new file mode 100644
index 000000000..a04f969f1
--- /dev/null
+++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
@@ -0,0 +1,77 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
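
# Note: placeholders in this template (namespace, ksa, image_path,
# huggingface_secret, gpu_count, gcs_model_path, server_launch_command_string)
# are substituted by Terraform's templatefile() call in main.tf; they are not
# meant to be edited by hand. An init container first copies the model
# repository from GCS into a shared emptyDir volume, then the main container
# launches Triton with the configured launch command.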

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-infer-gcs
  namespace: ${namespace}
  labels:
    app: triton-infer-gcs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-infer-gcs
  template:
    metadata:
      labels:
        app: triton-infer-gcs
    spec:
      serviceAccountName: ${ksa}
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 1Gi
      - name: all-models-volume
        emptyDir: {}
      initContainers:
      - name: init-gcs-download
        image: google/cloud-sdk
        command: ["/bin/sh", "-c"]
        args:
        - gsutil cp -r ${gcs_model_path} ./;
        volumeMounts:
        - name: all-models-volume
          mountPath: /all_models
      containers:
      - name: triton-inference
        image: "${image_path}"
        workingDir: /opt/tritonserver
        command: ["/bin/bash", "-c"]
        args: ["${server_launch_command_string}"]
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        env:
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
              name: ${huggingface_secret}
              key: token
        resources:
          limits:
            nvidia.com/gpu: ${gpu_count} # number of GPUs allocated to the workload
        volumeMounts:
        - name: all-models-volume
          mountPath: /all_models
        - mountPath: /dev/shm
          name: dshm
\ No newline at end of file
diff --git a/benchmarks/inference-server/triton/providers.tf b/benchmarks/inference-server/triton/providers.tf
new file mode 100644
index 000000000..70c82e817
--- /dev/null
+++ b/benchmarks/inference-server/triton/providers.tf
@@ -0,0 +1,36 @@
/**
 * Copyright 2024 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

data "google_client_config" "identity" {
  count = var.credentials_config.fleet_host != null ? 1 : 0
}

provider "kubernetes" {
  config_path = (
    var.credentials_config.kubeconfig == null
    ? null
    : pathexpand(var.credentials_config.kubeconfig.path)
  )
  config_context = try(
    var.credentials_config.kubeconfig.context, null
  )
  host = (
    var.credentials_config.fleet_host == null
    ?
null + : var.credentials_config.fleet_host + ) + token = try(data.google_client_config.identity.0.access_token, null) +} diff --git a/benchmarks/inference-server/triton/sample-terraform.tfvars b/benchmarks/inference-server/triton/sample-terraform.tfvars new file mode 100644 index 000000000..ec0438c13 --- /dev/null +++ b/benchmarks/inference-server/triton/sample-terraform.tfvars @@ -0,0 +1,9 @@ +credentials_config = { + fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/ai-benchmark" +} + +namespace = "benchmark" +ksa = "benchmark-ksa" +model_id = "meta-llama/Llama-2-7b-chat-hf" +gpu_count = 1 +gcs_model_path = "" \ No newline at end of file diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf new file mode 100644 index 000000000..c481ef1b8 --- /dev/null +++ b/benchmarks/inference-server/triton/variables.tf @@ -0,0 +1,91 @@ +/** + * Copyright 2024 Google LLC + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +variable "credentials_config" { + description = "Configure how Terraform authenticates to the cluster." + type = object({ + fleet_host = optional(string) + kubeconfig = optional(object({ + context = optional(string) + path = optional(string, "~/.kube/config") + })) + }) + nullable = false + validation { + condition = ( + (var.credentials_config.fleet_host != null) != + (var.credentials_config.kubeconfig != null) + ) + error_message = "Exactly one of fleet host or kubeconfig must be set." + } +} + +variable "namespace" { + description = "Namespace used for Nvidia Triton resources." + type = string + nullable = false + default = "default" +} + +variable "image_path" { + description = "Image Path stored in Artifact Registry" + type = string + nullable = false + default = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3" +} + +variable "model_id" { + description = "Model used for inference." + type = string + nullable = false + default = "meta-llama/Llama-2-7b-chat-hf" +} + +variable "gpu_count" { + description = "Parallelism based on number of gpus." + type = number + nullable = false + default = 1 +} + +variable "ksa" { + description = "Kubernetes Service Account used for workload." 
  type        = string
  nullable    = false
  default     = "default"
}

variable "huggingface_secret" {
  description = "Name of the Kubernetes Secret holding the Hugging Face token."
  type        = string
  nullable    = true
  default     = "huggingface-secret"
}

variable "gcs_model_path" {
  description = "GCS path from which the model engine is read."
  type        = string
  nullable    = true
  default     = null
}

variable "server_launch_command_string" {
  description = "Command used to launch the Triton Inference Server."
  type        = string
  nullable    = true
  default     = "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"
}