From 46107bbe6fdd10ce628ea35df78193f0c664e316 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:37:23 +0000 Subject: [PATCH 01/28] adding terraform scripts for triton + tensorrtllm under benchmark/inference server --- benchmarks/inference-server/triton/README.md | 194 ++++++++++++++++++ benchmarks/inference-server/triton/main.tf | 26 +++ .../triton-tensorrtllm-inference-gs.tftpl | 81 ++++++++ .../triton-tensorrtllm-inference.tftpl | 62 ++++++ .../inference-server/triton/providers.tf | 36 ++++ .../triton/sample-terraform.tfvars | 10 + .../inference-server/triton/variables.tf | 82 ++++++++ 7 files changed, 491 insertions(+) create mode 100644 benchmarks/inference-server/triton/README.md create mode 100644 benchmarks/inference-server/triton/main.tf create mode 100644 benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl create mode 100644 benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl create mode 100644 benchmarks/inference-server/triton/providers.tf create mode 100644 benchmarks/inference-server/triton/sample-terraform.tfvars create mode 100644 benchmarks/inference-server/triton/variables.tf diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md new file mode 100644 index 000000000..1c42bf323 --- /dev/null +++ b/benchmarks/inference-server/triton/README.md @@ -0,0 +1,194 @@ +# AI on GKE: Benchmark TensorRT LLM on Triton Server + +This guide provides instructions for deploying and benchmarking a TensorRT Large Language Model (LLM) on Triton Inference Server within a Google Kubernetes Engine (GKE) environment. The process involves building a Docker image with the TensorRT LLM engine and deploying it to a GKE cluster. + +## Prerequisites + +- Docker +- Google Cloud SDK +- Kubernetes CLI (kubectl) +- Hugging Face account for model access +- NVIDIA GPU drivers and CUDA toolkit (to build the TensorRTLLM Engine) + +## Step 1: Build the TensorRT LLM Engine and Docker Image + +1. **Build the TensorRT LLM Engine:** Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine. + +2. **Setup the Docker image:** + + ***Method 1: Add the Model repository and the relevant scripts to the image*** + Inside the `tritonllm_backend` directory, create a Dockerfile with the following content ensuring the Triton Server is ready with the necessary models and scripts. 
+ + ```Dockerfile + FROM nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 + + # Copy the necessary files into the container + COPY all_models /all_models + COPY scripts /opt/scripts + + # Install Python dependencies + RUN pip install sentencepiece protobuf + + RUN chmod +x /opt/scripts/start_triton_servers.sh + + CMD ["/opt/scripts/start_triton_servers.sh"] + ``` + + The Shell script `/opt/scripts/start_triton_servers.sh` is like below: + + ```start_triton_servers.sh + #!/bin/bash + + + # Use the environment variable to log in to Hugging Face CLI + huggingface-cli login --token $HUGGINGFACE_TOKEN + + + + + # Launch the servers (modify this depending on the number of GPU used and the exact path to the model repo) + mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver \ + --model-repository=/all_models/inflight_batcher_llm \ + --disable-auto-complete-config \ + --backend-config=python,shm-region-prefix-name=prefix0_ : \ + -n 1 /opt/tritonserver/bin/tritonserver \ + --model-repository=/all_models/inflight_batcher_llm \ + --disable-auto-complete-config \ + --backend-config=python,shm-region-prefix-name=prefix1_ :``` + ``` + *Build and Push the Docker Image:* + + Build the Docker image and push it to your container registry: + + ``` + docker build -t your_registry/tritonserver_llm:latest . + docker push your_registry/tritonserver_llm:latest + + ``` + Replace `your_registry` with your actual Docker registry path. + ***Method 2: Upload Model repository and the relevant scripts to gcs*** + In this method we can directly upload the model engine and scripts to gcs + ``` + gsutil cp -r gs://your_model_repo/scripts/ ./ + gsutil cp -r gs://your_model_repo/all_models/ ./ + ``` + Replace `your_model_repo` with your actual gcs repo path. + + + + +## Step 2: Create and Configure `terraform.tfvars` + +1. **Initialize Terraform Variables:** + + Create a `terraform.tfvars` file by copying the provided example: + + ```bash + cp sample-terraform.tfvars terraform.tfvars + ``` + +2. Define `template_path` variable in `terraform.tfvars` + If usinng method 1 In Step 1 above: + `` + template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl" + `` + If using method 2 In Step 1 above: + + `` + template_path = "path_to_manifest_template/triton-tensorrtllm-inference-gs.tftpl" + `` + and also update the `triton-tensorrtllm-inference-gs.tftpl` with the path to the gcs repo under InitContainer and the command to launch the container under Container section. + +3. **Configure Your Deployment:** + + Edit the `terraform.tfvars` file to include your specific configuration details. At a minimum, you must specify the `credentials_config` for accessing Google Cloud resources and the `image_path` to your Docker image in the registry. Update the gpu_count to make it consistent with the TensorRTLLM Engine configuration when it was [built](https://github.com/NVIDIA/TensorRT-LLM/tree/0f041b7b57d118037f943fd6107db846ed147b91/examples/llama). 
+ +Example `terraform.tfvars` content: + + ```bash + credentials_config = { + kubeconfig = "path/to/your/gcloud/credentials.json" + } + image_path = "your_registry/tritonserver_llm:latest" + gpu_count = 2 + ``` + +####[optional] set-up credentials config with kubeconfig + +If you created your cluster with steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials like this: +```bash +credentials_config = { + fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME" +} +``` + + +If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the `credentials_config`. You must isolate your cluster's kubeconfig from other clusters in the default kube.config file. To do this, run the following command: + +```bash +KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION +``` + +Then update your `terraform.tfvars` `credentials_config` to the following: + +```bash +credentials_config = { + kubeconfig = { + path = "~/.kube/${CLUSTER_NAME}-kube.config" + } +} +``` + +#### [optional] set up secret token in Secret Manager + +A model may require a security token to access it. For example, Llama2 from HuggingFace is a gated model that requires a [user access token](https://huggingface.co/docs/hub/en/security-tokens). If the model you want to run does not require this, skip this step. + +If you followed steps from `../../infra/stage-2`, Secret Manager and the user access token should already be set up. Alternatively you can create a Kubernetes Secret to store your Hugging Face CLI token. You can do this from the command line with kubectl: +```bash +kubectl create secret generic huggingface-secret --from-literal=token='************' +``` + +This command creates a new Secret named huggingface-secret, which has a key token containing your Hugging Face CLI token. + +## Step 3: login to gcloud + +Run the following gcloud command for authorization: + +```bash +gcloud auth application-default login +``` + +## Step 4: terraform initialize, plan and apply + +Run the following terraform commands: + +```bash +# initialize terraform +terraform init + +# verify changes +terraform plan + +# apply changes +terraform apply +``` + +# Variables + +| Variable | Description | Type | Default | Nullable | +|----------------------|-----------------------------------------------------------------------------------------------|---------|-------------------------------------------|----------| +| `credentials_config` | Configure how Terraform authenticates to the cluster. | Object | | No | +| `namespace` | Namespace used for Nvidia DCGM resources. | String | `"default"` | No | +| `image_path` | Image Path stored in Artifact Registry | String | | No | +| `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No | +| `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No | +| `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No | +| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | No | +| `templates_path` | Path where manifest templates will be read from. 
Set to null to use the default manifests | String | `null` | Yes | + +## Notes + +- The `credentials_config` variable is an object that may contain either a `fleet_host` or `kubeconfig` (not both), each with optional fields. This ensures either one method of authentication is configured. +- The `templates_path` variable can be set to `null` to use default manifests, indicating its nullable nature and the use of a default value. +- Default values are provided for several variables (`namespace`, `model_id`, `gpu_count`, `ksa`, and `huggingface-secret`), ensuring ease of use and defaults that make sense for most scenarios. +- The `nullable = false` for most variables indicates they must explicitly be provided or have a default value, ensuring crucial configuration is not omitted. diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf new file mode 100644 index 000000000..a7beb399d --- /dev/null +++ b/benchmarks/inference-server/triton/main.tf @@ -0,0 +1,26 @@ +/** + * Copyright 2023 Google LLC + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +resource "kubernetes_manifest" "default" { + manifest = yamldecode(templatefile(var.template_path, { + namespace = var.namespace + ksa = var.ksa + image_path = var.image_path + huggingface-secret = var.huggingface-secret + gpu_count = var.gpu_count + model_id = var.model_id + })) +} diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl new file mode 100644 index 000000000..645da0841 --- /dev/null +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl @@ -0,0 +1,81 @@ +# Copyright 20234Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: triton-infer-gcs + namespace: ${namespace} # Replace ${namespace} with your actual namespace + labels: + app: triton-infer-gcs +spec: + replicas: 1 + selector: + matchLabels: + app: triton-infer-gcs + template: + metadata: + labels: + app: triton-infer-gcs + spec: + serviceAccountName: ${ksa} # Replace ${ksa} with your Kubernetes Service Account name + volumes: + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 1Gi + - name: all-models-volume + emptyDir: {} + - name: scripts-volume + emptyDir: {} + initContainers: + - name: init-gcs-download + image: google/cloud-sdk + serviceAccountName: ${ksa} + command: ["/bin/sh", "-c"] + args: + - gsutil cp -r gs://your_gcs_repo/all_models ./ && gsutil cp -r gs://your_gcs_repo/scripts ./; + volumeMounts: + - name: all-models-volume + mountPath: /all_models + - name: scripts-volume + mountPath: /scripts + containers: + - name: triton-inference + image: "${image_path}" # Replace ${image_path} with the actual image path + workingDir: /opt/tritonserver + command: ["/bin/bash", "-c"] + args: ["pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && python /scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1"] + ports: + - containerPort: 8000 + name: http-triton + - containerPort: 8001 + name: grpc-triton + - containerPort: 8002 + name: metrics-triton + env: + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: ${huggingface-secret} # Replace ${huggingface-secret} with your secret's name + key: token + resources: + limits: + nvidia.com/gpu: ${gpu_count} # Replace ${gpu_count} with the number of GPUs + serviceAccountName: ${ksa} + volumeMounts: + - name: all-models-volume + mountPath: /all_models + - name: scripts-volume + mountPath: /scripts diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl new file mode 100644 index 000000000..47effa9ee --- /dev/null +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl @@ -0,0 +1,62 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: triton-infer + namespace: ${namespace} + labels: + app: triton-infer +spec: + replicas: 1 + selector: + matchLabels: + app: triton-infer + template: + metadata: + labels: + app: triton-infer + spec: + serviceAccountName: ${ksa} + volumes: + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 1Gi + containers: + - name: triton-inference + ports: + - containerPort: 8000 + name: http-triton + - containerPort: 8001 + name: grpc-triton + - containerPort: 8002 + name: metrics-triton + + image: "${image_path}" + env: + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: ${huggingface-secret} + key: token + resources: + limits: + nvidia.com/gpu: ${gpu_count} # number of gpu's allocated to workload + serviceAccountName: ${ksa} + volumeMounts: + - mountPath: /dev/shm + name: dshm + diff --git a/benchmarks/inference-server/triton/providers.tf b/benchmarks/inference-server/triton/providers.tf new file mode 100644 index 000000000..70c82e817 --- /dev/null +++ b/benchmarks/inference-server/triton/providers.tf @@ -0,0 +1,36 @@ +/** + * Copyright 2024 Google LLC + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +data "google_client_config" "identity" { + count = var.credentials_config.fleet_host != null ? 1 : 0 +} + +provider "kubernetes" { + config_path = ( + var.credentials_config.kubeconfig == null + ? null + : pathexpand(var.credentials_config.kubeconfig.path) + ) + config_context = try( + var.credentials_config.kubeconfig.context, null + ) + host = ( + var.credentials_config.fleet_host == null + ? null + : var.credentials_config.fleet_host + ) + token = try(data.google_client_config.identity.0.access_token, null) +} diff --git a/benchmarks/inference-server/triton/sample-terraform.tfvars b/benchmarks/inference-server/triton/sample-terraform.tfvars new file mode 100644 index 000000000..4b9a4029d --- /dev/null +++ b/benchmarks/inference-server/triton/sample-terraform.tfvars @@ -0,0 +1,10 @@ +credentials_config = { + fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/ai-benchmark" +} + +namespace = "benchmark" +ksa = "benchmark-ksa" +model_id = "meta-llama/Llama-2-7b-chat-hf" +gpu_count = 1 +image_path = "" +template_path = "" \ No newline at end of file diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf new file mode 100644 index 000000000..2e675766d --- /dev/null +++ b/benchmarks/inference-server/triton/variables.tf @@ -0,0 +1,82 @@ +/** + * Copyright 2023 Google LLC + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +variable "credentials_config" { + description = "Configure how Terraform authenticates to the cluster." + type = object({ + fleet_host = optional(string) + kubeconfig = optional(object({ + context = optional(string) + path = optional(string, "~/.kube/config") + })) + }) + nullable = false + validation { + condition = ( + (var.credentials_config.fleet_host != null) != + (var.credentials_config.kubeconfig != null) + ) + error_message = "Exactly one of fleet host or kubeconfig must be set." + } +} + +variable "namespace" { + description = "Namespace used for Nvidia DCGM resources." + type = string + nullable = false + default = "default" +} + +variable "image_path" { + description = "Image Path stored in Artifact Registry" + type = string + nullable = false +} + +variable "model_id" { + description = "Model used for inference." + type = string + nullable = false + default = "meta-llama/Llama-2-7b-chat-hf" +} + +variable "gpu_count" { + description = "Parallelism based on number of gpus." + type = number + nullable = false + default = 1 +} + +variable "ksa" { + description = "Kubernetes Service Account used for workload." + type = string + nullable = false + default = "default" +} + +variable "huggingface-secret" { + description = "name of the kubectl huggingface secret token" + type = string + nullable = true + default = "huggingface-secret" +} + +variable "template_path" { + description = "Path where manifest templates will be read from. Set to null to use the default manifests" + type = string + default = null +} + From 6be4683c6eb67adbb3f16a149275e98784e3956a Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:43:47 +0000 Subject: [PATCH 02/28] fix readme --- benchmarks/inference-server/triton/README.md | 2 +- benchmarks/inference-server/triton/variables.tf | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 1c42bf323..485265863 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -184,7 +184,7 @@ terraform apply | `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No | | `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No | | `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | No | -| `templates_path` | Path where manifest templates will be read from. Set to null to use the default manifests | String | `null` | Yes | +| `templates_path` | Path where manifest templates will be read from. | String | | No | ## Notes diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf index 2e675766d..fbeb6d269 100644 --- a/benchmarks/inference-server/triton/variables.tf +++ b/benchmarks/inference-server/triton/variables.tf @@ -75,8 +75,8 @@ variable "huggingface-secret" { } variable "template_path" { - description = "Path where manifest templates will be read from. 
Set to null to use the default manifests" + description = "Path where manifest templates will be read from." type = string - default = null + nullable = false } From a8d5c2ca118f28d9376602282731d3dca61a9451 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:45:46 +0000 Subject: [PATCH 03/28] fix readme --- benchmarks/inference-server/triton/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 485265863..bfdb48cdc 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -113,7 +113,7 @@ Example `terraform.tfvars` content: gpu_count = 2 ``` -####[optional] set-up credentials config with kubeconfig +####[optional] set-up credentials config with kubeconfig#### If you created your cluster with steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials like this: ```bash From 1bed2e77f18e353fac0e5a87a209f59cf9ce2cc7 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:47:58 +0000 Subject: [PATCH 04/28] fix readme --- benchmarks/inference-server/triton/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index bfdb48cdc..b0b902d3a 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -66,7 +66,9 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large ``` Replace `your_registry` with your actual Docker registry path. + ***Method 2: Upload Model repository and the relevant scripts to gcs*** + In this method we can directly upload the model engine and scripts to gcs ``` gsutil cp -r gs://your_model_repo/scripts/ ./ @@ -113,7 +115,7 @@ Example `terraform.tfvars` content: gpu_count = 2 ``` -####[optional] set-up credentials config with kubeconfig#### +#### [optional] set-up credentials config with kubeconfig If you created your cluster with steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials like this: ```bash From f101f52dd07f89af8ba85173201746c223cba84b Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:52:36 +0000 Subject: [PATCH 05/28] fix readme --- benchmarks/inference-server/triton/README.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index b0b902d3a..f0df4c81d 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -68,8 +68,8 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large Replace `your_registry` with your actual Docker registry path. ***Method 2: Upload Model repository and the relevant scripts to gcs*** - - In this method we can directly upload the model engine and scripts to gcs + + In this method we can directly upload the model engine and scripts to gcs and use the base image provided by Nvidia: ``` gsutil cp -r gs://your_model_repo/scripts/ ./ gsutil cp -r gs://your_model_repo/all_models/ ./ @@ -91,14 +91,18 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large 2. 
Define `template_path` variable in `terraform.tfvars` If usinng method 1 In Step 1 above: - `` + ```bash template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl" - `` + image_path = "path_to_your_registry/tritonserver_llm:latest" + gpu_count = X + ``` If using method 2 In Step 1 above: - `` + ```bash template_path = "path_to_manifest_template/triton-tensorrtllm-inference-gs.tftpl" - `` + image_path = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3" + gpu_count = X + ``` and also update the `triton-tensorrtllm-inference-gs.tftpl` with the path to the gcs repo under InitContainer and the command to launch the container under Container section. 3. **Configure Your Deployment:** @@ -111,8 +115,6 @@ Example `terraform.tfvars` content: credentials_config = { kubeconfig = "path/to/your/gcloud/credentials.json" } - image_path = "your_registry/tritonserver_llm:latest" - gpu_count = 2 ``` #### [optional] set-up credentials config with kubeconfig From f668bb462eddabf7cf3bb35ed1e3161354b05690 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:53:10 +0000 Subject: [PATCH 06/28] fix readme --- benchmarks/inference-server/triton/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index f0df4c81d..30097646a 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -17,6 +17,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large 2. **Setup the Docker image:** ***Method 1: Add the Model repository and the relevant scripts to the image*** + Inside the `tritonllm_backend` directory, create a Dockerfile with the following content ensuring the Triton Server is ready with the necessary models and scripts. ```Dockerfile From 879b2260e7ed9b0b5b77e23541a6a6bfeeb88150 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:53:53 +0000 Subject: [PATCH 07/28] fix readme --- benchmarks/inference-server/triton/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 30097646a..8bce8cd31 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -17,7 +17,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large 2. **Setup the Docker image:** ***Method 1: Add the Model repository and the relevant scripts to the image*** - + Inside the `tritonllm_backend` directory, create a Dockerfile with the following content ensuring the Triton Server is ready with the necessary models and scripts. ```Dockerfile @@ -91,7 +91,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large ``` 2. 
Define `template_path` variable in `terraform.tfvars` - If usinng method 1 In Step 1 above: + If using method 1 In Step 1 above: ```bash template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl" image_path = "path_to_your_registry/tritonserver_llm:latest" From d4fce179c2f30a6dbeb3b791681c5712bbfa6eba Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:54:42 +0000 Subject: [PATCH 08/28] fix readme --- benchmarks/inference-server/triton/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 8bce8cd31..59e94bb9d 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -90,7 +90,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large cp sample-terraform.tfvars terraform.tfvars ``` -2. Define `template_path` variable in `terraform.tfvars` +2. Define `template_path`, `image_path` and `gpu_count` variable in `terraform.tfvars` If using method 1 In Step 1 above: ```bash template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl" From 23a7db59cfa40e4a8b9ff633b48e388b11ba48ee Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:55:31 +0000 Subject: [PATCH 09/28] fix readme --- benchmarks/inference-server/triton/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 59e94bb9d..cfe39f7d2 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -90,14 +90,15 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large cp sample-terraform.tfvars terraform.tfvars ``` -2. Define `template_path`, `image_path` and `gpu_count` variable in `terraform.tfvars` - If using method 1 In Step 1 above: +2. Define `template_path`, `image_path` and `gpu_count` variable in `terraform.tfvars`. + + * If using method 1 In Step 1 above: ```bash template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl" image_path = "path_to_your_registry/tritonserver_llm:latest" gpu_count = X ``` - If using method 2 In Step 1 above: + * If using method 2 In Step 1 above: ```bash template_path = "path_to_manifest_template/triton-tensorrtllm-inference-gs.tftpl" From 9f7b63c6441183c45cc07a3953f7340e07c9881e Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:56:35 +0000 Subject: [PATCH 10/28] fix readme --- benchmarks/inference-server/triton/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index cfe39f7d2..1d1babad4 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -90,7 +90,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large cp sample-terraform.tfvars terraform.tfvars ``` -2. Define `template_path`, `image_path` and `gpu_count` variable in `terraform.tfvars`. +2. Define `template_path`, `image_path` and `gpu_count` variables in `terraform.tfvars`. 
* If using method 1 In Step 1 above: ```bash From 9e4dfefa6e7ac8dd366d546463cb27a6eeae39fc Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 22:59:25 +0000 Subject: [PATCH 11/28] fix readme --- benchmarks/inference-server/triton/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 1d1babad4..998c45f15 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -109,9 +109,8 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large 3. **Configure Your Deployment:** - Edit the `terraform.tfvars` file to include your specific configuration details. At a minimum, you must specify the `credentials_config` for accessing Google Cloud resources and the `image_path` to your Docker image in the registry. Update the gpu_count to make it consistent with the TensorRTLLM Engine configuration when it was [built](https://github.com/NVIDIA/TensorRT-LLM/tree/0f041b7b57d118037f943fd6107db846ed147b91/examples/llama). - -Example `terraform.tfvars` content: + Edit the `terraform.tfvars` file to include your specific configuration details with the variable `credentials_config`. + Example ```bash credentials_config = { From 33f9a5551ea9b2842a582939e4371c4eedd7f5e6 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 23:04:23 +0000 Subject: [PATCH 12/28] fix readme --- benchmarks/inference-server/triton/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 998c45f15..3551f4b64 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -1,6 +1,6 @@ # AI on GKE: Benchmark TensorRT LLM on Triton Server -This guide provides instructions for deploying and benchmarking a TensorRT Large Language Model (LLM) on Triton Inference Server within a Google Kubernetes Engine (GKE) environment. The process involves building a Docker image with the TensorRT LLM engine and deploying it to a GKE cluster. +This guide provides instructions for deploying and benchmarking a TensorRT Large Language Model (LLM) on Triton Inference Server within a Google Kubernetes Engine (GKE) environment. The process involves building a Docker container with the TensorRT LLM engine and deploying it to a GKE cluster. ## Prerequisites From f7dd7fc7f6153fbac33d21f8070ebf3c36422e7f Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 23:09:37 +0000 Subject: [PATCH 13/28] fix readme --- benchmarks/inference-server/README.md | 1 + benchmarks/inference-server/triton/README.md | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/benchmarks/inference-server/README.md b/benchmarks/inference-server/README.md index 217608420..8dc144fe6 100644 --- a/benchmarks/inference-server/README.md +++ b/benchmarks/inference-server/README.md @@ -3,6 +3,7 @@ Terraform templates associated with them. The current supported options are: - Text Generation Inference (aka TGI) +- TensorRT-LLM on Triton Inference Server You may also choose to manually deploy your own inference server. 
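
Whichever server is deployed, a quick smoke test before benchmarking helps confirm the endpoint is actually serving. The sketch below is a minimal check for the Triton deployment created by these templates; the Deployment name (`triton-infer`, or `triton-infer-gcs` for the GCS variant), the `benchmark` namespace, and the request fields are assumptions based on the default TensorRT-LLM `ensemble` model and may differ in your setup.

```bash
# Forward Triton's HTTP port to localhost (adjust namespace/name to your deployment).
kubectl -n benchmark port-forward deployment/triton-infer 8000:8000 &

# Readiness check: HTTP 200 means the server and its models are loaded.
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Minimal generate request against the ensemble model; field names assume the
# default TensorRT-LLM backend ensemble configuration and may need adjusting.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is GKE?", "max_tokens": 32, "bad_words": "", "stop_words": ""}'
```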
diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 3551f4b64..22f6670a1 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -14,7 +14,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large 1. **Build the TensorRT LLM Engine:** Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine. -2. **Setup the Docker image:** +2. **Setup the Docker container:** ***Method 1: Add the Model repository and the relevant scripts to the image*** From 0304a77a3c875000ee4099a33ccba43a16666f70 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 23:14:54 +0000 Subject: [PATCH 14/28] fix readme --- benchmarks/inference-server/triton/README.md | 4 ++-- benchmarks/inference-server/triton/main.tf | 2 +- benchmarks/inference-server/triton/variables.tf | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 22f6670a1..b7930c3e3 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -72,8 +72,8 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large In this method we can directly upload the model engine and scripts to gcs and use the base image provided by Nvidia: ``` - gsutil cp -r gs://your_model_repo/scripts/ ./ - gsutil cp -r gs://your_model_repo/all_models/ ./ + gsutil cp -r your_script_folder gs://your_model_repo/scripts/ + gsutil cp -r your_model_folder gs://your_model_repo/all_models/ ``` Replace `your_model_repo` with your actual gcs repo path. diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf index a7beb399d..92f129270 100644 --- a/benchmarks/inference-server/triton/main.tf +++ b/benchmarks/inference-server/triton/main.tf @@ -1,5 +1,5 @@ /** - * Copyright 2023 Google LLC + * Copyright 2024 Google LLC * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf index fbeb6d269..c77732139 100644 --- a/benchmarks/inference-server/triton/variables.tf +++ b/benchmarks/inference-server/triton/variables.tf @@ -1,5 +1,5 @@ /** - * Copyright 2023 Google LLC + * Copyright 2024 Google LLC * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. From adef0c0e859c9a6a26e40c21d313bd1bdb4cd67c Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Fri, 1 Mar 2024 23:38:23 +0000 Subject: [PATCH 15/28] fix readme --- benchmarks/inference-server/triton/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index b7930c3e3..572274c06 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -188,7 +188,7 @@ terraform apply | `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No | | `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No | | `ksa` | Kubernetes Service Account used for workload. 
| String | `"default"` | No | -| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | No | +| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes | | `templates_path` | Path where manifest templates will be read from. | String | | No | ## Notes From 75bf52be210b797fa2528c1d420f31b094bf10d9 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Sat, 2 Mar 2024 00:24:13 +0000 Subject: [PATCH 16/28] fix readme --- .../manifest-templates/triton-tensorrtllm-inference-gs.tftpl | 2 +- .../manifest-templates/triton-tensorrtllm-inference.tftpl | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl index 645da0841..5b6df1ff7 100644 --- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl @@ -1,4 +1,4 @@ -# Copyright 20234Google LLC +# Copyright 2024 Google LLC # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl index 47effa9ee..cc28d38d4 100644 --- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl @@ -1,4 +1,4 @@ -# Copyright 2023 Google LLC +# Copyright 2024 Google LLC # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. From 9108fba8fd28d312d434d768a508acf232979890 Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Mon, 4 Mar 2024 02:29:17 +0000 Subject: [PATCH 17/28] fix readme --- benchmarks/inference-server/triton/README.md | 5 ++--- .../triton-tensorrtllm-inference-gs.tftpl | 13 ++++--------- 2 files changed, 6 insertions(+), 12 deletions(-) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 572274c06..64422b375 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -68,11 +68,10 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large ``` Replace `your_registry` with your actual Docker registry path. - ***Method 2: Upload Model repository and the relevant scripts to gcs*** + ***Method 2: Upload Model repository to gcs*** - In this method we can directly upload the model engine and scripts to gcs and use the base image provided by Nvidia: + In this method we can directly upload the model engine to gcs and use the base image provided by Nvidia and specify the command to launch the triton server via the deployment yaml file: ``` - gsutil cp -r your_script_folder gs://your_model_repo/scripts/ gsutil cp -r your_model_folder gs://your_model_repo/all_models/ ``` Replace `your_model_repo` with your actual gcs repo path. 
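
Before applying the Terraform, it is worth confirming that the uploaded repository has the layout Triton expects. The following is a rough sanity check, assuming the conventional `inflight_batcher_llm` layout produced by the `tensorrtllm_backend` build steps; `your_model_repo` remains a placeholder for your bucket.

```bash
# List what was uploaded; each model directory should contain a config.pbtxt.
gsutil ls -r gs://your_model_repo/all_models/inflight_batcher_llm/

# A typical inflight_batcher_llm repository looks roughly like:
#   ensemble/        config.pbtxt
#   preprocessing/   config.pbtxt plus tokenizer files
#   postprocessing/  config.pbtxt plus tokenizer files
#   tensorrt_llm/    config.pbtxt and 1/ holding the compiled engine files
```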
diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl index 5b6df1ff7..7f0c679f9 100644 --- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl @@ -37,26 +37,23 @@ spec: sizeLimit: 1Gi - name: all-models-volume emptyDir: {} - - name: scripts-volume - emptyDir: {} initContainers: - name: init-gcs-download image: google/cloud-sdk serviceAccountName: ${ksa} command: ["/bin/sh", "-c"] args: - - gsutil cp -r gs://your_gcs_repo/all_models ./ && gsutil cp -r gs://your_gcs_repo/scripts ./; + - gsutil cp -r gs://your_gcs_repo/all_models ./; volumeMounts: - name: all-models-volume mountPath: /all_models - - name: scripts-volume - mountPath: /scripts containers: - name: triton-inference image: "${image_path}" # Replace ${image_path} with the actual image path workingDir: /opt/tritonserver + #command: ["/bin/sleep", "3600"] command: ["/bin/bash", "-c"] - args: ["pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && python /scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1"] + args: ["pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :"] ports: - containerPort: 8000 name: http-triton @@ -76,6 +73,4 @@ spec: serviceAccountName: ${ksa} volumeMounts: - name: all-models-volume - mountPath: /all_models - - name: scripts-volume - mountPath: /scripts + mountPath: /all_models \ No newline at end of file From b9565ec4193c04cea89257ad5badd16cad7f972f Mon Sep 17 00:00:00 2001 From: kaushikmitr Date: Thu, 7 Mar 2024 23:34:16 +0000 Subject: [PATCH 18/28] updated gcs method as default and docker method as alternate --- benchmarks/inference-server/triton/README.md | 111 ++++++++---------- benchmarks/inference-server/triton/main.tf | 20 ++-- ...triton-tensorrtllm-inference-docker.tftpl} | 0 .../triton-tensorrtllm-inference-gs.tftpl | 8 +- .../triton/sample-terraform.tfvars | 2 +- .../inference-server/triton/variables.tf | 17 ++- 6 files changed, 81 insertions(+), 77 deletions(-) rename benchmarks/inference-server/triton/manifest-templates/{triton-tensorrtllm-inference.tftpl => triton-tensorrtllm-inference-docker.tftpl} (100%) diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md index 64422b375..17e52fc28 100644 --- a/benchmarks/inference-server/triton/README.md +++ b/benchmarks/inference-server/triton/README.md @@ -1,25 +1,30 @@ # AI on GKE: Benchmark TensorRT LLM on Triton Server -This guide provides instructions for deploying and benchmarking a TensorRT Large Language Model (LLM) on Triton Inference Server within a Google Kubernetes Engine (GKE) environment. The process involves building a Docker container with the TensorRT LLM engine and deploying it to a GKE cluster. +This guide outlines the steps for deploying and benchmarking a TensorRT Large Language Model (LLM) on the Triton Inference Server within Google Kubernetes Engine (GKE). 
It includes the process of building a Docker container equipped with the TensorRT LLM engine and deploying this container to a GKE cluster. ## Prerequisites -- Docker -- Google Cloud SDK -- Kubernetes CLI (kubectl) -- Hugging Face account for model access -- NVIDIA GPU drivers and CUDA toolkit (to build the TensorRTLLM Engine) - +Ensure you have the following prerequisites installed and set up: +- Docker for containerization +- Google Cloud SDK for interacting with Google Cloud services +- Kubernetes CLI (kubectl) for managing Kubernetes clusters +- A Hugging Face account to access models +- NVIDIA GPU drivers and CUDA toolkit for building the TensorRT LLM Engine ## Step 1: Build the TensorRT LLM Engine and Docker Image 1. **Build the TensorRT LLM Engine:** Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine. -2. **Setup the Docker container:** - - ***Method 1: Add the Model repository and the relevant scripts to the image*** +2. **Upload the Model to Google Cloud Storage (GCS)** - Inside the `tritonllm_backend` directory, create a Dockerfile with the following content ensuring the Triton Server is ready with the necessary models and scripts. + Transfer your model engine to GCS using: + ``` + gsutil cp -r your_model_folder gs://your_model_repo/all_models/ + ``` + *Ensure to replace your_model_repo with your actual GCS repository path.** + + **Alternate method: Add the Model repository and the relevant scripts to the image** + Construct a new image from Nvidia's base image and integrate the model repository and necessary scripts directly into it, bypassing the need for GCS during runtime. In the `tritonllm_backend` directory, create a Dockerfile with the following content: ```Dockerfile FROM nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 @@ -35,7 +40,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large CMD ["/opt/scripts/start_triton_servers.sh"] ``` - The Shell script `/opt/scripts/start_triton_servers.sh` is like below: + For the initialization script located at `/opt/scripts/start_triton_servers.sh`, follow the structure below: ```start_triton_servers.sh #!/bin/bash @@ -48,33 +53,25 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large # Launch the servers (modify this depending on the number of GPU used and the exact path to the model repo) - mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver \ + /opt/tritonserver/bin/tritonserver \ --model-repository=/all_models/inflight_batcher_llm \ --disable-auto-complete-config \ - --backend-config=python,shm-region-prefix-name=prefix0_ : \ - -n 1 /opt/tritonserver/bin/tritonserver \ + --backend-config=python,shm-region-prefix-name=prefix0_\ + /opt/tritonserver/bin/tritonserver \ --model-repository=/all_models/inflight_batcher_llm \ --disable-auto-complete-config \ - --backend-config=python,shm-region-prefix-name=prefix1_ :``` + --backend-config=python,shm-region-prefix-name=prefix1_``` ``` - *Build and Push the Docker Image:* + *Build and Push the Docker Image:* - Build the Docker image and push it to your container registry: + Build the Docker image and push it to your container registry: ``` docker build -t your_registry/tritonserver_llm:latest . docker push your_registry/tritonserver_llm:latest ``` - Replace `your_registry` with your actual Docker registry path. 
- - ***Method 2: Upload Model repository to gcs*** - - In this method we can directly upload the model engine to gcs and use the base image provided by Nvidia and specify the command to launch the triton server via the deployment yaml file: - ``` - gsutil cp -r your_model_folder gs://your_model_repo/all_models/ - ``` - Replace `your_model_repo` with your actual gcs repo path. + Substitute `your_registry` with your Docker registry's path. @@ -83,57 +80,42 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large 1. **Initialize Terraform Variables:** - Create a `terraform.tfvars` file by copying the provided example: + Start by creating a `terraform.tfvars` file from the provided template: ```bash cp sample-terraform.tfvars terraform.tfvars ``` -2. Define `template_path`, `image_path` and `gpu_count` variables in `terraform.tfvars`. - - * If using method 1 In Step 1 above: - ```bash - template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl" - image_path = "path_to_your_registry/tritonserver_llm:latest" - gpu_count = X - ``` - * If using method 2 In Step 1 above: +2. **Specify Essential Variables** - ```bash - template_path = "path_to_manifest_template/triton-tensorrtllm-inference-gs.tftpl" - image_path = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3" - gpu_count = X + At a minimum, configure the `gcs_model_path` in `terraform.tfvars` to point to your Google Cloud Storage (GCS) repository. You may also need to adjust `image_path`, `gpu_count`, and `server_launch_command_string` according to your specific requirements. + ```bash + gcs_model_path = gs://path_to_model_repo/all_models ``` - and also update the `triton-tensorrtllm-inference-gs.tftpl` with the path to the gcs repo under InitContainer and the command to launch the container under Container section. - -3. **Configure Your Deployment:** - Edit the `terraform.tfvars` file to include your specific configuration details with the variable `credentials_config`. - Example + If `gcs_model_path` is not defined, it is inferred that you are utilizing the method of direct image integration. In this scenario, ensure to update `image_path` along with any pertinent variables accordingly: - ```bash - credentials_config = { - kubeconfig = "path/to/your/gcloud/credentials.json" - } + ```bash + image_path = "path_to_your_registry/tritonserver_llm:latest" ``` -#### [optional] set-up credentials config with kubeconfig +3. **Configure Your Deployment:** + + Edit `terraform.tfvars` to tailor your deployment's configuration, particularly the `credentials_config`. Depending on your cluster's setup, this might involve specifying fleet host credentials or isolating your cluster's kubeconfig: -If you created your cluster with steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials like this: + ***For Fleet Management Enabled Clusters:*** ```bash credentials_config = { fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME" } -``` - - -If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the `credentials_config`. You must isolate your cluster's kubeconfig from other clusters in the default kube.config file. 
To do this, run the following command: - +``` + ***For Clusters without Fleet Management:*** +Separate your cluster's kubeconfig: ```bash KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION ``` -Then update your `terraform.tfvars` `credentials_config` to the following: +And then update your `terraform.tfvars` accordingly: ```bash credentials_config = { @@ -143,7 +125,7 @@ credentials_config = { } ``` -#### [optional] set up secret token in Secret Manager +#### [optional] Setting Up Secret Tokens A model may require a security token to access it. For example, Llama2 from HuggingFace is a gated model that requires a [user access token](https://huggingface.co/docs/hub/en/security-tokens). If the model you want to run does not require this, skip this step. @@ -152,7 +134,8 @@ If you followed steps from `../../infra/stage-2`, Secret Manager and the user ac kubectl create secret generic huggingface-secret --from-literal=token='************' ``` -This command creates a new Secret named huggingface-secret, which has a key token containing your Hugging Face CLI token. +Executing this command generates a new Secret named huggingface-secret, which incorporates a key named token that stores your Hugging Face CLI token. It is important to note that for any production or shared environments, directly storing user access tokens as literals is not advisable. + ## Step 3: login to gcloud @@ -183,12 +166,16 @@ terraform apply |----------------------|-----------------------------------------------------------------------------------------------|---------|-------------------------------------------|----------| | `credentials_config` | Configure how Terraform authenticates to the cluster. | Object | | No | | `namespace` | Namespace used for Nvidia DCGM resources. | String | `"default"` | No | -| `image_path` | Image Path stored in Artifact Registry | String | | No | +| `image_path` | Image Path stored in Artifact Registry | String | "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3" | No | | `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No | | `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No | | `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No | | `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes | -| `templates_path` | Path where manifest templates will be read from. | String | | No | +| `gcs_model_path` | Path where model engine in gcs will be read from. | String | "" | No | +| `server_launch_command_string` | Command to launc the Triton Inference Server | String | "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :" | No | + + + ## Notes diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf index 92f129270..21093c6ef 100644 --- a/benchmarks/inference-server/triton/main.tf +++ b/benchmarks/inference-server/triton/main.tf @@ -14,13 +14,19 @@ * limitations under the License. */ + locals { + template_path = var.gcs_model_path == null ? 
"${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl" : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl" +} + resource "kubernetes_manifest" "default" { - manifest = yamldecode(templatefile(var.template_path, { - namespace = var.namespace - ksa = var.ksa - image_path = var.image_path - huggingface-secret = var.huggingface-secret - gpu_count = var.gpu_count - model_id = var.model_id + manifest = yamldecode(templatefile(local.template_path, { + namespace = var.namespace + ksa = var.ksa + image_path = var.image_path + huggingface-secret = var.huggingface-secret + gpu_count = var.gpu_count + model_id = var.model_id + gcs_model_path = var.gcs_model_path + server_launch_command_string = var.server_launch_command_string })) } diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl similarity index 100% rename from benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl rename to benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl index 7f0c679f9..30940a4cd 100644 --- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl +++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl @@ -43,7 +43,7 @@ spec: serviceAccountName: ${ksa} command: ["/bin/sh", "-c"] args: - - gsutil cp -r gs://your_gcs_repo/all_models ./; + - gsutil cp -r ${gcs_model_path} ./; volumeMounts: - name: all-models-volume mountPath: /all_models @@ -53,7 +53,7 @@ spec: workingDir: /opt/tritonserver #command: ["/bin/sleep", "3600"] command: ["/bin/bash", "-c"] - args: ["pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :"] + args: ["${server_launch_command_string}"] ports: - containerPort: 8000 name: http-triton @@ -73,4 +73,6 @@ spec: serviceAccountName: ${ksa} volumeMounts: - name: all-models-volume - mountPath: /all_models \ No newline at end of file + mountPath: /all_models + - mountPath: /dev/shm + name: dshm \ No newline at end of file diff --git a/benchmarks/inference-server/triton/sample-terraform.tfvars b/benchmarks/inference-server/triton/sample-terraform.tfvars index 4b9a4029d..5372d8d7a 100644 --- a/benchmarks/inference-server/triton/sample-terraform.tfvars +++ b/benchmarks/inference-server/triton/sample-terraform.tfvars @@ -7,4 +7,4 @@ ksa = "benchmark-ksa" model_id = "meta-llama/Llama-2-7b-chat-hf" gpu_count = 1 image_path = "" -template_path = "" \ No newline at end of file +gcs_model_path = "" \ No newline at end of file diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf index c77732139..51d65eaeb 100644 --- a/benchmarks/inference-server/triton/variables.tf +++ b/benchmarks/inference-server/triton/variables.tf @@ -34,7 +34,7 @@ variable "credentials_config" { } variable "namespace" { - description = "Namespace used for Nvidia DCGM resources." 
+  description = "Namespace used for Nvidia Triton resources."
  type        = string
  nullable    = false
  default     = "default"
@@ -44,6 +44,7 @@ variable "image_path" {
  description = "Image Path stored in Artifact Registry"
  type        = string
  nullable    = false
+  default     = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
}

variable "model_id" {
@@ -74,9 +75,17 @@ variable "huggingface-secret" {
  default     = "huggingface-secret"
}

-variable "template_path" {
-  description = "Path where manifest templates will be read from."
+variable "gcs_model_path" {
+  description = "Path to the GCS repo where model is stored"
  type        = string
-  nullable    = false
+  nullable    = true
+  default     = null
+}
+
+variable "server_launch_command_string" {
+  description = "command to launch the triton server"
+  type        = string
+  nullable    = true
+  default     = "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"
}

From 006a06af6e25fb5d0a6f7d45423892321bb90e70 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Thu, 7 Mar 2024 23:37:04 +0000
Subject: [PATCH 19/28] remove comment in tftpl

---
 .../manifest-templates/triton-tensorrtllm-inference-gs.tftpl | 1 -
 1 file changed, 1 deletion(-)

diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
index 30940a4cd..a04f969f1 100644
--- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
+++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
@@ -51,7 +51,6 @@ spec:
      - name: triton-inference
        image: "${image_path}" # Replace ${image_path} with the actual image path
        workingDir: /opt/tritonserver
-        #command: ["/bin/sleep", "3600"]
        command: ["/bin/bash", "-c"]
        args: ["${server_launch_command_string}"]
        ports:

From 8e688a1e80570012cecedc0b65fad65876fe05c7 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Thu, 7 Mar 2024 23:40:47 +0000
Subject: [PATCH 20/28] fix readme

---
 benchmarks/inference-server/triton/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
index 17e52fc28..e2f3e5966 100644
--- a/benchmarks/inference-server/triton/README.md
+++ b/benchmarks/inference-server/triton/README.md
@@ -172,7 +172,7 @@ terraform apply
| `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No |
| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes |
| `gcs_model_path` | Path where the model engine in GCS will be read from. | String | "" | No |
-| `server_launch_command_string` | Command to launch the Triton Inference Server | String | "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :" | No |
+| `server_launch_command_string` | Command to launch the Triton Inference Server | String | "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_" | No |

From 52dd065cda4d914145f5056f90d1f4ef970e2a9c Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Thu, 7 Mar 2024 23:48:15 +0000
Subject: [PATCH 21/28] format tf

---
 benchmarks/inference-server/triton/main.tf                 | 4 +++-
 benchmarks/inference-server/triton/sample-terraform.tfvars | 1 -
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf
index 21093c6ef..82187c344 100644
--- a/benchmarks/inference-server/triton/main.tf
+++ b/benchmarks/inference-server/triton/main.tf
@@ -15,7 +15,9 @@
 */

locals {
-  template_path = var.gcs_model_path == null ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl" : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
+  template_path = var.gcs_model_path == null
+  ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl"
+  : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
}

diff --git a/benchmarks/inference-server/triton/sample-terraform.tfvars b/benchmarks/inference-server/triton/sample-terraform.tfvars
index 5372d8d7a..ed07e8821 100644
--- a/benchmarks/inference-server/triton/sample-terraform.tfvars
+++ b/benchmarks/inference-server/triton/sample-terraform.tfvars
@@ -6,5 +6,4 @@ namespace = "benchmark"
ksa = "benchmark-ksa"
model_id = "meta-llama/Llama-2-7b-chat-hf"
gpu_count = 1
-image_path = ""
gcs_model_path = ""
\ No newline at end of file

From 1e1f393b76cd23c6b1086bdc55b6fed2c20fdf86 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Fri, 8 Mar 2024 00:02:25 +0000
Subject: [PATCH 22/28] fix readme

---
 benchmarks/inference-server/triton/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
index e2f3e5966..7bd422b90 100644
--- a/benchmarks/inference-server/triton/README.md
+++ b/benchmarks/inference-server/triton/README.md
@@ -60,7 +60,7 @@ Ensure you have the following prerequisites installed and set up:
    /opt/tritonserver/bin/tritonserver \
    --model-repository=/all_models/inflight_batcher_llm \
    --disable-auto-complete-config \
-    --backend-config=python,shm-region-prefix-name=prefix1_```
+    --backend-config=python,shm-region-prefix-name=prefix1_
   ```
   *Build and Push the Docker Image:*

From 904a233055bae8594db2f79454816a58b0ae43f3 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Fri, 8 Mar 2024 00:35:42 +0000
Subject: [PATCH 23/28] fix readme

---
 benchmarks/inference-server/triton/main.tf      | 7 +++++--
 .../triton-tensorrtllm-inference-docker.tftpl   | 2 +-
 benchmarks/inference-server/triton/variables.tf | 2 +-
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf
index 82187c344..4a43dcb3a 100644
--- a/benchmarks/inference-server/triton/main.tf
+++ b/benchmarks/inference-server/triton/main.tf
@@ -15,9 +15,12 @@
 */

locals {
-  template_path = var.gcs_model_path == null
+
+  template_path = (
+    var.gcs_model_path == null
  ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl"
  : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
+  )
}

resource "kubernetes_manifest" "default" {
@@ -25,7 +28,7 @@ resource "kubernetes_manifest" "default" {
    namespace                    = var.namespace
    ksa                          = var.ksa
    image_path                   = var.image_path
-    huggingface-secret           = var.huggingface-secret
+    huggingface_secret           = var.huggingface_secret
    gpu_count                    = var.gpu_count
    model_id                     = var.model_id
    gcs_model_path               = var.gcs_model_path
diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl
index cc28d38d4..fd45a5c46 100644
--- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl
+++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl
@@ -50,7 +50,7 @@ spec:
        - name: HUGGINGFACE_TOKEN
          valueFrom:
            secretKeyRef:
-              name: ${huggingface-secret}
+              name: ${huggingface_secret}
              key: token
        resources:
          limits:
diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf
index 51d65eaeb..c481ef1b8 100644
--- a/benchmarks/inference-server/triton/variables.tf
+++ b/benchmarks/inference-server/triton/variables.tf
@@ -68,7 +68,7 @@ variable "ksa" {
  default     = "default"
}

-variable "huggingface-secret" {
+variable "huggingface_secret" {
  description = "name of the kubectl huggingface secret token"
  type        = string
  nullable    = true

From f983f7df4f0a494429754a8bd8897ad4e5bc05cb Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Fri, 8 Mar 2024 00:37:16 +0000
Subject: [PATCH 24/28] fix readme

---
 benchmarks/inference-server/triton/main.tf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf
index 4a43dcb3a..48206651a 100644
--- a/benchmarks/inference-server/triton/main.tf
+++ b/benchmarks/inference-server/triton/main.tf
@@ -17,7 +17,7 @@ locals {

  template_path = (
-    var.gcs_model_path == null
+    var.gcs_model_path == null
  ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl"
  : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
  )

From 1f22e32fe9e4c968b98703f511958d805e370f31 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Fri, 8 Mar 2024 00:49:26 +0000
Subject: [PATCH 25/28] fix readme

---
 benchmarks/inference-server/triton/main.tf           | 10 +++++-----
 .../inference-server/triton/sample-terraform.tfvars  |  8 ++++----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf
index 48206651a..5832ad609 100644
--- a/benchmarks/inference-server/triton/main.tf
+++ b/benchmarks/inference-server/triton/main.tf
@@ -14,12 +14,12 @@
 * limitations under the License.
 */

- locals {
-
+locals {
+
  template_path = (
-    var.gcs_model_path == null
-    ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl"
-    : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
+    var.gcs_model_path == null
+    ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl"
+    : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
  )
}

diff --git a/benchmarks/inference-server/triton/sample-terraform.tfvars b/benchmarks/inference-server/triton/sample-terraform.tfvars
index ed07e8821..ec0438c13 100644
--- a/benchmarks/inference-server/triton/sample-terraform.tfvars
+++ b/benchmarks/inference-server/triton/sample-terraform.tfvars
@@ -2,8 +2,8 @@ credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/ai-benchmark"
}

-namespace = "benchmark"
-ksa = "benchmark-ksa"
-model_id = "meta-llama/Llama-2-7b-chat-hf"
-gpu_count = 1
+namespace      = "benchmark"
+ksa            = "benchmark-ksa"
+model_id       = "meta-llama/Llama-2-7b-chat-hf"
+gpu_count      = 1
gcs_model_path = ""
\ No newline at end of file

From af4bec4915a11092caa1358f09d6e4a5ffa15c81 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Fri, 8 Mar 2024 00:54:23 +0000
Subject: [PATCH 26/28] fix readme

---
 benchmarks/inference-server/triton/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
index 7bd422b90..814091dc8 100644
--- a/benchmarks/inference-server/triton/README.md
+++ b/benchmarks/inference-server/triton/README.md
@@ -171,7 +171,7 @@ terraform apply
| `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No |
| `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No |
| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes |
-| `gcs_model_path` | Path where the model engine in GCS will be read from. | String | "" | No |
+| `gcs_model_path` | Path where the model engine in GCS will be read from. | String | null | Yes |
| `server_launch_command_string` | Command to launch the Triton Inference Server | String | "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_" | No |

From 961b0c7e219c6c5793697d99033791663b8a39f9 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Fri, 8 Mar 2024 00:56:21 +0000
Subject: [PATCH 27/28] fix readme

---
 benchmarks/inference-server/triton/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
index 814091dc8..81911e177 100644
--- a/benchmarks/inference-server/triton/README.md
+++ b/benchmarks/inference-server/triton/README.md
@@ -170,7 +170,7 @@ terraform apply
| `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No |
| `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No |
| `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No |
-| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes |
+| `huggingface_secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes |
| `gcs_model_path` | Path where the model engine in GCS will be read from. | String | null | Yes |
| `server_launch_command_string` | Command to launch the Triton Inference Server | String | "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_" | No |

From f25a071215d77d4a61aedf4d9c4e77d4b8b9cb39 Mon Sep 17 00:00:00 2001
From: kaushikmitr
Date: Tue, 12 Mar 2024 03:13:37 +0000
Subject: [PATCH 28/28] add mpirun to docker cmd

---
 benchmarks/inference-server/triton/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
index 81911e177..06f067f33 100644
--- a/benchmarks/inference-server/triton/README.md
+++ b/benchmarks/inference-server/triton/README.md
@@ -53,14 +53,14 @@ Ensure you have the following prerequisites installed and set up:

   # Launch the servers (modify this depending on the number of GPUs used and the exact path to the model repo)
-   /opt/tritonserver/bin/tritonserver \
+   mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver \
   --model-repository=/all_models/inflight_batcher_llm \
   --disable-auto-complete-config \
-   --backend-config=python,shm-region-prefix-name=prefix0_\
-   /opt/tritonserver/bin/tritonserver \
+   --backend-config=python,shm-region-prefix-name=prefix0_ : \
+   -n 1 /opt/tritonserver/bin/tritonserver \
   --model-repository=/all_models/inflight_batcher_llm \
   --disable-auto-complete-config \
-   --backend-config=python,shm-region-prefix-name=prefix1_
+   --backend-config=python,shm-region-prefix-name=prefix1_ :
   ```
   *Build and Push the Docker Image:*
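
As a rough illustration of how the two-rank `mpirun` launch above generalizes to more GPUs, the sketch below builds one MPMD application context per Triton instance. The `GPU_COUNT` and `MODEL_REPO` variables are assumptions introduced only for this example; they are not defined by the scripts or templates in this series.

```bash
#!/bin/bash
# Minimal sketch: build an mpirun command with one Triton rank per GPU.
# GPU_COUNT and MODEL_REPO are illustrative environment variables, not part of the repository's scripts.
GPU_COUNT="${GPU_COUNT:-2}"
MODEL_REPO="${MODEL_REPO:-/all_models/inflight_batcher_llm}"

cmd="mpirun --allow-run-as-root"
for ((i = 0; i < GPU_COUNT; i++)); do
  # Each instance is a separate MPMD application context; contexts are separated by ':'.
  [ "$i" -gt 0 ] && cmd+=" :"
  cmd+=" -n 1 /opt/tritonserver/bin/tritonserver"
  cmd+=" --model-repository=${MODEL_REPO}"
  cmd+=" --disable-auto-complete-config"
  cmd+=" --backend-config=python,shm-region-prefix-name=prefix${i}_"
done

exec ${cmd}
```

If a launcher along these lines is used, the number of ranks should stay consistent with the `gpu_count` Terraform variable and with the parallelism the TensorRT LLM engine was built for.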