This example provides instructions for multi-node deployment of LLMs on Amazon Elastic Kubernetes Service (EKS). It includes instructions for building a custom image to enable features like EFA, a Helm chart, and an associated Python script. The deployment flow uses NVIDIA TensorRT-LLM as the inference engine and NVIDIA Triton Inference Server as the model server.
With one pod per node, the main challenge in deploying models that require multiple nodes is that a single instance of the model spans multiple nodes and therefore multiple pods. Consequently, the atomic unit that must be ready before requests can be served, and the unit that must be scaled, becomes a group of pods. This example shows how to address these problems and provides code to set up the following:
- LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods: To launch Triton and TRT-LLM across nodes, you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes, so we need to spawn groups of pods and know which model instance group each belongs to. To achieve this we use LeaderWorkerSet, which lets us create "megapods" consisting of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in `deployment.yaml` and `server.py` (a trimmed-down sketch follows this list).
- Gang Scheduling: Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of `server.py`.
- Autoscaling: By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads, we don't want to use CPU and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in `triton-metrics_prometheus-rule.yaml`. We also demonstrate how to properly set up PodMonitors and an HPA in `pod-monitor.yaml` and `hpa.yaml` (the key is to scrape metrics only from the leader pods); sketches of the recording rule, PodMonitor, and HPA follow this list. Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](./2. Configure_EKS_Cluster.md). To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the Cluster Autoscaler.
- LoadBalancer Setup: Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service in `service.yaml` so that external clients can submit requests (see the sketch after this list).
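To make the LeaderWorkerSet item above concrete, here is a minimal, hypothetical sketch of the kind of manifest the chart's `deployment.yaml` renders. The names, image placeholder, launch command, and resource counts (8 GPUs and 32 EFA interfaces per p5.48xlarge node) are illustrative assumptions; the actual chart templates these values.

```yaml
# Hypothetical, trimmed-down LeaderWorkerSet manifest (not the chart's actual deployment.yaml)
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: triton-trtllm               # hypothetical release name
spec:
  replicas: 1                       # number of model instances ("megapods")
  leaderWorkerTemplate:
    size: 2                         # pods (nodes) per model instance: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: triton
          image: <custom-triton-trtllm-image>   # EFA-enabled image built earlier
          command: ["python3", "server.py"]     # assumed entrypoint that launches mpirun across the group
          resources:
            limits:
              nvidia.com/gpu: 8                 # GPUs per p5.48xlarge node
              vpc.amazonaws.com/efa: 32         # EFA interfaces per p5.48xlarge node
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: <custom-triton-trtllm-image>
          resources:
            limits:
              nvidia.com/gpu: 8
              vpc.amazonaws.com/efa: 32
```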
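A sketch of the GPU-utilization recording rule that `triton-metrics_prometheus-rule.yaml` sets up might look like the following. The rule and metric names are hypothetical; `nv_gpu_utilization` is the GPU utilization gauge Triton Server exports, and the `release: prometheus` label is assumed to match the Prometheus Operator's rule selector in your cluster.

```yaml
# Hypothetical recording rule; the chart defines the actual expression and labels
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-metrics
  labels:
    release: prometheus             # assumed to match the Prometheus Operator's ruleSelector
spec:
  groups:
  - name: gpu-utilization
    rules:
    - record: triton:gpu_utilization:avg        # hypothetical recorded metric name
      expr: avg by (namespace, pod) (nv_gpu_utilization)
```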
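The PodMonitor in `pod-monitor.yaml` only needs to scrape the leader pods, since the leader runs the Triton frontend and its metrics endpoint. A hedged sketch, assuming the standard LeaderWorkerSet pod labels and a container port named `metrics` for Triton's metrics endpoint (port 8002):

```yaml
# Hypothetical PodMonitor that scrapes only leader pods
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-leader-pods          # hypothetical name
  labels:
    release: prometheus             # assumed to match the Prometheus Operator's podMonitorSelector
spec:
  selector:
    matchLabels:
      # LeaderWorkerSet labels the leader pod with worker-index "0"; scraping only
      # leaders counts each model instance exactly once
      leaderworkerset.sigs.k8s.io/worker-index: "0"
  podMetricsEndpoints:
  - port: metrics                   # assumed container port name for Triton's metrics endpoint
```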
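And a sketch of `hpa.yaml`, assuming the recording rule above is exposed to the HPA through a Prometheus adapter and that the HPA targets the LeaderWorkerSet's scale subresource so whole pod groups are added or removed. The metric name, target value, and replica bounds are illustrative assumptions:

```yaml
# Hypothetical HPA scaling the LeaderWorkerSet on a custom GPU-utilization metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet           # scale whole "megapods", not individual pods
    name: triton-trtllm             # must match the LeaderWorkerSet above
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    metric:
      name: triton:gpu_utilization:avg   # hypothetical metric from the recording rule
    target:
      type: AverageValue
      averageValue: "0.8"                # illustrative threshold
```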
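Finally, a sketch of the LoadBalancer Service in `service.yaml`. Selecting on the leader label ensures traffic reaches only the pod in each group that runs the Triton frontend; the LeaderWorkerSet name is an assumption, and the port numbers are Triton's defaults (HTTP 8000, gRPC 8001, metrics 8002).

```yaml
# Hypothetical Service exposing only the leader pods
apiVersion: v1
kind: Service
metadata:
  name: triton-frontend             # hypothetical name
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: triton-trtllm     # pods belonging to this LeaderWorkerSet
    leaderworkerset.sigs.k8s.io/worker-index: "0"       # leader pods only
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
```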
Figure 1 – A high-level architecture diagram of what you will provision in this blog. We provision EKS resources, including the pods and worker node groups (p5.48xlarge in this blog); the Control Plane VPC is provisioned in AWS's account and is fully managed. We also provision the Data Plane VPC and associated resources: subnets (public and private), NAT gateways, and the Internet Gateway, plus a CPU instance to handle all tasks that don't require a GPU. We leverage the LWS Kubernetes API to partition a model across nodes using the concept of a "superpod", provision a load balancer, and scale superpods with the HPA and CAS.
Figure 2 – A zoomed-in view of how LWS partitions the Llama3.1-405B model. Using the LWS controller, a Kubernetes Custom Resource Definition (CRD), and Kubernetes StatefulSets (STS), we provision the superpods, each consisting of a leader pod on one GPU node and worker pods on the other GPU nodes. Together, these pods (i.e., one superpod) form one instance of the model. We leverage EFA for low-latency communication between the GPU nodes. We scale the superpods using the Horizontal Pod Autoscaler (HPA), and when pods are "unschedulable", the Cluster Autoscaler (CAS) provisions new p5.48xlarge instances to host new superpods. We scrape custom metrics from the leader pods via Prometheus for observability. Lastly, the Application Load Balancer routes traffic to these superpods (a Kubernetes Ingress resource of type ALB).
- First, create the EFA-enabled EKS cluster by following the steps in Create_EKS_Cluster.md.
- Then follow the Configure_EKS_Cluster.md guide to install the necessary components, such as the Prometheus Kubernetes stack, the EFA device plugin, and LeaderWorkerSet, within the EKS cluster.
- Finally, follow the Deploy_Triton.md guide to build TRT-LLM engines for the Llama 3.1 405B model, set up the Triton model repository, and install the multi-node deployment Helm chart. This guide also covers testing the Horizontal Pod Autoscaler and Cluster Autoscaler and benchmarking LLM inference performance using genai-perf.