This example provides instructions for multi-node deployment of LLMs on Amazon Elastic Kubernetes Service (EKS). It includes instructions for building a custom image to enable features like EFA, a Helm chart, and an associated Python script. The deployment flow uses NVIDIA TensorRT-LLM as the inference engine and NVIDIA Triton Inference Server as the model server.
With one pod per node, the main challenge in deploying models that require multiple nodes is that a single instance of the model spans multiple nodes and therefore multiple pods. Consequently, the atomic unit that must be ready before requests can be served, and the unit that must be scaled, becomes a group of pods. This example shows how to address these problems and provides code to set up the following:
- LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods: To launch Triton and TRT-LLM across nodes, you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes, so we need to spawn groups of pods and know which model instance group each belongs to. To achieve this we use LeaderWorkerSet, which lets us create "megapods" consisting of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in `deployment.yaml` and `server.py` (a trimmed-down sketch follows this list).
- Gang Scheduling: Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of `server.py`.
- Autoscaling: By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads, we don't want to use CPU and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in `triton-metrics_prometheus-rule.yaml`. We also demonstrate how to properly set up PodMonitors and an HPA in `pod-monitor.yaml` and `hpa.yaml` (the key is to scrape metrics only from the leader pods); sketches of the recording rule, PodMonitor, and HPA follow this list. Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](./2. Configure_EKS_Cluster.md). To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the Cluster Autoscaler.
- LoadBalancer Setup: Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service in `service.yaml` so that external clients can submit requests (see the sketch after this list).
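To make the LeaderWorkerSet item above concrete, here is a minimal, hypothetical sketch of the kind of manifest the chart's `deployment.yaml` renders. The names, image placeholder, launch command, and resource counts (8 GPUs and 32 EFA interfaces per p5.48xlarge node) are illustrative assumptions; the actual chart templates these values.

```yaml
# Hypothetical, trimmed-down LeaderWorkerSet manifest (not the chart's actual deployment.yaml)
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: triton-trtllm               # hypothetical release name
spec:
  replicas: 1                       # number of model instances ("megapods")
  leaderWorkerTemplate:
    size: 2                         # pods (nodes) per model instance: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: triton
          image: <custom-triton-trtllm-image>   # EFA-enabled image built earlier
          command: ["python3", "server.py"]     # assumed entrypoint that launches mpirun across the group
          resources:
            limits:
              nvidia.com/gpu: 8                 # GPUs per p5.48xlarge node
              vpc.amazonaws.com/efa: 32         # EFA interfaces per p5.48xlarge node
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: <custom-triton-trtllm-image>
          resources:
            limits:
              nvidia.com/gpu: 8
              vpc.amazonaws.com/efa: 32
```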
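A sketch of the GPU-utilization recording rule that `triton-metrics_prometheus-rule.yaml` sets up might look like the following. The rule and metric names are hypothetical; `nv_gpu_utilization` is the GPU utilization gauge Triton Server exports, and the `release: prometheus` label is assumed to match the Prometheus Operator's rule selector in your cluster.

```yaml
# Hypothetical recording rule; the chart defines the actual expression and labels
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-metrics
  labels:
    release: prometheus             # assumed to match the Prometheus Operator's ruleSelector
spec:
  groups:
  - name: gpu-utilization
    rules:
    - record: triton:gpu_utilization:avg        # hypothetical recorded metric name
      expr: avg by (namespace, pod) (nv_gpu_utilization)
```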
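The PodMonitor in `pod-monitor.yaml` only needs to scrape the leader pods, since the leader runs the Triton frontend and its metrics endpoint. A hedged sketch, assuming the standard LeaderWorkerSet pod labels and a container port named `metrics` for Triton's metrics endpoint (port 8002):

```yaml
# Hypothetical PodMonitor that scrapes only leader pods
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-leader-pods          # hypothetical name
  labels:
    release: prometheus             # assumed to match the Prometheus Operator's podMonitorSelector
spec:
  selector:
    matchLabels:
      # LeaderWorkerSet labels the leader pod with worker-index "0"; scraping only
      # leaders counts each model instance exactly once
      leaderworkerset.sigs.k8s.io/worker-index: "0"
  podMetricsEndpoints:
  - port: metrics                   # assumed container port name for Triton's metrics endpoint
```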
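And a sketch of `hpa.yaml`, assuming the recording rule above is exposed to the HPA through a Prometheus adapter and that the HPA targets the LeaderWorkerSet's scale subresource so whole pod groups are added or removed. The metric name, target value, and replica bounds are illustrative assumptions:

```yaml
# Hypothetical HPA scaling the LeaderWorkerSet on a custom GPU-utilization metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet           # scale whole "megapods", not individual pods
    name: triton-trtllm             # must match the LeaderWorkerSet above
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    metric:
      name: triton:gpu_utilization:avg   # hypothetical metric from the recording rule
    target:
      type: AverageValue
      averageValue: "0.8"                # illustrative threshold
```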
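Finally, a sketch of the LoadBalancer Service in `service.yaml`. Selecting on the leader label ensures traffic reaches only the pod in each group that runs the Triton frontend; the LeaderWorkerSet name is an assumption, and the port numbers are Triton's defaults (HTTP 8000, gRPC 8001, metrics 8002).

```yaml
# Hypothetical Service exposing only the leader pods
apiVersion: v1
kind: Service
metadata:
  name: triton-frontend             # hypothetical name
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: triton-trtllm     # pods belonging to this LeaderWorkerSet
    leaderworkerset.sigs.k8s.io/worker-index: "0"       # leader pods only
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
```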
Figure 1 – A high-level architecture diagram of what you will provision in this blog. We provision EKS resources, including the pods and worker node groups (p5.48xlarge in this blog); the Control Plane VPC is provisioned in AWS's account and is fully managed. We also provision the Data Plane VPC and associated resources: subnets (public and private), NAT gateways, and the Internet Gateway, plus a CPU instance to handle all tasks that don't require a GPU. We leverage the LWS Kubernetes API to partition a model across nodes using the concept of a "superpod", provision a load balancer, and scale superpods with the HPA and CAS.
Figure 2 – A zoomed-in view of how LWS partitions the Llama3.1-405B model. Using the LWS controller, a Kubernetes Custom Resource Definition (CRD), and Kubernetes StatefulSets (STS), we provision the superpods, each consisting of a leader pod on one GPU node and worker pods on the other GPU nodes. Together, these pods (i.e., one superpod) form one instance of the model. We leverage EFA for low-latency communication between the GPU nodes. We scale the superpods using the Horizontal Pod Autoscaler (HPA), and when pods are "unschedulable", the Cluster Autoscaler (CAS) provisions new p5.48xlarge instances to host new superpods. We scrape custom metrics from the leader pods via Prometheus for observability. Lastly, the Application Load Balancer routes traffic to these superpods (a Kubernetes Ingress resource of type ALB).
- First, create the EFA-enabled EKS cluster by following the steps in Create_EKS_Cluster.md.
- Then follow the Configure_EKS_Cluster.md guide to install the necessary components, such as the Prometheus Kubernetes stack, the EFA device plugin, and LeaderWorkerSet, within the EKS cluster.
- Finally, follow the Deploy_Triton.md guide to build TRT-LLM engines for the Llama 3.1 405B model, set up the Triton model repository, and install the multi-node deployment Helm chart. This guide also covers testing the Horizontal Pod Autoscaler and Cluster Autoscaler and benchmarking LLM inference performance using genai-perf.