YAMLs to deploy LLM on AKS using vLLM for single or multiple GPU

palash-fin/vLLM_Deploy_AKS

vLLM Deployment YAMLs for Azure Kubernetes Service

This repository contains the deployment configurations needed to run vLLM workloads on Kubernetes. There is one configuration for single-GPU nodes and one for multi-GPU nodes, both intended for GPU-accelerated clusters such as Azure Kubernetes Service (AKS).

Files in the Repository

1. dep_vllm_1GPU.yaml

  • Description:
    This YAML file defines the deployment configuration for running vLLM workloads on a single GPU. It targets nodes with one A100 GPU (e.g., Standard_NC24ads_A100_v4).
  • Key Arguments:
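    • --gpu-memory-utilization=0.95: Lets vLLM use up to 95% of the GPU's memory for model weights, activations, and the KV cache.
    • --enforce-eager: Forces PyTorch eager-mode execution and disables CUDA graph capture, which saves some GPU memory at the cost of throughput.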
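
To make the 0.95 setting concrete, the following is a minimal sketch of the arithmetic, not part of the repository's YAML. It assumes an 80 GiB GPU; the variable names are illustrative, and the total should be adjusted to match your VM size.

```shell
# Assumed total GPU memory in GiB (an assumption; adjust for your VM size).
GPU_MEM_GIB=80
# --gpu-memory-utilization=0.95 written as a percentage to stay in integer arithmetic.
UTIL_PCT=95

# Memory vLLM is allowed to claim, and what stays free for everything else on the GPU.
BUDGET_GIB=$(( GPU_MEM_GIB * UTIL_PCT / 100 ))
RESERVE_GIB=$(( GPU_MEM_GIB - BUDGET_GIB ))
echo "vLLM budget: ${BUDGET_GIB} GiB, left free: ${RESERVE_GIB} GiB"
```

If the pod fails with out-of-memory errors at startup, lowering the fraction is usually the first thing to try.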

2. dep_vllm_Multi_GPU.yaml

  • Description:
    This YAML file defines the deployment configuration for running vLLM workloads on multiple GPUs. It is designed for nodes with several T4 GPUs (e.g., Standard_NC64as_T4_v3, which has four) or other multi-GPU configurations.
  • Key Arguments:
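    • --tensor-parallel-size=4: Enables distributed inference by splitting the model across 4 GPUs. The value should not exceed the number of GPUs available to the pod.
    • --max-model-len=24300: Caps the model context length (prompt plus generated tokens) at 24,300 tokens.
    • --dtype=float: Loads weights and runs computation in 32-bit floating point (float is vLLM's alias for float32).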
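
As a rough sanity check on how these arguments interact, the sketch below estimates the weight memory each GPU holds under tensor parallelism. The 7-billion-parameter model size is a hypothetical example, not something the repository specifies, and the estimate ignores the KV cache, activations, and any layers that are not split evenly.

```shell
# Hypothetical model size in billions of parameters (not taken from the repository).
PARAMS_B=7
# --dtype=float means 32-bit floats, i.e. 4 bytes per parameter.
BYTES_PER_PARAM=4
# --tensor-parallel-size=4
TP_SIZE=4

# One billion parameters at N bytes each is roughly N GB of weights.
WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))
PER_GPU_GB=$(( WEIGHTS_GB / TP_SIZE ))
echo "weights: ${WEIGHTS_GB} GB total, about ${PER_GPU_GB} GB per GPU"
```

Whatever remains on each GPU after the weights is what the KV cache can use, which is why --max-model-len often has to be capped on smaller cards.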

Prerequisites

1. Kubernetes Secret for Hugging Face Token

Before deploying, create a Kubernetes secret to store your Hugging Face token. This token is required to pull the model during container initialization.

Using the secret YAML:

https://github.com/palash-fin/vLLM_Deploy_AKS/blob/main/secret_hf.yaml
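
Kubernetes requires values under a Secret's data field to be base64-encoded. Assuming the manifest stores the token that way (secret_hf.yaml is not reproduced here, so this is an assumption), a minimal sketch for producing the encoded value looks like this. The token shown is a hypothetical placeholder.

```shell
# Hypothetical placeholder; substitute your real Hugging Face token.
HF_TOKEN='hf_example'

# printf avoids the trailing newline that echo would add to the encoded value.
ENCODED=$(printf '%s' "$HF_TOKEN" | base64)
echo "HUGGINGFACE_TOKEN: $ENCODED"

# Decode again to confirm the value round-trips unchanged.
DECODED=$(printf '%s' "$ENCODED" | base64 --decode)
```

If the manifest uses stringData instead, the token can be pasted in as plain text and no encoding step is needed.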

Using kubectl:

kubectl create secret generic vllm-model-pull-hf \
  --from-literal=HUGGINGFACE_TOKEN=<your_token> \
  -n genai-deployment
  
