This repository contains the deployment configurations needed to run vLLM workloads on Kubernetes. The configurations are tailored for single-GPU and multi-GPU setups, enabling seamless deployment in GPU-accelerated environments such as Azure Kubernetes Service (AKS).
### Single-GPU Deployment

- **Description:** This YAML file defines the deployment configuration for running vLLM workloads on a single GPU. It is ideal for nodes with a single A100 GPU (e.g., `Standard_NC24ads_A100_v4`). A manifest sketch is shown after this list.
- **Key Arguments:**
  - `--gpu-memory-utilization=0.95`: Allocates 95% of the GPU memory to the model weights and KV cache.
  - `--enforce-eager`: Forces eager-mode execution (disables CUDA graph capture), trading some latency for lower GPU memory overhead.
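For reference, here is a minimal sketch of what such a single-GPU Deployment manifest could look like. The deployment name, image tag, model ID, and port are illustrative assumptions rather than the exact values in this repository; only the namespace, secret name, and vLLM arguments come from the sections on this page.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-single-gpu              # assumed name; use the name from the repo's manifest
  namespace: genai-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-single-gpu
  template:
    metadata:
      labels:
        app: vllm-single-gpu
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # official vLLM OpenAI-compatible server image
          args:
            - "--model=<model_id>"         # placeholder: the Hugging Face model to serve
            - "--gpu-memory-utilization=0.95"
            - "--enforce-eager"
          env:
            - name: HUGGING_FACE_HUB_TOKEN # read by the Hugging Face client in the container
              valueFrom:
                secretKeyRef:
                  name: vllm-model-pull-hf # the secret created in the step below
                  key: HUGGINGFACE_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1            # one A100 on Standard_NC24ads_A100_v4
          ports:
            - containerPort: 8000          # vLLM's default OpenAI-compatible API port
```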
### Multi-GPU Deployment

- **Description:** This YAML file defines the deployment configuration for running vLLM workloads across multiple GPUs. It is designed for nodes with T4 GPUs (e.g., `Standard_NC64as_T4_v3`) or other multi-GPU configurations. A manifest sketch is shown after this list.
- **Key Arguments:**
  - `--tensor-parallel-size=4`: Enables distributed inference by sharding the model across 4 GPUs.
  - `--max-model-len=24300`: Caps the model's context length (prompt plus generated tokens) at 24,300 tokens.
  - `--dtype=float`: Runs the model in FP32 (`float` is an alias for `float32` in vLLM), which avoids bfloat16, a type T4 GPUs do not support.
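A sketch of the multi-GPU variant follows, assuming the same skeleton as the single-GPU example above. As before, the deployment name, image, model ID, and port are assumptions; the `/dev/shm` volume is also an assumption, added because NCCL commonly needs more shared memory than the 64Mi pod default for tensor-parallel communication.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-multi-gpu               # assumed name; use the name from the repo's manifest
  namespace: genai-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-multi-gpu
  template:
    metadata:
      labels:
        app: vllm-multi-gpu
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model=<model_id>"         # placeholder: the Hugging Face model to serve
            - "--tensor-parallel-size=4"   # shard the model across all 4 GPUs
            - "--max-model-len=24300"
            - "--dtype=float"              # FP32; T4 GPUs do not support bfloat16
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: vllm-model-pull-hf
                  key: HUGGINGFACE_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 4            # all four T4s on Standard_NC64as_T4_v3
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: shm
              mountPath: /dev/shm          # NCCL uses shared memory for inter-GPU communication
      volumes:
        - name: shm
          emptyDir:
            medium: Memory                 # back /dev/shm with RAM; the 64Mi default is too small
```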
### Hugging Face Token Secret

Before deploying, create a Kubernetes secret to store your Hugging Face token. This token is required to pull the model during container initialization. A reference manifest is available in this repository: https://github.com/palash-fin/vLLM_Deploy_AKS/blob/main/secret_hf.yaml

```bash
kubectl create secret generic vllm-model-pull-hf \
  --from-literal=HUGGINGFACE_TOKEN=<your_token> \
  -n genai-deployment
```
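Note that the command above requires the `genai-deployment` namespace to already exist (`kubectl create namespace genai-deployment`). Once the secret is created, you can sanity-check it before applying the deployments; `kubectl describe` lists the secret's keys without printing the token value:

```bash
# List the secret's keys and sizes; the token value itself is not printed
kubectl describe secret vllm-model-pull-hf -n genai-deployment
```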