feat: propose gpu and cache aware controller design #319
Conversation
Signed-off-by: Huamin Chen <[email protected]>
```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceGateway
```
Is this going to be a GW proxy built by prod-stack?
The idea here is to use the existing Production Stack for an in-cluster gateway PoC (since there is no extproc integration yet). If this moves to another environment, i.e. the Kubernetes Gateway, an Envoy-proxy-based setup with the existing GatewayClass API is potentially the way to go.
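For concreteness, a minimal sketch of what the Envoy-based variant could look like using the upstream Gateway API. The resource names, and the assumption that Envoy Gateway is the implementation behind the controllerName, are illustrative only and not part of this proposal:

```yaml
# Illustrative only: standard Gateway API resources that an Envoy-based
# implementation (assumed here: Envoy Gateway) would reconcile.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-inference            # example name
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway                # example name
spec:
  gatewayClassName: envoy-inference
  listeners:
    - name: http
      protocol: HTTP
      port: 80
```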
```yaml
spec:
  modelName: "llama3-70b"
  topologyHint:
    nodeSelector:
```
Where will this nodeSelector be used? Is the plan that the controller will actually create the Deployment and inject this selector? If so, what about all the other parameters in both DeploymentSpec and PodSpec? How can the user set them?
nodeSelector provides an informed estimate of how to best utilize the underlying network topology for low-latency GPU-to-GPU transfer.
This is a minimal set of requirements; more details certainly need to be investigated.
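As a rough illustration of the intent (the label keys and values below are examples, not part of the proposal), the hint would be copied by the controller into the pod template of the workload it generates:

```yaml
# Hypothetical sketch; label keys and values are examples only.
topologyHint:
  nodeSelector:
    topology.kubernetes.io/zone: "us-east-1a"         # well-known zone label: keep replicas co-located
    nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"   # example GPU Feature Discovery label
# The controller would copy these key/value pairs into
# spec.template.spec.nodeSelector of the Deployment it creates.
```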
```yaml
  sdRef: "sd-llama3-chatbot"
```
### SpeculativeDecoding
Why are we adding Speculative Decoding to the InferenceCache CRD? Speculative Decoding is unrelated to KV cache if I understand correctly.
Does the cache service need to know about the speculative decoding configuration in your design?
Right, SD is not strictly a KV cache optimization; suggestions for better terms are appreciated.
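For reference, one possible shape for the block under discussion. The field names are illustrative only, loosely mirroring vLLM's draft-model speculative decoding options, and could sit under a broader name if SD is not considered a cache concern:

```yaml
# Illustrative only; these field names are not part of the current proposal.
speculativeDecoding:
  enabled: true
  draftModelRef: "sd-llama3-chatbot"   # matches the sdRef used above
  numSpeculativeTokens: 4              # draft tokens proposed per verification step
```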
## Overview

The Inference Controller is a Kubernetes operator that manages the routing and orchestration of vLLM inference services. It provides a declarative way to manage model serving, caching, and routing in a Kubernetes environment.
A higher level question: Will this new API and controller design replace the Helm chart we currently use for deployment?
Another higher level question I have is: will this new API design mean major changes to how our current router is implemented?
Quite the opposite :D I am trying to reuse the existing Helm chart as the blueprint for the new API design.
```yaml
  name: llm-gateway
spec:
  schedulingPolicy: "gpu-load-aware"  # Use GPU metrics for routing
  routeStrategy: "PrefixHash"         # Enable prefix-based routing for cache affinity
```
Are these "gpu-load-aware" and "PrefixHash" two different algorithms for routing? If so, should we keep only one of them in this CRD?
sounds good, will fix.
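One way the fix could look (field names purely illustrative, not in the current spec): keep a single routing policy with one primary strategy, and treat the GPU metric as an optional secondary signal rather than a competing algorithm.

```yaml
# Hypothetical sketch of a merged routing policy.
spec:
  routingPolicy:
    strategy: "PrefixHash"         # primary decision: prefix hashing for KV cache affinity
    loadSignal: "gpu-utilization"  # optional tie-breaker when several replicas share a prefix
```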
```yaml
  name: llama3-chatbot-cache
spec:
  modelName: "llama3-70b"
  kvCacheTransferPolicy:
```
What is this kvCacheTransferPolicy referring to specifically? It sounds like an eviction policy, given fields such as evictionThreshold. If so, we should use another name for it.
The idea is to set a KV cache transfer policy; the fields are really based on my prior cache knowledge. If they don't make sense here, please share more thoughts.
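To make the naming question concrete, one possible split (all field names and values illustrative) that keeps transfer and eviction as separate policies, so evictionThreshold no longer sits under a "transfer" name:

```yaml
# Hypothetical sketch; none of these fields are final.
spec:
  kvCacheTransferPolicy:
    mode: "p2p"                # how KV blocks move between instances (e.g. prefill to decode)
    maxBandwidthGbps: 100      # cap on transfer bandwidth
  kvCacheEvictionPolicy:
    strategy: "LRU"
    evictionThreshold: 0.8     # start evicting above 80% cache utilization
```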
Since there are many ways to set up vLLM and the router with different routing, cache, and NLP filters, I believe it is time to expose such practices as Kubernetes CRDs so that we can use controllers to orchestrate the entire e2e flow. Although the existing Helm charts can fulfill certain purposes (i.e. bootstrapping vLLM and the router), they lack intelligence in scheduling, scaling, and failover.
I understand that some of these concepts are covered in projects such as the Kubernetes Gateway API (inference extension) and others; it would be great if we could sync on these and arrive at a reusable flow.
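For illustration, the hunks quoted in this conversation could be assembled into an end-to-end pair of custom resources roughly as follows. The values are examples, and whether all of these fields live on InferenceGateway or are split across CRDs is exactly what is under review here:

```yaml
# Assembled from the fragments quoted above; layout subject to the review comments.
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceGateway
metadata:
  name: llm-gateway
spec:
  modelName: "llama3-70b"
  schedulingPolicy: "gpu-load-aware"   # to be merged with routeStrategy per the review
  routeStrategy: "PrefixHash"
  topologyHint:
    nodeSelector: {}                   # optional placement hint, see discussion above
  sdRef: "sd-llama3-chatbot"
---
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceCache
metadata:
  name: llama3-chatbot-cache
spec:
  modelName: "llama3-70b"
  kvCacheTransferPolicy: {}            # naming under discussion, see above
```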