
feat: propose gpu and cache aware controller design #319

Open · wants to merge 1 commit into main

Conversation

@rootfs (Contributor) commented Mar 24, 2025

Since there are many ways to set up vLLM and the router with different routing, cache, and NLP filters, I believe it is time to expose such practices as Kubernetes CRDs so that we can use controllers to orchestrate the entire end-to-end flow. Although the existing Helm charts can fulfill certain purposes (e.g. bootstrapping vLLM and the router), they lack intelligence in scheduling, scaling, and failover.

I understand that some of these concepts are covered in projects such as the Kubernetes Gateway API (inference extension) and others; it would be great if we could sync on these and arrive at a reusable flow.


```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceGateway
```

Is this going to be a GW proxy built by prod-stack?

Contributor Author

The idea here is to use the existing Production Stack for an in-cluster gateway PoC (since there is no extproc integration yet). If this moves to another environment, e.g. the Kubernetes Gateway API, an Envoy-proxy-based setup with the existing GatewayClass API is potentially the way to go.
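
For reference, a minimal sketch of what that Envoy-based alternative could look like with the upstream Gateway API; the class name and controllerName below are hypothetical placeholders, not part of this proposal:

```yaml
# Sketch only: Gateway API objects an Envoy-based deployment might use.
# The class name and controllerName are hypothetical placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: vllm-envoy                                        # hypothetical
spec:
  controllerName: vllm.ai/inference-gateway-controller    # hypothetical
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  gatewayClassName: vllm-envoy
  listeners:
    - name: http
      protocol: HTTP
      port: 80
```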

```yaml
spec:
  modelName: "llama3-70b"
  topologyHint:
    nodeSelector:
```

Where will this nodeSelector be used? Is the plan that the controller will actually create the Deployment and inject this selector? If so, what about all the other parameters in both DeploymentSpec and PodSpec? How can the user set them?

Contributor Author

nodeSelector provides an informed estimate of how best to utilize the underlying network topology for low-latency GPU-to-GPU transfer.

This is a minimal set of requirements; more details surely need to be investigated.
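
As a rough illustration of the intent, the topologyHint could carry node labels that steer replicas onto GPU nodes sharing a fast interconnect; the label keys and values below are hypothetical examples, not a fixed schema:

```yaml
spec:
  modelName: "llama3-70b"
  topologyHint:
    nodeSelector:
      # Hypothetical labels: pin replicas to a GPU type and to nodes in the
      # same NVLink/NVSwitch domain for fast GPU-to-GPU transfer.
      nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      example.vllm.ai/interconnect-domain: "domain-0"
```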

```yaml
  sdRef: "sd-llama3-chatbot"
```

### SpeculativeDecoding
Collaborator

Why are we adding Speculative Decoding to the InferenceCache CRD? Speculative Decoding is unrelated to KV cache if I understand correctly.
Does the cache service need to know about the speculative decoding configuration in your design?

Contributor Author

Right, SD is not strictly a KV cache optimization; suggestions for better terms are appreciated.
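
One option, sketched here only to anchor the naming discussion, is to pull speculative decoding into its own object and keep just the sdRef pointer in the cache CRD; the kind and every field name below are placeholders:

```yaml
# Sketch only: a standalone SpeculativeDecoding object that sdRef could point
# to, keeping InferenceCache focused on KV cache concerns. Names are placeholders.
apiVersion: production-stack.vllm.ai/v1alpha1
kind: SpeculativeDecoding
metadata:
  name: sd-llama3-chatbot          # matches the sdRef value quoted above
spec:
  draftModel: "llama3-8b"          # hypothetical draft-model field
  numSpeculativeTokens: 5          # hypothetical lookahead length
```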


## Overview

The Inference Controller is a Kubernetes operator that manages the routing and orchestration of vLLM inference services. It provides a declarative way to manage model serving, caching, and routing in a Kubernetes environment.
Collaborator

A higher level question: Will this new API and controller design replace the Helm chart we currently use for deployment?


Collaborator

Another higher level question I have is: will this new API design mean major changes to how our current router is implemented?

Contributor Author

Quite the opposite :D I am trying to reuse the existing Helm chart as the blueprint for the new API design.

```yaml
name: llm-gateway
spec:
  schedulingPolicy: "gpu-load-aware" # Use GPU metrics for routing
  routeStrategy: "PrefixHash"        # Enable prefix-based routing for cache affinity
```
Collaborator

Are these "gpu-load-aware" and "PrefixHash" two different algorithms for routing? If so, should we keep only one of them in this CRD?

Contributor Author

Sounds good, will fix.
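
To make the suggested fix concrete, one possibility is to collapse the two fields into a single strategy selector; this is only a sketch, and the field name and enum values are placeholders:

```yaml
spec:
  # Sketch only: a single field selects the routing algorithm instead of two
  # overlapping ones. The field name and values are placeholders.
  routingStrategy: "prefix-hash"   # alternatives: "gpu-load-aware", "round-robin"
```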

```yaml
name: llama3-chatbot-cache
spec:
  modelName: "llama3-70b"
  kvCacheTransferPolicy:
```
Collaborator

What is this kvCacheTransferPolicy referring to specifically? It sounds like an eviction policy, given the evictionThreshold field. If so, we should use another name for this.

Contributor Author

The idea is to set a KV cache transfer policy. The fields are largely based on my prior cache knowledge; if they don't make sense here, please share more thoughts.
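
To make the naming question concrete, one option is to split the two concerns so that transfer (offloading KV blocks between tiers or instances) is separate from local eviction; all field names and values below are placeholders:

```yaml
spec:
  modelName: "llama3-70b"
  # Sketch only: keep transfer/offload settings separate from eviction so
  # neither name is overloaded. Field names and values are placeholders.
  kvCacheTransferPolicy:
    offloadTarget: "cpu"         # e.g. CPU memory or a remote cache tier
    transferThreshold: "0.8"     # start offloading at 80% GPU KV cache usage
  kvCacheEvictionPolicy:
    strategy: "lru"
    evictionThreshold: "0.9"     # evict once 90% of cache capacity is used
```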

@YuhanLiu11 (Collaborator)

@rootfs @ahg-g @kfswain I implemented an experimental version of the inference engine controller in PR #348, feel free to comment! By the way, this is only a design example (not the final design).
