feat: propose gpu and cache aware controller design #319
Conversation
Signed-off-by: Huamin Chen <[email protected]>
```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceGateway
```
Is this going to be a GW proxy built by prod-stack?
The idea here is to use the existing Production Stack for an in-cluster gateway PoC (since there is no extproc integration yet). If this moves to another environment, i.e. the Kubernetes Gateway, an Envoy-proxy-based setup with the existing GatewayClass API is potentially the way to go.
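For concreteness, a minimal sketch of what the Envoy-based variant could look like using the upstream Gateway API. The resource names, and the assumption that Envoy Gateway is the implementation behind the controllerName, are illustrative only and not part of this proposal:

```yaml
# Illustrative only: standard Gateway API resources that an Envoy-based
# implementation (assumed here: Envoy Gateway) would reconcile.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-inference            # example name
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway                # example name
spec:
  gatewayClassName: envoy-inference
  listeners:
    - name: http
      protocol: HTTP
      port: 80
```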
```yaml
spec:
  modelName: "llama3-70b"
  topologyHint:
    nodeSelector:
```
Where will this nodeSelector be used? Is the plan that the controller will actually create the Deployment and inject this selector? If so, what about all the other parameters in both DeploymentSpec and PodSpec? How can the user set them?
nodeSelector provides an informed estimate of how to best utilize the underlying network topology for low-latency GPU-to-GPU transfer.
This is a minimal set of requirements; more details certainly need to be investigated.
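As a rough illustration of the intent (the label keys and values below are examples, not part of the proposal), the hint would be copied by the controller into the pod template of the workload it generates:

```yaml
# Hypothetical sketch; label keys and values are examples only.
topologyHint:
  nodeSelector:
    topology.kubernetes.io/zone: "us-east-1a"         # well-known zone label: keep replicas co-located
    nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"   # example GPU Feature Discovery label
# The controller would copy these key/value pairs into
# spec.template.spec.nodeSelector of the Deployment it creates.
```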
```yaml
  sdRef: "sd-llama3-chatbot"
```
### SpeculativeDecoding
Why are we adding Speculative Decoding to the InferenceCache CRD? Speculative Decoding is unrelated to KV cache if I understand correctly.
Does the cache service need to know about the speculative decoding configuration in your design?
Right, SD is not strictly a KV cache optimization; suggestions for better terms are appreciated.
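For reference, one possible shape for the block under discussion. The field names are illustrative only, loosely mirroring vLLM's draft-model speculative decoding options, and could sit under a broader name if SD is not considered a cache concern:

```yaml
# Illustrative only; these field names are not part of the current proposal.
speculativeDecoding:
  enabled: true
  draftModelRef: "sd-llama3-chatbot"   # matches the sdRef used above
  numSpeculativeTokens: 4              # draft tokens proposed per verification step
```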
## Overview

The Inference Controller is a Kubernetes operator that manages the routing and orchestration of vLLM inference services. It provides a declarative way to manage model serving, caching, and routing in a Kubernetes environment.
A higher level question: Will this new API and controller design replace the Helm chart we currently use for deployment?
Another higher level question I have is: will this new API design mean major changes to how our current router is implemented?
Quite the opposite :D I am trying to reuse the existing Helm chart as the blueprint for the new API design.
```yaml
  name: llm-gateway
spec:
  schedulingPolicy: "gpu-load-aware"  # Use GPU metrics for routing
  routeStrategy: "PrefixHash"         # Enable prefix-based routing for cache affinity
```
Are these "gpu-load-aware" and "PrefixHash" two different algorithms for routing? If so, should we keep only one of them in this CRD?
sounds good, will fix.
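One way the fix could look (field names purely illustrative, not in the current spec): keep a single routing policy with one primary strategy, and treat the GPU metric as an optional secondary signal rather than a competing algorithm.

```yaml
# Hypothetical sketch of a merged routing policy.
spec:
  routingPolicy:
    strategy: "PrefixHash"         # primary decision: prefix hashing for KV cache affinity
    loadSignal: "gpu-utilization"  # optional tie-breaker when several replicas share a prefix
```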
```yaml
  name: llama3-chatbot-cache
spec:
  modelName: "llama3-70b"
  kvCacheTransferPolicy:
```
What is this kvCacheTransferPolicy referring to specifically? It sounds like an eviction policy, given fields such as evictionThreshold. If so, we should use another name for it.
The idea is to set a KV cache transfer policy; the fields are really based on my prior cache knowledge. If they don't make sense here, please share more thoughts.
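To make the naming question concrete, one possible split (all field names and values illustrative) that keeps transfer and eviction as separate policies, so evictionThreshold no longer sits under a "transfer" name:

```yaml
# Hypothetical sketch; none of these fields are final.
spec:
  kvCacheTransferPolicy:
    mode: "p2p"                # how KV blocks move between instances (e.g. prefill to decode)
    maxBandwidthGbps: 100      # cap on transfer bandwidth
  kvCacheEvictionPolicy:
    strategy: "LRU"
    evictionThreshold: 0.8     # start evicting above 80% cache utilization
```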
Since there are many ways to set up vLLM and the router with different routing, cache, and NLP filters, I believe it is time to expose such practices as Kubernetes CRDs so that we can use controllers to orchestrate the entire e2e flow. Although the existing Helm charts can fulfill certain purposes (i.e. bootstrapping vLLM and the router), they lack intelligence in scheduling, scaling, and failover.
I understand that some of these concepts are covered in projects such as the Kubernetes Gateway API (inference extension) and others; it would be great if we could sync on these and arrive at a reusable flow.
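For illustration, the hunks quoted in this conversation could be assembled into an end-to-end pair of custom resources roughly as follows. The values are examples, and whether all of these fields live on InferenceGateway or are split across CRDs is exactly what is under review here:

```yaml
# Assembled from the fragments quoted above; layout subject to the review comments.
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceGateway
metadata:
  name: llm-gateway
spec:
  modelName: "llama3-70b"
  schedulingPolicy: "gpu-load-aware"   # to be merged with routeStrategy per the review
  routeStrategy: "PrefixHash"
  topologyHint:
    nodeSelector: {}                   # optional placement hint, see discussion above
  sdRef: "sd-llama3-chatbot"
---
apiVersion: production-stack.vllm.ai/v1alpha1
kind: InferenceCache
metadata:
  name: llama3-chatbot-cache
spec:
  modelName: "llama3-70b"
  kvCacheTransferPolicy: {}            # naming under discussion, see above
```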