
[v0.1] Add Semantic Route Support #145

@Xunzhuo


Background

Currently, the global configuration contains static configs, including feature gates for the semantic cache, prompt guard, reasoning families, and categories.

At the same time, it contains dynamic configuration such as model configuration, endpoint configuration, and routing algorithms.

This proposal designs a SemanticRoute resource that abstracts this dynamic routing information, so it can be used locally and also served as a CRD that is listed/watched in Kubernetes.

This greatly improves maintainability and UX. Moreover, experiments show that different models behave differently in reasoning mode, so introducing SemanticRoute helps optimize model-level routing decisions, such as which categories have reasoning on or off.

Goals

  • Semantic Route Design: design the next-gen routing abstraction for LLM semantic routing.
  • Multiple Environments Support: work in multiple environments, such as local and Kubernetes.
  • Easy to Extend: pluggable filter design for matching requests, leaving room for future iteration.

What was before?

Before LLMs, routing was essentially bound to the TCP/IP protocol stack, as in the Gateway API:

  • L7 protocol: HTTPRoute, GRPCRoute...
  • L4 protocol: TCPRoute, UDPRoute...

For L7 protocols, taking HTTPRoute as an example, the main concept is to match a rule and route to specific backends.


Here is an example that routes to a backend by matching hostname, path, and header:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: foo-route
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "foo.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /login
    - headers:
      - type: Exact
        name: env
        value: canary
    backendRefs:
    - name: foo-svc
      port: 8080

This works well with traditional workloads, but it falls short for LLM traffic.

What is different now?

vLLM Semantic Router introduces a very different view of routing to LLM workloads. It does not match protocol-level elements such as the headers, path, and hostname in HTTP, or the source/destination IPs in TCP.

Instead, it aims to match the intent contained in the request, with an extensible filter chain for request control. Some examples of SemanticRoute are listed below. Besides intent understanding, it also supports powerful filters such as:

  • PIIDetection enables PII detection and filtering
  • PromptGuard enables prompt security and jailbreak detection
  • SemanticCache enables semantic caching for performance optimization
  • ReasoningControl enables reasoning mode control
  • ToolSelection enables automatic tool selection based on semantic similarity
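The filter list above suggests a pluggable chain where each filter inspects or annotates the request in order. As a minimal sketch of that idea (the `RouteFilter`, `RequestContext`, and `run_chain` names are hypothetical, and the PromptGuard scoring here is a trivial placeholder, not the real classifier):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    """Hypothetical per-request state passed along the filter chain."""
    prompt: str
    blocked: bool = False
    annotations: dict = field(default_factory=dict)

class RouteFilter(ABC):
    """Hypothetical interface each SemanticRoute filter type would implement."""
    @abstractmethod
    def apply(self, ctx: RequestContext) -> RequestContext: ...

class PromptGuard(RouteFilter):
    def __init__(self, threshold: float):
        self.threshold = threshold

    def apply(self, ctx: RequestContext) -> RequestContext:
        # Placeholder jailbreak score; a real implementation would run a classifier.
        score = 1.0 if "ignore previous instructions" in ctx.prompt.lower() else 0.0
        if score >= self.threshold:
            ctx.blocked = True
            ctx.annotations["prompt_guard"] = score
        return ctx

def run_chain(filters: list[RouteFilter], ctx: RequestContext) -> RequestContext:
    """Apply filters in order, short-circuiting once a filter blocks the request."""
    for f in filters:
        ctx = f.apply(ctx)
        if ctx.blocked:
            break
    return ctx
```

The chain stops at the first filter that blocks, which keeps later (potentially expensive) filters from running on rejected requests.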

Simple SemanticRoute with intent detection

Here is what a SemanticRoute looks like: it matches the math and computer science categories and routes to an LLM model; if no intent matches, it falls back to the default model:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: reasoning-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
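The match-or-fallback behavior above can be sketched in a few lines (the `ModelRef`/`Rule` dataclasses and `route` helper are illustrative names, not the actual implementation; the sketch assumes a single classified category per request):

```python
from dataclasses import dataclass

@dataclass
class ModelRef:
    model_name: str
    address: str
    port: int

@dataclass
class Rule:
    categories: list          # intent categories this rule matches
    model_refs: list          # targets when an intent matches
    default_model: ModelRef   # fallback when no intent matches

def route(rule: Rule, detected_category: str) -> ModelRef:
    """Route to the first modelRef when the classified category matches an
    intent of the rule; otherwise fall back to defaultModel."""
    if detected_category in rule.categories:
        return rule.model_refs[0]
    return rule.default_model
```

So a "math" request lands on gpt-oss, while anything outside the listed categories falls through to deepseek-v31.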

Complex SemanticRoute within Filter Chain

Here is a SemanticRoute that matches the math and computer science categories and routes to an LLM model, falling back to the default model when no intent matches, while enabling the full filter chain (PIIDetection, PromptGuard, SemanticCache, ToolSelection, and ReasoningControl) on this route:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: complex-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: PIIDetection
      allowByDefault: false
      pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
    - type: PromptGuard
      threshold: 0.7
    - type: SemanticCache
      similarityThreshold: 0.8
      maxEntries: 1000
      ttlSeconds: 3600
    - type: ToolSelection
      similarityThreshold: 0.8
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1

Multiple SemanticRoute

Here is an example with multiple SemanticRoutes: one matches the math and computer science categories and routes to a powerful LLM model with reasoning on; the other matches the creative and other categories and routes to a lightweight LLM model with reasoning off.

Non-reasoning Route:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: lightweight-route
spec:
  rules:
  - intents:
    - category: "creative"
    - category: "other"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: false
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1

Reasoning Route:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: reasoning-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1

Mixed Reasoning SemanticRoute

Here is a mixed SemanticRoute: a single resource whose first rule matches the math and computer science categories and routes to a powerful LLM model with reasoning on, and whose second rule matches the creative and other categories and routes to a lightweight LLM model with reasoning off.

apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: mixed-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
  - intents:
    - category: "creative"
    - category: "other"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: false
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
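With several rules in one SemanticRoute, the router must pick which rule applies. A minimal sketch, assuming first-match semantics over the rule list (the `select_rule` helper and dict-based rule shape are illustrative, not the real API):

```python
def select_rule(rules, detected_category):
    """Return the first rule whose intent categories contain the detected
    category, or None when no rule matches (assumed first-match semantics)."""
    for rule in rules:
        if detected_category in rule["categories"]:
            return rule
    return None

# Condensed form of the mixed-route rules above.
mixed_route_rules = [
    {"categories": ["computer science", "math"], "model": "gpt-oss", "reasoning": True},
    {"categories": ["creative", "other"], "model": "gpt-oss", "reasoning": False},
]
```

Under this sketch, a "math" request gets the reasoning-on rule and a "creative" request gets the reasoning-off rule, while a category outside both lists matches nothing.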

Multiple Weighted ModelRef SemanticRoute

Here is an example of a SemanticRoute with multiple weighted modelRefs; traffic is split between the models according to their weights:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: weighted-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
      weight: 80
    - modelName: qwen3
      port: 8089
      address: 127.0.0.1
      weight: 20
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
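The weighted split above can be sketched as weighted random selection (the `pick_weighted` helper is illustrative; the actual router may use a different load-balancing scheme):

```python
import random

def pick_weighted(model_refs, rng=random):
    """Pick one modelRef with probability proportional to its weight."""
    weights = [ref["weight"] for ref in model_refs]
    return rng.choices(model_refs, weights=weights, k=1)[0]

# The 80/20 split from the example above.
refs = [
    {"modelName": "gpt-oss", "weight": 80},
    {"modelName": "qwen3", "weight": 20},
]
```

Over many requests, roughly 80% land on gpt-oss and 20% on qwen3.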

Implementation

  • Semantic Route API Design
  • Add Local Support
  • Add Kubernetes Support
  • Add E2E Test
  • Add User Facing Docs
