Support k8s gateway API inference extensions #423

Open
3 tasks
yanavlasov opened this issue Feb 25, 2025 · 21 comments
Labels: api (Control Plane API), enhancement (New feature or request)

@yanavlasov

yanavlasov commented Feb 25, 2025

Support k8s gateway API inference extensions

  • Translate k8s gateway API inference extensions CRDs into Envoy configuration
  • Add e2e tests
  • TBD

Design notes:

  • The inference extension maps an HTTPRoute to multiple InferencePools. An InferencePool uses labels to select the pods for the pool and specifies the port number to use. There is a note saying a service selector can be used, but it is not clear how that would work (a well-known label?).
  • The InferenceModel determines which pool is going to be used for serving a request. Presently it does so by using the model value from the request, assuming requests use the OpenAI schema.
  • The inference extension spec allows offloading model and endpoint selection to a remote service that implements the ext_proc protocol. This is so operators can implement their proprietary business logic for selecting both the model and the individual endpoints for serving the request.
  • In the most generic case, request processing requires two ext_proc callouts:
    1. A first callout to select the model, and hence the inference pool for that model, based on the contents of the body and other requirements (e.g. cost, priority, etc.).
    2. A second callout to select an endpoint from the inference pool, in order to optimize resource utilization for serving requests. Note that in the generic case the endpoint picking extension can be specific to the pool (hence the requirement for a dedicated callout).
  • The two steps above can be collapsed into a single callout if the HTTPRoute maps to only one pool, external endpoint selection is not needed, or the callout service can select endpoints for any inference pool.
  • The external model/pool picker uses a header (TBD: get the public specification for the header name) to select the route/cluster corresponding to the inference pool.
  • The external endpoint picker uses either metadata or a well-defined header to specify the primary endpoint. See the proposal. Note that fallback endpoints (for retries) will be added to that proposal shortly.
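
For concreteness, the two CRDs above can be sketched as follows. Field names follow the early v1alpha API of kubernetes-sigs/gateway-api-inference-extension and may change; all resource names, labels, and the picker service are made up:

```yaml
# Pool of model-server pods, selected by label, plus the endpoint picker
# (ext_proc) extension attached to the pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool                  # hypothetical name
spec:
  selector:
    app: llama-server               # label-based pod selection
  targetPortNumber: 8000            # port the model servers listen on
  extensionRef:
    name: llama-endpoint-picker     # ext_proc endpoint picker service
---
# Maps the OpenAI-style "model" value in the request to the pool above.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-3-8b
spec:
  modelName: llama-3-8b             # matched against the request's model field
  criticality: Critical
  poolRef:
    name: llama-pool
```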

Reference implementation for external endpoint picker: https://github.com/kubernetes-sigs/gateway-api-inference-extension/

Implementation details and questions:

  1. To support the proposed way of specifying endpoints, the cluster either needs to use subsets (a subset for each endpoint) or a specific LB policy.
    • @yanavlasov is to open-source an LB policy to Envoy that will support endpoint picking based on the proposed endpoint picker protocol.
    • It is unclear whether Envoy Gateway supports generating endpoint assignments with subsets for each endpoint.
  2. Can Envoy Gateway generate endpoint assignments for the cluster based on Pod labels, or does it need a Service ref?
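
As a sketch of the "specific LB policy" alternative to per-endpoint subsets, Envoy can honor a picker-chosen endpoint via an ORIGINAL_DST cluster that reads the destination from a request header. The cluster name here is made up, and the header name is only illustrative, since per the design notes the publicly specified name was still TBD:

```yaml
# Envoy cluster that forwards to whatever ip:port the endpoint-picker
# callout wrote into the override header.
clusters:
- name: inference-pool-original-dst   # hypothetical cluster name
  type: ORIGINAL_DST
  lb_policy: CLUSTER_PROVIDED
  connect_timeout: 5s
  original_dst_lb_config:
    use_http_header: true
    http_header_name: x-gateway-destination-endpoint   # illustrative name
```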

Possible iteration steps:

  1. Consume inference extension CRDs and generate the same configuration as today. This will give us external model/pool selection, but still Envoy internal endpoint selection.
    • Need to resolve the issue of InferencePool using labels to specify pods.
  2. Open source LB policy to support external endpoint selection in Envoy (parallel to step 1).
  3. Add config (TBD) to use remote endpoint picker - this will use two callouts per request.
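
The model/pool selection of iteration step 1 (the first callout in the design notes) can be sketched in Go. This is only an illustration: the request is assumed to follow the OpenAI schema as noted above, and the model-to-pool table and all names are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// poolForModel parses an OpenAI-style request body, reads its "model"
// field, and maps it to the name of the InferencePool serving that model.
func poolForModel(body []byte, pools map[string]string) (string, error) {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	pool, ok := pools[req.Model]
	if !ok {
		return "", fmt.Errorf("no InferencePool serves model %q", req.Model)
	}
	return pool, nil
}

func main() {
	pools := map[string]string{"llama-3-8b": "llama-pool"} // hypothetical table
	pool, err := poolForModel([]byte(`{"model":"llama-3-8b"}`), pools)
	if err != nil {
		panic(err)
	}
	fmt.Println(pool) // llama-pool
}
```
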
@yanavlasov yanavlasov added the enhancement New feature or request label Feb 25, 2025
@yanavlasov yanavlasov self-assigned this Feb 25, 2025
@missBerg missBerg added this to the v0.2.0 milestone Feb 25, 2025
@missBerg missBerg added the api Control Plane API label Feb 25, 2025
@daixiang0
Member

@yanavlasov @envoyproxy/ai-gateway-maintainers I see EG also wants to implement this (see https://gateway-api-inference-extension.sigs.k8s.io/implementations/#envoy-gateway), so will we have two implementations of this?

@arkodg

arkodg commented Feb 26, 2025

@daixiang0 the preference is to implement this in Envoy AI Gateway first, get user feedback, iterate on the API until it becomes standard, and then also directly support the implementation in Envoy Gateway

@kfswain

kfswain commented Mar 7, 2025

We've worked with Envoy GW quite a bit, it's been great. Happy to keep working with y'all to get these projects working together! Is it easiest if I just pop in on the community meeting on Thursday?

@arkodg

arkodg commented Mar 7, 2025

thanks for picking this one up @kfswain ! you're familiar with the translation so this should be easy for you 😄
the community meeting is a great place to bring your questions

hey @mathetake can you point Kellen to the reconciliation and translation areas

@mathetake
Member

@yanavlasov said he will start working on this next week, so could you coordinate with Yan, @kfswain? (i assume you both are working at Google)

@kfswain

kfswain commented Mar 7, 2025

SGTM!

@mathetake
Member

Hi @yanavlasov @kfswain, luckily I have cycles to work on this from tomorrow, so I will do the initial work soon and might ask you guys for help whenever I need it. SG?

@kfswain

kfswain commented Mar 12, 2025

That sounds great. Happy to help

@mathetake
Member

now i get the big picture of how to implement this here... i have one question @kfswain: may i ask where the conformance test suite is? i couldn't find one in the repo

@yuzisun
Contributor

yuzisun commented Mar 13, 2025

Let's have a design doc first; this certainly changes the scope of the envoy ai gateway project.

@mathetake
Member

mathetake commented Mar 13, 2025

@yuzisun certainly, i will write it up by noon tomorrow

@mathetake
Member

mathetake commented Mar 13, 2025

I opened a PR for the high level doc: #492, and here are a couple of notes from my end and from offline discussion:

  • We will allow AIServiceBackend.BackendRef to reference InferencePool.
  • One of the requirements is to do a fallback between multiple (external) services. According to Yan, the documentation on InferencePool.Selector says that it can be a service label. So, the initial implementation is to support Service, and we will also expand it to support external domains as well. In other words, I will rule out Pod support from the initial iteration.
    * That means, initially we won't support InferenceModel either.
  • The dynamic-metadata-based LB/fallback is a necessary component in envoyproxy/envoy and that's Yan's current focus; meanwhile, I am scaffolding the controller as well as the integration with extproc.
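
To make the first bullet concrete, a hypothetical AIServiceBackend whose BackendRef points at an InferencePool could look like the sketch below; field names are my reading of the Envoy AI Gateway and GIE APIs at the time and may change, and all names are made up:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: llama-backend               # hypothetical name
spec:
  schema:
    name: OpenAI                    # API schema spoken by the backend
  backendRef:                       # reference to the InferencePool
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: llama-pool
```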

@kfswain

kfswain commented Mar 13, 2025

That means, initially we won't support InferenceModel either.

That should be fine; a user will still need an InferenceModel (the EPP reads and uses infModels). Other gateway implementations are not reconciling on the InferenceModel either.

may i ask where's the conformance test suite?

We should have a doc detailing our conformance testing available next week

@mathetake

This comment has been minimized.

@mathetake
Member

it looks like i need to read the reference impl more

@mathetake
Member

ok, i was misunderstanding the InferenceModel's role and i think it makes sense now. sorry @kfswain, ignore my comment

@mathetake
Member

I discussed this with @yuzisun offline and we are putting the development on hold until we see the actual Envoy-side implementation of the LB policy by @yanavlasov, to know whether or not the Bloomberg use case is covered by it

@mathetake mathetake modified the milestones: v0.2.0, v0.3.0 Mar 20, 2025
@mathetake mathetake self-assigned this Mar 20, 2025
mathetake added a commit that referenced this issue Mar 20, 2025
**Commit Message**

This adds a proposal doc on the support for Gateway API Inference
Extension in the Envoy AI Gateway project. This involves a change of the
project scope, and we need to make sure that the existing API layer will
co-exist nicely with the GAIE.


**Related Issues/PRs (if applicable)**

Preliminary to #423

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Erica Hughberg <[email protected]>
Co-authored-by: Erica Hughberg <[email protected]>
@mathetake
Member

merged the proposal doc: #492 !!

From tomorrow, i will resume the PoC work in #493 and try to make it work with minimal functionality by the end of next Wednesday or Thursday

@mathetake
Member

Update on the impl: control plane 100% done; extproc 80% done for the MVP

mathetake added a commit that referenced this issue Apr 2, 2025
**Commit Message**

This commit scaffolds the foundation for the Inference Extension API
[1]. The design documentation was merged in #492. The controller needs
to be started with `--enableInferenceExtension=true` so as not to break
existing controller deployments where the Inference Extension CRDs are
not installed.

This commit doesn't implement the actual "metrics-aware" load balancing;
instead it just does random routing over the given (resolved) endpoints.
Follow-up implementations will add more advanced algorithms while
expanding the metrics interface, which currently only provides the
setter APIs.

The summary of the implementation is:
* Added a `kind` field to AIGatewayRouteRuleBackendRef so that it can
reference InferencePool.
* InferencePool.Spec.Selector is allowed to select multiple
AIServiceBackends.
* When building up the extproc config via filterapi.Config, the
controller reads the referenced InferencePool and its bound
InferenceModels, and groups them together into a single
filterapi.DynamicLoadBalancing configuration.
* When the extproc loads a configuration containing
DynamicLoadBalancing, it resolves all the IP addresses for the hostnames
belonging to the DynamicLoadBalancing. The presence of
DynamicLoadBalancing in the config forces the config watcher to reload
and refresh the config regardless of updates. That way, the list of IP
addresses will always be updated (eventually consistent anyway) in a
non-hot path.
* On the request path, the ChatCompletionProcessor checks for the
existence of a DynamicLoadBalancing config for the backend selected by
the router. If present, it further tries to resolve the ip:port level
endpoint selection.
* The selected ip:port is set in a special header that is then routed to
ORIGINAL_DST.
* The ORIGINAL_DST cluster is added by the EG extension server
implementation. The extension server also modifies some routes to
properly route to that cluster.

1: https://github.com/kubernetes-sigs/gateway-api-inference-extension

**Related Issues/PRs (if applicable)**

Built on #492
Contributes to #423

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake
Member

mathetake commented Apr 25, 2025

I wanted to post an update on the current status of the implementation. Our initial foundational work landed in #493 a few weeks ago. With that, it works at the level of consuming InferencePool/InferenceModel. On the other hand, the initial PR left lots of TODOs; most importantly, the load balancing logic is missing.

At the community meeting two weeks ago, @yuzisun @sivanantha321 shared a proposal to address these TODOs: "Metrics-Based Load Balancing for Envoy AI Gateway Inference Extension". The suggested implementation was to implement the load balancing / metrics scraping logic directly in our extproc. After the discussion, we came to the agreement that we should decouple the "endpoint picker" logic instead of implementing it directly, i.e. we should allow users to bring in their own extensions to perform the load balancing logic. To do so, we need:

  1. An additional configuration knob to specify the endpoint picker extproc to be inserted.
  2. A relatively large refactoring of the original "feat: initial implementation of Inference Extension" #493, where we basically remove all the changes around our extproc code and add additional control plane code to load the BYO extproc.

My only remaining concern about this direction is how this will play well with the existing features such as transformation, auth and metrics, etc.

Personally, i am not currently working on this implementation, and I believe @sivanantha321 and @yuzisun will take on the task.

cc @envoyproxy/ai-gateway-maintainers

@mathetake
Member

mathetake commented May 2, 2025

so now i am doing a relatively large (internal) refactoring in #599, primarily to add support for Envoy Gateway's fallback/priority/retry compatibility. That means we can allow AIServiceBackend-level fallbacks using EG's native API instead of our own. It will affect the initial design we landed in #493, but in a good way: now i have an idea of how to solve this BYO endpoint picker matter. I will work on it after #599, which temporarily disables Inference Extension support (but on the main branch, so i can fix it before the next release)

7 participants