Support k8s gateway API inference extensions #423

Open
3 tasks
yanavlasov opened this issue Feb 25, 2025 · 21 comments
Labels: api (Control Plane API), enhancement (New feature or request)

@yanavlasov

yanavlasov commented Feb 25, 2025

Support k8s gateway API inference extensions

  • Translate k8s gateway API inference extensions CRDs into Envoy configuration
  • Add e2e tests
  • TBD

Design notes:

  • The inference extension maps an HTTPRoute to multiple InferencePools. An InferencePool uses labels to select the pods for the pool and specifies the port number to use. There is a note saying a service selector can be used, but it is not clear how that would work (a well-known label?).
  • The InferenceModel determines which pool is going to be used for serving a request. Presently it does so by using the model value from the request, assuming requests use the OpenAI schema.
  • The inference extension spec allows offloading model and endpoint selection to a remote service that implements the ext_proc protocol. This is so operators can implement their proprietary business logic for selecting both the model and the individual endpoints for serving the request.
  • In the most generic case, request processing requires two ext_proc callouts:
    1. A first callout to select the model, and hence the inference pool for that model, based on the contents of the body and other requirements (e.g. cost, priority, etc.).
    2. A second callout to select an endpoint from the inference pool, in order to optimize resource utilization for serving requests. Note that in the generic case the endpoint picking extension can be specific to the pool (hence the requirement for a dedicated callout).
  • The two steps above can be collapsed into a single callout if the HTTPRoute maps to only one pool, external endpoint selection is not needed, or the callout service can select endpoints for any inference pool.
  • The external model/pool picker uses a header (TBD: get the public specification for the header name) to select the route/cluster corresponding to the inference pool.
  • The external endpoint picker uses either metadata or a well-defined header to specify the primary endpoint. See the proposal. Note that fallback endpoints (for retries) will be added to that proposal shortly.
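
For concreteness, the two CRDs above can be sketched as follows. Field names follow the early v1alpha API of kubernetes-sigs/gateway-api-inference-extension and may change; all resource names, labels, and the picker service are made up:

```yaml
# Pool of model-server pods, selected by label, plus the endpoint picker
# (ext_proc) extension attached to the pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool                  # hypothetical name
spec:
  selector:
    app: llama-server               # label-based pod selection
  targetPortNumber: 8000            # port the model servers listen on
  extensionRef:
    name: llama-endpoint-picker     # ext_proc endpoint picker service
---
# Maps the OpenAI-style "model" value in the request to the pool above.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-3-8b
spec:
  modelName: llama-3-8b             # matched against the request's model field
  criticality: Critical
  poolRef:
    name: llama-pool
```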

Reference implementation for external endpoint picker: https://github.com/kubernetes-sigs/gateway-api-inference-extension/

Implementation details and questions:

  1. To support the proposed way of specifying endpoints, the cluster either needs to use subsets (a subset for each endpoint) or a specific LB policy.
    • @yanavlasov is to open-source an LB policy to Envoy that will support endpoint picking based on the proposed endpoint picker protocol.
    • It is unclear whether Envoy Gateway supports generating endpoint assignments with subsets for each endpoint.
  2. Can Envoy Gateway generate endpoint assignments for the cluster based on Pod labels, or does it need a Service ref?
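
As a sketch of the "specific LB policy" alternative to per-endpoint subsets, Envoy can honor a picker-chosen endpoint via an ORIGINAL_DST cluster that reads the destination from a request header. The cluster name here is made up, and the header name is only illustrative, since per the design notes the publicly specified name was still TBD:

```yaml
# Envoy cluster that forwards to whatever ip:port the endpoint-picker
# callout wrote into the override header.
clusters:
- name: inference-pool-original-dst   # hypothetical cluster name
  type: ORIGINAL_DST
  lb_policy: CLUSTER_PROVIDED
  connect_timeout: 5s
  original_dst_lb_config:
    use_http_header: true
    http_header_name: x-gateway-destination-endpoint   # illustrative name
```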

Possible iteration steps:

  1. Consume inference extension CRDs and generate the same configuration as today. This will give us external model/pool selection, but still Envoy internal endpoint selection.
    • Need to resolve the issue of InferencePool using labels to specify pods.
  2. Open source LB policy to support external endpoint selection in Envoy (parallel to step 1).
  3. Add config (TBD) to use remote endpoint picker - this will use two callouts per request.
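
The model/pool selection of iteration step 1 (the first callout in the design notes) can be sketched in Go. This is only an illustration: the request is assumed to follow the OpenAI schema as noted above, and the model-to-pool table and all names are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// poolForModel parses an OpenAI-style request body, reads its "model"
// field, and maps it to the name of the InferencePool serving that model.
func poolForModel(body []byte, pools map[string]string) (string, error) {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	pool, ok := pools[req.Model]
	if !ok {
		return "", fmt.Errorf("no InferencePool serves model %q", req.Model)
	}
	return pool, nil
}

func main() {
	pools := map[string]string{"llama-3-8b": "llama-pool"} // hypothetical table
	pool, err := poolForModel([]byte(`{"model":"llama-3-8b"}`), pools)
	if err != nil {
		panic(err)
	}
	fmt.Println(pool) // llama-pool
}
```
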
@yanavlasov yanavlasov added the enhancement New feature or request label Feb 25, 2025
@yanavlasov yanavlasov self-assigned this Feb 25, 2025
@missBerg missBerg added this to the v0.2.0 milestone Feb 25, 2025
@missBerg missBerg added the api Control Plane API label Feb 25, 2025
@daixiang0
Member

@yanavlasov @envoyproxy/ai-gateway-maintainers I see EG also wants to implement this (see https://gateway-api-inference-extension.sigs.k8s.io/implementations/#envoy-gateway), so will we have two implementations of this?

@arkodg

arkodg commented Feb 26, 2025

@daixiang0 the preference is to implement this in Envoy AI Gateway first, get user feedback, iterate on the API until it becomes standard, and then also directly support the implementation in Envoy Gateway

@kfswain

kfswain commented Mar 7, 2025

We've worked with Envoy GW quite a bit, it's been great. Happy to keep working with y'all to get these projects working together! Is it easiest if I just pop in on the community meeting on Thursday?

@arkodg

arkodg commented Mar 7, 2025

thanks for picking this one up @kfswain ! you're familiar with the translation so this should be easy for you 😄
the community meeting is a great place to bring your questions

hey @mathetake can you point Kellen to the reconciliation and translation areas

@mathetake
Member

@yanavlasov said he will start working on this next week, so could you coordinate with Yan, @kfswain? (i assume you both are working at Google)

@kfswain

kfswain commented Mar 7, 2025

SGTM!

@mathetake
Member

Hi @yanavlasov @kfswain, luckily I have cycles to work on this from tomorrow, so I will do the initial work soon and might ask you guys for help whenever I need it. SG?

@kfswain

kfswain commented Mar 12, 2025

That sounds great. Happy to help

@mathetake
Member

now i get the big picture of how to implement this here... i have one question @kfswain: may i ask where the conformance test suite is? i couldn't find one in the repo

@yuzisun
Contributor

yuzisun commented Mar 13, 2025

Let's have a design doc first; this certainly changes the scope of the envoy ai gateway project.

@mathetake
Member

mathetake commented Mar 13, 2025

@yuzisun certainly, i will write it up by noon tomorrow

@mathetake
Member

mathetake commented Mar 13, 2025

I opened a PR for the high level doc: #492, and here are a couple of notes from my end and from offline discussion:

  • We will allow AIServiceBackend.BackendRef to reference InferencePool.
  • One of the requirements is to do a fallback between multiple (external) services. According to Yan, the documentation on InferencePool.Selector says that it can be a service label. So, the initial implementation is to support Service, and we will also expand it to support external domains as well. In other words, I will rule out Pod support from the initial iteration.
    * That means, initially we won't support InferenceModel either.
  • The dynamic-metadata-based LB/fallback is a necessary component in envoyproxy/envoy and that's Yan's current focus; meanwhile, I am scaffolding the controller as well as the integration with extproc.
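
To make the first bullet concrete, a hypothetical AIServiceBackend whose BackendRef points at an InferencePool could look like the sketch below; field names are my reading of the Envoy AI Gateway and GIE APIs at the time and may change, and all names are made up:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: llama-backend               # hypothetical name
spec:
  schema:
    name: OpenAI                    # API schema spoken by the backend
  backendRef:                       # reference to the InferencePool
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: llama-pool
```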

@kfswain

kfswain commented Mar 13, 2025

That means, initially we won't support InferenceModel either.

That should be fine; a user will still need an InferenceModel (the EPP reads and uses infModels). Other gateway implementations are not reconciling on the InferenceModel either.

may i ask where's the conformance test suite?

We should have a doc detailing our conformance testing available next week

@mathetake

This comment has been minimized.

@mathetake
Member

it looks like i need to read the reference impl more

@mathetake
Member

ok, i was misunderstanding the InferenceModel's role and i think it makes sense now. sorry @kfswain, ignore my comment

@mathetake
Member

I discussed this with @yuzisun offline and we are putting the development on hold until we see the actual Envoy-side implementation of the LB policy by @yanavlasov, to know whether or not the Bloomberg use case is covered by it

@mathetake mathetake modified the milestones: v0.2.0, v0.3.0 Mar 20, 2025
@mathetake mathetake self-assigned this Mar 20, 2025
mathetake added a commit that referenced this issue Mar 20, 2025
**Commit Message**

This adds a proposal doc on the support for Gateway API Inference
Extension in the Envoy AI Gateway project. This involves a change of the
project scope, and we need to make sure that the existing API layer will
co-exist nicely with the GAIE.


**Related Issues/PRs (if applicable)**

Preliminary to #423

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Erica Hughberg <[email protected]>
Co-authored-by: Erica Hughberg <[email protected]>
@mathetake
Member

merged the proposal doc: #492 !!

From tomorrow, i will resume the PoC work in #493 and try to make it work with minimal functionality by the end of next Wednesday or Thursday

@mathetake
Member

Update on the impl: control plane 100% done; extproc 80% done for the MVP

mathetake added a commit that referenced this issue Apr 2, 2025
**Commit Message**

This commit scaffolds the foundation for the Inference Extension API
[1]. The design documentation was merged in #492. The controller needs
to be started with `--enableInferenceExtension=true` so as not to break
existing controller deployments where the Inference Extension CRDs are
not installed.

This commit doesn't implement the actual "metrics-aware" load balancing;
instead it just does random routing over the given (resolved) endpoints.
Follow-up implementations will add more advanced algorithms while
expanding the metrics interface, which currently only provides the
setter APIs.

The summary of the implementation is:
* Added a `kind` field to AIGatewayRouteRuleBackendRef so that it can
reference InferencePool.
* InferencePool.Spec.Selector is allowed to select multiple
AIServiceBackends.
* When building up the extproc config via filterapi.Config, the
controller reads the referenced InferencePool and its bound
InferenceModels, and groups them together into a single
filterapi.DynamicLoadBalancing configuration.
* When the extproc loads a configuration containing
DynamicLoadBalancing, it resolves all the IP addresses for the hostnames
belonging to the DynamicLoadBalancing. The presence of
DynamicLoadBalancing in the config forces the config watcher to reload
and refresh the config regardless of updates. That way, the list of IP
addresses will always be updated (eventually consistent anyway) in a
non-hot path.
* On the request path, the ChatCompletionProcessor checks for the
existence of a DynamicLoadBalancing config for the backend selected by
the router. If present, it further tries to resolve the ip:port level
endpoint selection.
* The selected ip:port is set in a special header that is then routed to
ORIGINAL_DST.
* The ORIGINAL_DST cluster is added by the EG extension server
implementation. The extension server also modifies some routes to
properly route to that cluster.

1: https://github.com/kubernetes-sigs/gateway-api-inference-extension

**Related Issues/PRs (if applicable)**

Built on #492
Contributes to #423

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake
Member

mathetake commented Apr 25, 2025

I wanted to post an update on the current status of the implementation. Our initial foundational work landed in #493 a few weeks ago. With that, it works at the level of consuming InferencePool/InferenceModel. On the other hand, the initial PR left lots of TODOs; most importantly, the load balancing logic is missing.

At the community meeting two weeks ago, @yuzisun @sivanantha321 shared a proposal to address these TODOs: "Metrics-Based Load Balancing for Envoy AI Gateway Inference Extension". The suggested implementation was to implement the load balancing / metrics scraping logic directly in our extproc. After the discussion, we came to the agreement that we should decouple the "endpoint picker" logic instead of implementing it directly, i.e. we should allow users to bring in their own extensions to perform the load balancing logic. To do so, we need:

  1. An additional configuration knob to specify the endpoint picker extproc to be inserted.
  2. A relatively large refactoring of the original "feat: initial implementation of Inference Extension" #493, where we basically remove all the changes around our extproc code and add additional control plane code to load the BYO extproc.

My only remaining concern about this direction is how this will play well with the existing features such as transformation, auth and metrics, etc.

Personally, i am not currently working on this implementation, and I believe @sivanantha321 and @yuzisun will take on the task.

cc @envoyproxy/ai-gateway-maintainers

@mathetake
Member

mathetake commented May 2, 2025

so now i am doing a relatively large (internal) refactoring in #599, primarily to add support for Envoy Gateway's fallback/priority/retry compatibility. That means we can allow AIServiceBackend-level fallbacks using EG's native API instead of our own. It will affect the initial design we landed in #493, but in a good way: now i have an idea of how to solve this BYO endpoint picker matter. I will work on it after #599, which temporarily disables Inference Extension support (but on the main branch, so i can fix it before the next release)

7 participants