
Initial Gateway API Inference Extension Blog Post #49898

Open · danehans wants to merge 5 commits into main from gie_kcon_blog

Conversation

danehans (Contributor)

Description

Adds a blog post introducing the Gateway API inference extension project.

cc: @robscott @kfswain

k8s-ci-robot added the area/blog label (Issues or PRs related to the Kubernetes Blog subproject) on Feb 25, 2025
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nate-double-u for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the language/en (Issues or PRs related to English language), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) labels on Feb 25, 2025

netlify bot commented Feb 25, 2025

Pull request preview available for checking

Built without sensitive environment variables

| Name | Link |
|------|------|
| 🔨 Latest commit | 7ee2d03 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/67df0b4d5c35e80008f0cde8 |
| 😎 Deploy Preview | https://deploy-preview-49898--kubernetes-io-main-staging.netlify.app |

danehans changed the title from "Initial Gateway API Inference Extension Blog Post" to "[WIP] Initial Gateway API Inference Extension Blog Post" on Feb 25, 2025
k8s-ci-robot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Feb 25, 2025
danehans (Contributor, Author) commented on Mar 7, 2025:

@robscott PTAL and let me know if you would like any modifications.

danehans (Contributor, Author) commented:

TODO [danehans]: Add benchmarks and ref to: kubernetes-sigs/gateway-api-inference-extension#480 (when merged).

k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label on Mar 20, 2025
@@ -23,12 +23,13 @@ is missing.

## Enter Gateway API Inference Extension

[Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) was created to address this gap by building on the existing [Gateway API](https://gateway-api.sigs.k8s.io/),
smarterclayton (Contributor) commented:

On line 19 above, "... focused on HTTP path routing or ..."?

danehans (Contributor, Author) replied:

@smarterclayton I resolved all your feedback in the latest commit other than this comment. Feel free to rereview and/or elaborate. Thanks again for your review.

standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware
routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing
based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator
(GPU) utilization for AI workloads.
A Contributor commented:

I'd love if you could work in

"Adding the inference extension to your existing gateway makes it an Inference Gateway - enabling you to self-host large language models with a model as a service mindset"

or similar. Roughly hitting the two points "inference extends gateway = inference gateway", and "inference gateway = self-host genai/large models as model as a service"

robscott (Member) replied:

+1, I really like this framing, and we should use it as much as we can throughout this post and our docs.

I started to go a bit farther with this theme and realized that we could write a very compelling blog post with this theme after KubeCon when we have more Gateway implementations ready. That post could be titled "Introducing Kubernetes Inference Gateways", and have a section describing that an Inference Gateway is an "existing gateway + inference extension". To really sell that though, I think we need to have a variety of "Inference Gateways" ready to play with.

So if we think we'll end up with two separate blog posts here, maybe this initial one is focused on the project goal of extending any existing Gateway with specialized Inference routing capabilities, and then in a follow up blog post we can focus more on the "Inference Gateway" term when we have more examples to work with.

Or maybe we should just hold off on this post until we have more Inference Gateway examples. I'm not sure, open to ideas here.

danehans (Contributor, Author) replied:

I like planting the "gateway + inference extension = inference gateway" seed here and using a follow-up post to drive the messaging.
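
For concreteness, the per-request criticality named in the objectives quoted above is expressed through the project's InferenceModel resource. A minimal sketch, assuming the v1alpha2 API version and hypothetical names throughout; consult the project docs for the current schema:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2  # illustrative API version
kind: InferenceModel
metadata:
  name: chatbot                      # hypothetical name
spec:
  modelName: llama3-8b-instruct      # model name that clients send in request bodies
  criticality: Critical              # Critical requests are favored over Sheddable ones under accelerator pressure
  poolRef:
    name: llama3-pool                # hypothetical InferencePool serving this model
```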

@@ -64,7 +65,7 @@ steps, e.g. extensions, in the middle. Here’s a high-level example of the requ
and identifies the matching InferencePool backend.

2. **Endpoint Selection**
- Instead of simply forwarding to any pod, the Gateway consults an inference-specific routing extension. This
+ Instead of simply forwarding to any pod, the Gateway consults an inference-specific routing extension, e.g. endpoint selection extension. This
A Contributor commented:

maybe instead of 'e.g.'

Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension - an ​endpoint selection extension - to pick the best of the available pods.

?

danehans (Contributor, Author) replied:

Making the ^ change with one minor difference s/an ​endpoint selection extension/the Endpoint Selection Extension/
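
To ground the request flow discussed above: the matching InferencePool backend is referenced from an ordinary HTTPRoute, so existing Gateway API routing is reused before the Endpoint Selection Extension picks a pod. A minimal sketch with hypothetical names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                    # hypothetical
spec:
  parentRefs:
    - name: inference-gateway        # hypothetical Gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool        # an InferencePool stands in for a plain Service
          name: llama3-pool          # hypothetical pool from the earlier sketch
```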

more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.

**Ready to learn more?** Visit the [project docs](https://gateway-api-inference-extension.sigs.k8s.io/) to dive deeper,
give Inference Extension a try with a few [simple steps](https://gateway-api-inference-extension.sigs.k8s.io/guides/),
A Contributor commented:

I would suggest saying "... give the Inference Gateway extension a try with a few ...".

robscott (Member) replied:

We probably want to hold off on publishing this until we've updated our guides to use proper "Inference Gateways" instead of Envoy patches. Maybe that's actually an argument for saving this until after KubeCon?

danehans (Contributor, Author) replied:

The initial inference extension support landed in kgateway, and I plan on adding an inference extension docs PR in the next few days.

danehans force-pushed the gie_kcon_blog branch 2 times, most recently from 942afc7 to 6bdf890, on March 21, 2025
danehans changed the title from "[WIP] Initial Gateway API Inference Extension Blog Post" to "Initial Gateway API Inference Extension Blog Post" on Mar 21, 2025
k8s-ci-robot removed the do-not-merge/work-in-progress label on Mar 21, 2025
robscott (Member) left a comment:

Thanks for the work on this @danehans!

---
layout: blog
title: "Introducing Gateway API Inference Extension"
date: 2025-02-21
robscott commented:

@danehans can we aim for a day that hasn't been claimed yet next week?




This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to
the client.

robscott (Member) commented:

Somewhere in this section I think it would be useful to mention the extensible nature of this model, and that new extensions can be developed that will be compatible with any Inference Gateway.

danehans (Contributor, Author) replied:

I updated this section based on ^ feedback, PTAL.
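
The pluggability discussed in this thread is visible in the API itself: an InferencePool names its routing extension via extensionRef, so a differently implemented endpoint-selection extension can be substituted without changing routes. A rough sketch, again assuming the v1alpha2 API version and hypothetical names:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2  # illustrative API version
kind: InferencePool
metadata:
  name: llama3-pool                  # hypothetical, matches the sketches above
spec:
  targetPortNumber: 8000             # port the model-server pods listen on
  selector:
    app: llama3-server               # hypothetical label on the model-server pods
  extensionRef:
    name: llama3-epp                 # hypothetical Endpoint Selection Extension Service
```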

Labels: area/blog · cncf-cla: yes · language/en · size/L

6 participants