[Misc] SLO-aware router with profile support #1192


Open · zhangjyr wants to merge 9 commits into main from feature/load_aware_routing

Conversation


@zhangjyr zhangjyr commented Jun 12, 2025

Pull Request Description

Introducing an SLO-aware router with profile support. This PR introduces three new SLO-aware routing policies:

  1. slo (or slo-least-load-pulling)
  2. slo-least-load
  3. slo-pack-load
All three routing policies prioritize requests according to a profiled SLO target.

In addition to the slo-family routing policies, this PR adds built-in queues to support request reordering and future delay scheduling. In particular, QueueRouter enables pull mode within the gateway. Below is a comparison of pull mode and the default push mode:

  1. Push mode: The router dispatches requests to the server, possibly overloading the server.
  2. Pull mode: The server pulls requests from the router based on the server's capacity.

With profile support, the gateway now has server-capacity knowledge and can achieve pull mode within the gateway; a minimal sketch of the idea follows.
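For illustration only, here is a hypothetical sketch of the pull-mode idea (the request, pod, and pullLoop names are made up for this example and are not the PR's actual types): the gateway dequeues a request only when some pod's profiled capacity has headroom.

package main

import (
	"fmt"
	"time"
)

// Hypothetical types for illustration; the PR uses its own router/queue abstractions.
type request struct{ id int }

type pod struct {
	name     string
	capacity int // profiled max concurrent requests
	inFlight int
}

// pullLoop dispatches a queued request only when a pod has spare capacity,
// so the server's capacity (known from the profile) drives dequeueing
// instead of the router pushing requests unconditionally.
func pullLoop(queue <-chan request, pods []*pod) {
	for req := range queue {
		for {
			if p := podWithHeadroom(pods); p != nil {
				p.inFlight++
				fmt.Printf("dispatch request %d to %s\n", req.id, p.name)
				break
			}
			time.Sleep(5 * time.Millisecond) // wait for capacity to free up
		}
	}
}

func podWithHeadroom(pods []*pod) *pod {
	for _, p := range pods {
		if p.inFlight < p.capacity {
			return p
		}
	}
	return nil
}

func main() {
	queue := make(chan request, 2)
	queue <- request{id: 1}
	queue <- request{id: 2}
	close(queue)
	pullLoop(queue, []*pod{{name: "pod-a", capacity: 2}})
}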

Additional features added in this PR:

  1. Add a fallback routing policy mechanism that lets developers designate a default routing policy to use if the specified routing policy fails (see the sketch after this list).
  2. Wrap the routing policy registration mechanism in RouterManager so it can be reused for managing a family of related routing policies (including simplifying Select() to follow RouterProviderFunc).
  3. Add a profile cache API to manage model-based performance profiles.
  4. Improve the profile generator in the GPU manager to include SLO information and detailed metrics.
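A rough, hypothetical sketch of the fallback idea in item 1 (the Router interface and names here are illustrative, not the PR's actual API): if the primary policy returns an error, the designated fallback policy is consulted instead.

package main

import (
	"errors"
	"fmt"
)

// Router is an illustrative stand-in for the project's routing interface.
type Router interface {
	Route(model string) (podIP string, err error)
}

type routerFunc func(model string) (string, error)

func (f routerFunc) Route(model string) (string, error) { return f(model) }

// routeWithFallback tries the primary policy first and, if it fails,
// falls back to the designated default policy (e.g., random).
func routeWithFallback(primary, fallback Router, model string) (string, error) {
	if podIP, err := primary.Route(model); err == nil {
		return podIP, nil
	}
	return fallback.Route(model)
}

func main() {
	slo := routerFunc(func(string) (string, error) { return "", errors.New("profile missing") })
	random := routerFunc(func(string) (string, error) { return "10.0.0.7", nil })
	podIP, _ := routeWithFallback(slo, random, "llama2-7b")
	fmt.Println("routed to", podIP) // falls back to the random policy
}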

Preliminary results show the SLO policy can achieve the SLO target for a composite workload on heterogeneous GPUs:
[figure omitted: preliminary SLO results]
Workload: mixed ShareGPT and BIRD workloads with a ratio of 7:4
GPU: 1× A10, 4× L20
SLO: latency per token 0.05 s

Related Issues

Resolves: #642 #606

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.



Jeffwan commented Jun 14, 2025

I notice there are some refactoring changes (e.g., internal interface changes). Technically, those affect other components; could they be made separate changes? I mean splitting the changes into common parts (stakeholders need to review those) and SLO-specific changes (review could be looser, and the feature can be protected by a feature gate).

If the splitting is too complicated, we can do a first-round review and then check how to move forward.

@zhangjyr
Collaborator Author

@Jeffwan, I think the only internal interface change is to Select(). The function is called in only one place, and if you find it inappropriate, we can restore it.

// Parameters:
// deploymentName: Name of the deployment
// modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Collaborator

TODO: we may use other objects to orchestrate pods in the future; in that case, deployment might change. This looks good at the moment.

One more problem: a deployment without a namespace cannot be used to identify a deployment. We need to append the namespace field.

Collaborator Author

In the case of using other objects for deployment, the GPU optimizer would have to be changed as well (it monitors deployments only). For Ray cluster support, let me keep a note, leave this comment open, and add an issue after merging.

Can you explain the cases where "deployment without namespace can not be used to identify a deployment"?

Collaborator

I mean using namespace/deployment_name as the key.

Collaborator Author

The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]; the name is unique across namespaces given that:

  1. model names are unique across namespaces.
  2. deployment_names share the same namespace as the model name.
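For illustration, a hypothetical helper matching the key layout described above (the function and package names are made up):

package cachesketch

import "fmt"

// profileKey builds the Redis key in the layout described above:
// aibrix:profile_[model_name]_[deployment_name].
func profileKey(modelName, deploymentName string) string {
	return fmt.Sprintf("aibrix:profile_%s_%s", modelName, deploymentName)
}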

Collaborator

Can't we deploy the same model name in different namespaces?

metaModels utils.SyncMap[string, *Model] // model_name -> *Model

// Deployment related storage
deploymentProfiles utils.SyncMap[string, *ModelGPUProfile] // deployment_name -> *ModelGPUProfile
Collaborator

Same here; we can use namespace/deployment as the key.

Collaborator Author

The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]; the name is unique across namespaces given that:

  1. model names are unique across namespaces.
  2. deployment_names share the same namespace as the model name.

I've updated the comment here for clarification.

}
}
}
q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
Collaborator

Could it be a problem if another goroutine invokes physicalPosRLocked? Can we introduce something like the following and use it in physicalPosRLocked and setBaseCursor in expand?

func (q *SimpleQueue[V]) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}

Collaborator Author

baseCursor is updated only while holding the write lock (Lock()), and physicalPosRLocked() runs under RLock(), so it cannot execute while the write-locked update is in progress unless it is called incorrectly. The -RLocked() naming suffix tells developers to call this function only in RLock() contexts.
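A minimal sketch of the invariant described above, assuming a SimpleQueue guarded by a sync.RWMutex (the field and method names follow the discussion; the bodies are illustrative, not the PR's actual code):

package queuesketch

import "sync"

// SimpleQueue illustrates the locking contract discussed above: baseCursor is
// written only while holding the write lock, and *RLocked helpers are called
// only while holding the read lock, so no atomic access is required.
type SimpleQueue[V any] struct {
	mu         sync.RWMutex
	queue      []V
	baseCursor int64
}

// expand replaces the backing slice and advances baseCursor under the write lock.
func (q *SimpleQueue[V]) expand(newQueue []V, dequeuePos int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
}

// physicalPosRLocked maps a logical position to a slice index.
// The -RLocked suffix documents that the caller must hold q.mu.RLock().
func (q *SimpleQueue[V]) physicalPosRLocked(logical int64) int64 {
	return logical - q.baseCursor
}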

debugDelay time.Duration
tokens []int
predictor OutputPredictor
Collaborator

One of my concerns is which fields can be used by routing algorithms when profiles are disabled. As a routing algorithm developer, which fields should I expect to be available when I enable or disable certain features?

Collaborator Author

Current SLO-family routers are profile-based; if profiles are disabled, we fall back to least-request. The profile is essential in the current implementation because it provides the model-based SLO target and the server capacity in terms of load. For a non-model-specific general SLO (e.g., 120 s latency), if the SLO can be set via environment variables, I think the SLO queue could also be combined with other stateless routers to provide some degree of improvement. OutputPredictor itself is decoupled from profiles and can work independently.

queueOverallSLO bool = false
monogenousGPURouting bool = true
monogenousGPURoutingOnly bool = monogenousGPURouting && false
initialTotalSubQueues int = 8 // Expect no more than 8 subqueues
Collaborator

Are these magic numbers constants, or should they be adjusted based on the available resources?

Collaborator Author

They are constants and are there only for evaluation purposes. They are feature switches inherited from my router simulator; I started from the router simulator to find the best configuration and synced the switches later for convenience.

@Xunzhuo Xunzhuo self-requested a review June 23, 2025 09:23
@zhangjyr zhangjyr requested a review from Jeffwan June 23, 2025 18:00

Jeffwan commented Jun 24, 2025

@zhangjyr please rebase the branch one more time. I tried to merge it but noticed some rebase conflicts.

@zhangjyr zhangjyr force-pushed the feature/load_aware_routing branch from 54966dc to 96f4c12 on June 24, 2025 17:29
@zhangjyr
Collaborator Author

@zhangjyr please rebase the branch one more time. I tried to merge it but notice some rebase conflicts

@Jeffwan Done

Signed-off-by: Jingyuan Zhang <[email protected]>
pkg/utils/pod.go Outdated
@@ -36,6 +37,17 @@ const (
defaultPodMetricPort = 8000
)

var (
ReplicaSetDeploymentFinder = regexp.MustCompile(`^(.*)-\w+$`) // Deployment-[random name]
RayClusterFleatFinder = regexp.MustCompile(`^(.*)-\w+-\w+$`) // RayClusterFleat-[random name]-[random name]
Collaborator

spelling fleat -> fleet.


targetPodSet chan struct{}
targetPod atomic.Pointer[v1.Pod]
debugDelay time.Duration
lastError atomic.Pointer[error]
tokens []int // Cache of tokenized prompts
Collaborator

Can you elaborate on the comment for tokens?

Collaborator Author

Do you mean how it is used or how it is created? For usage, this is just a private cache, because promptLength might be called multiple times during routing.

Collaborator

Both: what tokens are stored and when they are used.

Given that the routing context now carries a lot of metadata for a request, a README can be added in a follow-up PR.

Collaborator Author

The first time PromptTokens() or PromptLength() is called, the tokens are cached in the context. I primarily use prompt length for output prediction, so the prefix-cache method of tokenization may not be suitable. Although a model-specific tokenizer would be preferable, the current output prediction only predicts on coarse-grained log2 buckets, so I think tiktoken fits.

Signed-off-by: Jingyuan Zhang <[email protected]>
func (r *RoutingContext) PromptTokens() ([]int, error) {
if r.tokens == nil {
var err error
r.tokens, err = utils.TokenizeInputText(r.Message)
Collaborator

It will depend on the tokenizer used.

Collaborator Author

utils.TokenizeInputText specifically uses tiktoken; I optimized the function by reusing the tiktoken object.
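A hedged sketch of what "reusing the tiktoken object" can look like, assuming the pkoukk/tiktoken-go library (which may not be the exact tokenizer or API used in aibrix): the encoder is constructed once and shared across calls.

package utilsketch

import (
	"sync"

	"github.com/pkoukk/tiktoken-go"
)

var (
	encOnce sync.Once
	enc     *tiktoken.Tiktoken
	encErr  error
)

// tokenizeInputText encodes text with a lazily initialized, shared encoder,
// so routing code that calls it repeatedly does not rebuild the tokenizer.
func tokenizeInputText(text string) ([]int, error) {
	encOnce.Do(func() {
		enc, encErr = tiktoken.GetEncoding("cl100k_base")
	})
	if encErr != nil {
		return nil, encErr
	}
	return enc.Encode(text, nil, nil), nil
}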

util, err := r.provider.GetUtilization(ctx, pod)
if err != nil {
lastErr = r.updateError(lastErr, err)
klog.ErrorS(err, "Skipped pod due to fail to get utilization in leastLoadRouter", "pod", pod.Name)
Collaborator

Log it with the request_id; also, this has the potential to flood the logs.

Collaborator Author

Yes, I've added a new feature to avoid multiple error logs for the same request. In particular, a missing or unsupported profile can be a common problem across all pods; only one error log is needed if that happens.
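A hypothetical sketch of the "one error log per request" behavior described above (the helper name and parameters are made up, not the PR's actual code): the error is logged only when its message differs from the previously recorded one.

package routersketch

import "k8s.io/klog/v2"

// updateErrorOnce records err as the latest error and logs it only if its
// message differs from the previous one, so a profile problem shared by all
// pods produces a single log line per request instead of one per pod.
func updateErrorOnce(lastErr, err error, requestID, podName string) error {
	if lastErr != nil && lastErr.Error() == err.Error() {
		return lastErr // same error already logged for this request
	}
	klog.ErrorS(err, "skipped pod due to failure to get utilization",
		"request_id", requestID, "pod", podName)
	return err
}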

pod.Name, pod.Status.PodIP, util)

var consumption float64
if r.pulling {
Collaborator

what does pulling mean here?

Collaborator Author

"pulling" means the pull mode in the PR description:
"Below is a comparison of pulling mode and default push mode:
Push mode: The router dispatches requests to the server, possibly overloading the server.
Pull mode: The server pulls requests from the router based on the server's capacity.
With profile support, the gateway now has server capacity knowledge and can achieve pull mode within the gateway."

)

const (
RouterSLO types.RoutingAlgorithm = "slo"
Collaborator

Can you add a README for these algorithms?

Jingyuan Zhang added 2 commits June 25, 2025 11:28
Signed-off-by: Jingyuan Zhang <[email protected]>
…load_aware_routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	config/gateway/gateway-plugin/gateway-plugin.yaml
@zhangjyr zhangjyr requested a review from varungup90 June 25, 2025 18:33
Signed-off-by: Jingyuan Zhang <[email protected]>
Collaborator

@Xunzhuo Xunzhuo left a comment

Thanks for the work, left some small comments!

@@ -113,5 +113,17 @@ test-gateway2:
"max_tokens": 512 \
}'

test-router:
Collaborator

Can this make target be renamed to test-slo-router?

Collaborator Author

In fact, test-router is just a showcase. I can change the strategy to least-request.

// Parameters:
// deploymentName: Name of the deployment
// modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Collaborator

Can't we deploy the same model name in different namespaces?

func NewTestCacheWithPods(pods []*v1.Pod, model string) *Store {
c := &Store{}
// NewForTest initializes the cache store for testing purposes, it can be repeated call for reset.
func NewForTest() *Store {
Collaborator

Can this move to cache_test.go?

Collaborator Author

This makes sense. However, this set of functions is designed to be called by tests in other packages (e.g., the routing algorithms); moving NewForTest() to cache_test.go would make them inaccessible from there.

return
}

for _, key := range keys {
Collaborator

Thoughts on doing it concurrently?

Collaborator Author

The profile update is off the request path, and there is no urgency to keep the profile real-time. On the other hand, concurrent updates can cause unnecessary network bursts (Redis GETs), memory footprint (stored bytes), and CPU footprint (unmarshalling).

import "errors"

var (
ErrorTypeMetricNotFound = &CacheError{error: errors.New("metric not found")}
Collaborator

Besides these two errors, can any other errors in Cache be abstracted?

Collaborator Author

There are many, but this PR is not really about error standardization. I created this abstraction for my part of the code, and hopefully other collaborators will see the need and adopt this approach.

Cost float64 `json:"cost"`
Tputs [][]float64 `json:"tputs"` // Max RPS per correspondent index.
Indexes [][]float64 `json:"indexes"` // [output tokens, input tokens]
Created float64 `json:"created"`
Collaborator

What does created mean here? It looks like a boolean variable name.

Collaborator Author

@zhangjyr zhangjyr Jun 26, 2025

It is a timestamp in Unix sec.sub-sec format, used to test whether the profile has been updated. I will add a comment for this field.
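For reference, a Unix sec.sub-sec timestamp like the Created value can be produced as follows (illustrative helper and package names):

package profilesketch

import "time"

// unixSecondsWithFraction returns the current time as Unix seconds with a
// fractional part, e.g. 1748888478.258241, matching the Created field format.
func unixSecondsWithFraction() float64 {
	return float64(time.Now().UnixNano()) / 1e9
}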

Collaborator

Thoughts on CreatedAt?

)

const (
profile1 = "{\"gpu\": \"simulator-llama2-7b-a100\", \"cost\": 1.0, \"tputs\": [[62.4770592037908, 32.3132063155413, 16.794872306236712, 8.506609663869995, 4.25505318548248], [60.316466916176516, 31.743326005665654, 16.648846618248662, 8.456487777894496, 4.232097897906285], [31.390153356402557, 31.44077673744191, 16.35548438044074, 8.319937915785582, 4.195196435104038], [29.659845691502795, 29.443006402919238, 15.737020124678372, 8.09950347977996, 4.097482292889981], [15.195016931044663, 15.141240615344003, 8.084202765314256, 7.635497267355755, 3.9453497193741334], [7.6749238663857575, 7.646377471511056, 7.469959534415053, 4.021491925138638, 2.0825276518486944], [3.8569367968050043, 3.8474025810402654, 3.788259676585335, 2.0288591107923666, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]], \"indexes\": [[4, 8, 16, 32, 64, 128, 256, 512], [128, 256, 512, 1024, 2048]], \"created\": 1748888478.258241, \"e2e\": [[0.09952875833841972, 0.12239365121233277, 0.21428067207685672, 0.31731165578588844, 0.5534239103773143], [0.17259948405786418, 0.19343522920506076, 0.3083386933861766, 0.43137065838091077, 0.6822485379409045], [0.2401115512580145, 0.34489422958809884, 0.47929574372596107, 0.6583689095533919, 0.940295041234931], [0.48604107543011194, 0.6922839175269473, 0.827185874566203, 1.0668736191641073, 1.4869349837116896], [0.8164577254245523, 0.9879068507999181, 1.041013903748244, 2.1203144049376714, 2.666910813357681], [1.51887807746185, 1.7140269558737053, 2.342019975812873, 2.50050834288937, 2.8703157891728917], [2.9778688216709996, 3.245595489592524, 3.9633901528839486, 4.021366707490524, 4.36835270334268], [5.479355860430514, 5.721908029631013, 6.244554732975084, 7.508401151361177, 11.322325259115313]], \"slos\": {\"percentile\": 99, \"e2e\": 5.0}}"
Collaborator

Suggest separating this from the code into cache/testdata/xx-xxx-xx-profile.json, so the test case can be extended with different scenarios.

Collaborator Author

Good idea, but let's make the change if more test cases are added here.

// PendingLoad = 1 / PendingRequests = 1 / (Throughput * Latency),
// where PendingRequests = Throughput * Latency follows Little's Law,
// and Throughput(RPS) and Latency are from loaded profile per feature (input tokens, output tokens)
type PendingLoadProvider struct {
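A hedged sketch of the formula in the comment above, not the PR's actual implementation: Little's Law gives PendingRequests = Throughput × Latency at a given (input tokens, output tokens) profile point, and each request then accounts for 1 / PendingRequests of that capacity.

package loadsketch

import "math"

// pendingLoad computes the per-request load for one profile feature point,
// where rps and latencySeconds would be looked up from the GPU profile by
// the request's (input tokens, output tokens) bucket.
func pendingLoad(rps, latencySeconds float64) float64 {
	pendingRequests := rps * latencySeconds // Little's Law: L = λ * W
	if pendingRequests <= 0 {
		return math.Inf(1) // no capacity at this feature point
	}
	return 1 / pendingRequests
}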
Collaborator

Can you add a test case for the pending load provider?

Collaborator Author

What test cases would you like to be added?

"github.com/vllm-project/aibrix/pkg/types"
)

const DefaultFallbackAlgorithm types.RoutingAlgorithm = RouterRandom
Collaborator

So if the algorithm does not explicitly set the fallback strategy, we use random, right?

Collaborator

Can we remove SelectRandomPodAsFallback in pkg/plugins/gateway/algorithms/util.go in favor of the current, more elegant approach?

Collaborator Author

@zhangjyr zhangjyr Jun 26, 2025

So if the algorithm does not explicitly set the fallback strategy, we use random, right?

If the algorithm does not explicitly set the fallback, random will be used. Yes.

Can we remove SelectRandomPodAsFallback in pkg/plugins/gateway/algorithms/util.go in favor of the current, more elegant approach?

Let's open a new code-refactoring issue after this PR is merged. This PR is already too large, so I'm refraining from further refactoring.

for i := 0; i < iterration; i++ {
req := "hello test"
targetPod := getTargetPodFromChatCompletion(t, req, "slo")
assert.NotEmpty(t, targetPod, "target pod should not be empty")
Collaborator

Nit, can we enrich the e2e?

Collaborator Author

Lol, you got me. The test was there because I thought unit testing could not be done for the SLO policy. However, the current slo_test.go in the algorithms package covers most of the e2e tests. I will remove this part of the tests.

continue // Skip to the next key
}

c.UpdateModelProfile(key, &updated, false)
Collaborator

A question on this: I wonder who/where adds the GPU profile cache entries to Redis — who is the producer?

Collaborator Author

The submission of profiles is a part of the previous GPU optimizer PR. See step 4 of https://aibrix.readthedocs.io/latest/features/heterogeneous-gpu.html for details.

Collaborator

Thanks!

Jingyuan Zhang added 2 commits June 26, 2025 14:20
…load_aware_routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	pkg/plugins/gateway/algorithms/router.go
@zhangjyr zhangjyr requested a review from Xunzhuo June 26, 2025 21:35
…ewForTest will not change global store and work as stateless as expected.

Signed-off-by: Jingyuan Zhang <[email protected]>
Collaborator

@Xunzhuo Xunzhuo left a comment

Thanks! This is massive; assuming there are no more conflicts, I will prioritize merging this PR.


Successfully merging this pull request may close these issues.

[RFC]: Load-aware pattern-based routing policy with profile support
4 participants