[Misc] SLO-aware router with profile support #1192


Open · zhangjyr wants to merge 9 commits into main from feature/load_aware_routing

Conversation


@zhangjyr zhangjyr commented Jun 12, 2025

Pull Request Description

Introducing an SLO-aware router with profile support. This PR introduces three new SLO-aware routing policies:

  1. slo (or slo-least-load-pulling)
  2. slo-least-load
  3. slo-pack-load
All three routing policies prioritize requests according to a profiled SLO target.

In addition to the slo-family routing policies, this PR adds built-in queues to support request reordering and future delay scheduling. In particular, QueueRouter enables pull mode within the gateway. Below is a comparison of pull mode and the default push mode:

  1. Push mode: The router dispatches requests to the server, possibly overloading the server.
  2. Pull mode: The server pulls requests from the router based on the server's capacity.

With profile support, the gateway now has server-capacity knowledge and can achieve pull mode within the gateway; a minimal sketch of the idea follows.
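For illustration only, here is a hypothetical sketch of the pull-mode idea (the request, pod, and pullLoop names are made up for this example and are not the PR's actual types): the gateway dequeues a request only when some pod's profiled capacity has headroom.

package main

import (
	"fmt"
	"time"
)

// Hypothetical types for illustration; the PR uses its own router/queue abstractions.
type request struct{ id int }

type pod struct {
	name     string
	capacity int // profiled max concurrent requests
	inFlight int
}

// pullLoop dispatches a queued request only when a pod has spare capacity,
// so the server's capacity (known from the profile) drives dequeueing
// instead of the router pushing requests unconditionally.
func pullLoop(queue <-chan request, pods []*pod) {
	for req := range queue {
		for {
			if p := podWithHeadroom(pods); p != nil {
				p.inFlight++
				fmt.Printf("dispatch request %d to %s\n", req.id, p.name)
				break
			}
			time.Sleep(5 * time.Millisecond) // wait for capacity to free up
		}
	}
}

func podWithHeadroom(pods []*pod) *pod {
	for _, p := range pods {
		if p.inFlight < p.capacity {
			return p
		}
	}
	return nil
}

func main() {
	queue := make(chan request, 2)
	queue <- request{id: 1}
	queue <- request{id: 2}
	close(queue)
	pullLoop(queue, []*pod{{name: "pod-a", capacity: 2}})
}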

Additional features added in this PR:

  1. Add a fallback routing policy mechanism that lets developers designate a default routing policy to use if the specified routing policy fails (see the sketch after this list).
  2. Wrap the routing policy registration mechanism in RouterManager so it can be reused for managing a family of related routing policies (including simplifying Select() to follow RouterProviderFunc).
  3. Add a profile cache API to manage model-based performance profiles.
  4. Improve the profile generator in the GPU manager to include SLO information and detailed metrics.
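A rough, hypothetical sketch of the fallback idea in item 1 (the Router interface and names here are illustrative, not the PR's actual API): if the primary policy returns an error, the designated fallback policy is consulted instead.

package main

import (
	"errors"
	"fmt"
)

// Router is an illustrative stand-in for the project's routing interface.
type Router interface {
	Route(model string) (podIP string, err error)
}

type routerFunc func(model string) (string, error)

func (f routerFunc) Route(model string) (string, error) { return f(model) }

// routeWithFallback tries the primary policy first and, if it fails,
// falls back to the designated default policy (e.g., random).
func routeWithFallback(primary, fallback Router, model string) (string, error) {
	if podIP, err := primary.Route(model); err == nil {
		return podIP, nil
	}
	return fallback.Route(model)
}

func main() {
	slo := routerFunc(func(string) (string, error) { return "", errors.New("profile missing") })
	random := routerFunc(func(string) (string, error) { return "10.0.0.7", nil })
	podIP, _ := routeWithFallback(slo, random, "llama2-7b")
	fmt.Println("routed to", podIP) // falls back to the random policy
}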

Preliminary results show the SLO policy can achieve the SLO target for a composite workload on heterogeneous GPUs:
[figure omitted: preliminary SLO results]
Workload: mixed ShareGPT and BIRD workloads with a ratio of 7:4
GPU: 1× A10, 4× L20
SLO: latency per token 0.05 s

Related Issues

Resolves: #642 #606

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.



Jeffwan commented Jun 14, 2025

I notice there are some refactoring changes (e.g., internal interface changes). Technically, those affect other components; could they be made separate changes? I mean splitting the changes into common parts (stakeholders need to review those) and SLO-specific changes (review could be looser, and the feature can be protected by a feature gate).

If the splitting is too complicated, we can do a first-round review and then check how to move forward.

@zhangjyr
Collaborator Author

@Jeffwan, I think the only internal interface change is to Select(). The function is called in only one place, and if you find it inappropriate, we can restore it.

// Parameters:
// deploymentName: Name of the deployment
// modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Collaborator

TODO: we may use other objects to orchestrate pods in the future; in that case, deployment might change. This looks good at the moment.

One more problem: a deployment without a namespace cannot be used to identify a deployment. We need to append the namespace field.

Collaborator Author

In the case of using other objects for deployment, the GPU optimizer would have to be changed as well (it monitors deployments only). For Ray cluster support, let me keep a note, leave this comment open, and add an issue after merging.

Can you explain the cases where "deployment without namespace can not be used to identify a deployment"?

Collaborator

I mean using namespace/deployment_name as the key.

Collaborator Author

The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]; the name is unique across namespaces given that:

  1. model names are unique across namespaces.
  2. deployment_names share the same namespace as the model name.
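For illustration, a hypothetical helper matching the key layout described above (the function and package names are made up):

package cachesketch

import "fmt"

// profileKey builds the Redis key in the layout described above:
// aibrix:profile_[model_name]_[deployment_name].
func profileKey(modelName, deploymentName string) string {
	return fmt.Sprintf("aibrix:profile_%s_%s", modelName, deploymentName)
}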

Collaborator

Can't we deploy the same model name in different namespaces?

metaModels utils.SyncMap[string, *Model] // model_name -> *Model

// Deployment related storage
deploymentProfiles utils.SyncMap[string, *ModelGPUProfile] // deployment_name -> *ModelGPUProfile
Collaborator

Same here; we can use namespace/deployment as the key.

Collaborator Author

The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]; the name is unique across namespaces given that:

  1. model names are unique across namespaces.
  2. deployment_names share the same namespace as the model name.

I've updated the comment here for clarification.

}
}
}
q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
Collaborator

Could it be a problem if another goroutine invokes physicalPosRLocked? Can we introduce something like the following and use it in physicalPosRLocked and setBaseCursor in expand?

func (q *SimpleQueue[V]) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}

Collaborator Author

baseCursor is updated only while holding the write lock (Lock()), and physicalPosRLocked() runs under RLock(), so it cannot execute while the write-locked update is in progress unless it is called incorrectly. The -RLocked() naming suffix tells developers to call this function only in RLock() contexts.
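A minimal sketch of the invariant described above, assuming a SimpleQueue guarded by a sync.RWMutex (the field and method names follow the discussion; the bodies are illustrative, not the PR's actual code):

package queuesketch

import "sync"

// SimpleQueue illustrates the locking contract discussed above: baseCursor is
// written only while holding the write lock, and *RLocked helpers are called
// only while holding the read lock, so no atomic access is required.
type SimpleQueue[V any] struct {
	mu         sync.RWMutex
	queue      []V
	baseCursor int64
}

// expand replaces the backing slice and advances baseCursor under the write lock.
func (q *SimpleQueue[V]) expand(newQueue []V, dequeuePos int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
}

// physicalPosRLocked maps a logical position to a slice index.
// The -RLocked suffix documents that the caller must hold q.mu.RLock().
func (q *SimpleQueue[V]) physicalPosRLocked(logical int64) int64 {
	return logical - q.baseCursor
}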

debugDelay time.Duration
tokens []int
predictor OutputPredictor
Collaborator

One of my concerns is which fields can be used by routing algorithms when profiles are disabled. As a routing algorithm developer, which fields should I expect to be available when I enable or disable certain features?

Collaborator Author

Current SLO-family routers are profile-based; if profiles are disabled, we fall back to least-request. The profile is essential in the current implementation because it provides the model-based SLO target and the server capacity in terms of load. For a non-model-specific general SLO (e.g., 120 s latency), if the SLO can be set via environment variables, I think the SLO queue could also be combined with other stateless routers to provide some degree of improvement. OutputPredictor itself is decoupled from profiles and can work independently.

queueOverallSLO bool = false
monogenousGPURouting bool = true
monogenousGPURoutingOnly bool = monogenousGPURouting && false
initialTotalSubQueues int = 8 // Expect no more than 8 subqueues
Collaborator

Are these magic numbers constants, or should they be adjusted based on the available resources?

Collaborator Author

They are constants and are there only for evaluation purposes. They are feature switches inherited from my router simulator; I started from the router simulator to find the best configuration and synced the switches later for convenience.

@Xunzhuo Xunzhuo self-requested a review June 23, 2025 09:23
@zhangjyr zhangjyr requested a review from Jeffwan June 23, 2025 18:00

Jeffwan commented Jun 24, 2025

@zhangjyr please rebase the branch one more time. I tried to merge it but noticed some rebase conflicts.

@zhangjyr zhangjyr force-pushed the feature/load_aware_routing branch from 54966dc to 96f4c12 on June 24, 2025 17:29
@zhangjyr
Collaborator Author

@zhangjyr please rebase the branch one more time. I tried to merge it but notice some rebase conflicts

@Jeffwan Done

Signed-off-by: Jingyuan Zhang <[email protected]>
pkg/utils/pod.go Outdated
@@ -36,6 +37,17 @@ const (
defaultPodMetricPort = 8000
)

var (
ReplicaSetDeploymentFinder = regexp.MustCompile(`^(.*)-\w+$`) // Deployment-[random name]
RayClusterFleatFinder = regexp.MustCompile(`^(.*)-\w+-\w+$`) // RayClusterFleat-[random name]-[random name]
Collaborator

spelling fleat -> fleet.


targetPodSet chan struct{}
targetPod atomic.Pointer[v1.Pod]
debugDelay time.Duration
lastError atomic.Pointer[error]
tokens []int // Cache of tokenized prompts
Collaborator

Can you elaborate on the comment for tokens?

Collaborator Author

Do you mean how it is used or how it is created? For usage, this is just a private cache, because promptLength might be called multiple times during routing.

Collaborator

Both: what tokens are stored and when they are used.

Given that the routing context now carries a lot of metadata for a request, a README can be added in a follow-up PR.

Collaborator Author

The first time PromptTokens() or PromptLength() is called, the tokens are cached in the context. I primarily use prompt length for output prediction, so the prefix-cache method of tokenization may not be suitable. Although a model-specific tokenizer would be preferable, the current output prediction only predicts on coarse-grained log2 buckets, so I think tiktoken fits.

Signed-off-by: Jingyuan Zhang <[email protected]>
func (r *RoutingContext) PromptTokens() ([]int, error) {
if r.tokens == nil {
var err error
r.tokens, err = utils.TokenizeInputText(r.Message)
Collaborator

It will depend on the tokenizer used.

Collaborator Author

utils.TokenizeInputText specifically uses tiktoken; I optimized the function by reusing the tiktoken object.
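A hedged sketch of what "reusing the tiktoken object" can look like, assuming the pkoukk/tiktoken-go library (which may not be the exact tokenizer or API used in aibrix): the encoder is constructed once and shared across calls.

package utilsketch

import (
	"sync"

	"github.com/pkoukk/tiktoken-go"
)

var (
	encOnce sync.Once
	enc     *tiktoken.Tiktoken
	encErr  error
)

// tokenizeInputText encodes text with a lazily initialized, shared encoder,
// so routing code that calls it repeatedly does not rebuild the tokenizer.
func tokenizeInputText(text string) ([]int, error) {
	encOnce.Do(func() {
		enc, encErr = tiktoken.GetEncoding("cl100k_base")
	})
	if encErr != nil {
		return nil, encErr
	}
	return enc.Encode(text, nil, nil), nil
}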

util, err := r.provider.GetUtilization(ctx, pod)
if err != nil {
lastErr = r.updateError(lastErr, err)
klog.ErrorS(err, "Skipped pod due to fail to get utilization in leastLoadRouter", "pod", pod.Name)
Collaborator

Log it with the request_id; also, this has the potential to flood the logs.

Collaborator Author

Yes, I've added a new feature to avoid multiple error logs for the same request. In particular, a missing or unsupported profile can be a common problem across all pods; only one error log is needed if that happens.
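A hypothetical sketch of the "one error log per request" behavior described above (the helper name and parameters are made up, not the PR's actual code): the error is logged only when its message differs from the previously recorded one.

package routersketch

import "k8s.io/klog/v2"

// updateErrorOnce records err as the latest error and logs it only if its
// message differs from the previous one, so a profile problem shared by all
// pods produces a single log line per request instead of one per pod.
func updateErrorOnce(lastErr, err error, requestID, podName string) error {
	if lastErr != nil && lastErr.Error() == err.Error() {
		return lastErr // same error already logged for this request
	}
	klog.ErrorS(err, "skipped pod due to failure to get utilization",
		"request_id", requestID, "pod", podName)
	return err
}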

pod.Name, pod.Status.PodIP, util)

var consumption float64
if r.pulling {
Collaborator

what does pulling mean here?

Collaborator Author

"pulling" means the pull mode in the PR description:
"Below is a comparison of pulling mode and default push mode:
Push mode: The router dispatches requests to the server, possibly overloading the server.
Pull mode: The server pulls requests from the router based on the server's capacity.
With profile support, the gateway now has server capacity knowledge and can achieve pull mode within the gateway."

)

const (
RouterSLO types.RoutingAlgorithm = "slo"
Collaborator

Can you add a README for these algorithms?

Jingyuan Zhang added 2 commits June 25, 2025 11:28
Signed-off-by: Jingyuan Zhang <[email protected]>
…load_aware_routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	config/gateway/gateway-plugin/gateway-plugin.yaml
@zhangjyr zhangjyr requested a review from varungup90 June 25, 2025 18:33
Signed-off-by: Jingyuan Zhang <[email protected]>
Collaborator

@Xunzhuo Xunzhuo left a comment

Thanks for the work, left some small comments!

@@ -113,5 +113,17 @@ test-gateway2:
"max_tokens": 512 \
}'

test-router:
Collaborator

Can this make target be renamed to test-slo-router?

Collaborator Author

In fact, test-router is just a showcase. I can change the strategy to least-request.

// Parameters:
// deploymentName: Name of the deployment
// modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Collaborator

Can't we deploy the same model name in different namespaces?

func NewTestCacheWithPods(pods []*v1.Pod, model string) *Store {
c := &Store{}
// NewForTest initializes the cache store for testing purposes, it can be repeated call for reset.
func NewForTest() *Store {
Collaborator

Can this move to cache_test.go?

Collaborator Author

This makes sense. However, this set of functions is designed to be called by tests in other packages (e.g., the routing algorithms); moving NewForTest() to cache_test.go would make them inaccessible from there.

return
}

for _, key := range keys {
Collaborator

Thoughts on doing it concurrently?

Collaborator Author

The profile update is off the request path, and there is no urgency to keep the profile real-time. On the other hand, concurrent updates can cause unnecessary network bursts (Redis GETs), memory footprint (stored bytes), and CPU footprint (unmarshalling).

import "errors"

var (
ErrorTypeMetricNotFound = &CacheError{error: errors.New("metric not found")}
Collaborator

Besides these two errors, can any other errors in Cache be abstracted?

Collaborator Author

There are many, but this PR is not really about error standardization. I created this abstraction for my part of the code, and hopefully other collaborators will see the need and adopt this approach.

Cost float64 `json:"cost"`
Tputs [][]float64 `json:"tputs"` // Max RPS per correspondent index.
Indexes [][]float64 `json:"indexes"` // [output tokens, input tokens]
Created float64 `json:"created"`
Collaborator

What does created mean here? It looks like a boolean variable name.

Collaborator Author

@zhangjyr zhangjyr Jun 26, 2025

It is a timestamp in Unix sec.sub-sec format, used to test whether the profile has been updated. I will add a comment for this field.
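For reference, a Unix sec.sub-sec timestamp like the Created value can be produced as follows (illustrative helper and package names):

package profilesketch

import "time"

// unixSecondsWithFraction returns the current time as Unix seconds with a
// fractional part, e.g. 1748888478.258241, matching the Created field format.
func unixSecondsWithFraction() float64 {
	return float64(time.Now().UnixNano()) / 1e9
}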

Collaborator

Thoughts on CreatedAt?

)

const (
profile1 = "{\"gpu\": \"simulator-llama2-7b-a100\", \"cost\": 1.0, \"tputs\": [[62.4770592037908, 32.3132063155413, 16.794872306236712, 8.506609663869995, 4.25505318548248], [60.316466916176516, 31.743326005665654, 16.648846618248662, 8.456487777894496, 4.232097897906285], [31.390153356402557, 31.44077673744191, 16.35548438044074, 8.319937915785582, 4.195196435104038], [29.659845691502795, 29.443006402919238, 15.737020124678372, 8.09950347977996, 4.097482292889981], [15.195016931044663, 15.141240615344003, 8.084202765314256, 7.635497267355755, 3.9453497193741334], [7.6749238663857575, 7.646377471511056, 7.469959534415053, 4.021491925138638, 2.0825276518486944], [3.8569367968050043, 3.8474025810402654, 3.788259676585335, 2.0288591107923666, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]], \"indexes\": [[4, 8, 16, 32, 64, 128, 256, 512], [128, 256, 512, 1024, 2048]], \"created\": 1748888478.258241, \"e2e\": [[0.09952875833841972, 0.12239365121233277, 0.21428067207685672, 0.31731165578588844, 0.5534239103773143], [0.17259948405786418, 0.19343522920506076, 0.3083386933861766, 0.43137065838091077, 0.6822485379409045], [0.2401115512580145, 0.34489422958809884, 0.47929574372596107, 0.6583689095533919, 0.940295041234931], [0.48604107543011194, 0.6922839175269473, 0.827185874566203, 1.0668736191641073, 1.4869349837116896], [0.8164577254245523, 0.9879068507999181, 1.041013903748244, 2.1203144049376714, 2.666910813357681], [1.51887807746185, 1.7140269558737053, 2.342019975812873, 2.50050834288937, 2.8703157891728917], [2.9778688216709996, 3.245595489592524, 3.9633901528839486, 4.021366707490524, 4.36835270334268], [5.479355860430514, 5.721908029631013, 6.244554732975084, 7.508401151361177, 11.322325259115313]], \"slos\": {\"percentile\": 99, \"e2e\": 5.0}}"
Collaborator

Suggest separating this from the code into cache/testdata/xx-xxx-xx-profile.json, so the test case can be extended with different scenarios.

Collaborator Author

Good idea, but let's make the change if more test cases are added here.

// PendingLoad = 1 / PendingRequests = 1 / (Throughput * Latency),
// where PendingRequests = Throughput * Latency follows Little's Law,
// and Throughput(RPS) and Latency are from loaded profile per feature (input tokens, output tokens)
type PendingLoadProvider struct {
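A hedged sketch of the formula in the comment above, not the PR's actual implementation: Little's Law gives PendingRequests = Throughput × Latency at a given (input tokens, output tokens) profile point, and each request then accounts for 1 / PendingRequests of that capacity.

package loadsketch

import "math"

// pendingLoad computes the per-request load for one profile feature point,
// where rps and latencySeconds would be looked up from the GPU profile by
// the request's (input tokens, output tokens) bucket.
func pendingLoad(rps, latencySeconds float64) float64 {
	pendingRequests := rps * latencySeconds // Little's Law: L = λ * W
	if pendingRequests <= 0 {
		return math.Inf(1) // no capacity at this feature point
	}
	return 1 / pendingRequests
}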
Collaborator

Can you add a test case for the pending load provider?

Collaborator Author

What test cases would you like to be added?

"github.com/vllm-project/aibrix/pkg/types"
)

const DefaultFallbackAlgorithm types.RoutingAlgorithm = RouterRandom
Collaborator

So if the algorithm does not explicitly set the fallback strategy, we use random, right?

Collaborator

Can we remove SelectRandomPodAsFallback in pkg/plugins/gateway/algorithms/util.go in favor of the current, more elegant approach?

Collaborator Author

@zhangjyr zhangjyr Jun 26, 2025

So if the algorithm does not explicitly set the fallback strategy, we use random, right?

If the algorithm does not explicitly set the fallback, random will be used. Yes.

Can we remove SelectRandomPodAsFallback in pkg/plugins/gateway/algorithms/util.go in favor of the current, more elegant approach?

Let's open a new code-refactoring issue after this PR is merged. This PR is already too large, so I'm refraining from further refactoring.

for i := 0; i < iterration; i++ {
req := "hello test"
targetPod := getTargetPodFromChatCompletion(t, req, "slo")
assert.NotEmpty(t, targetPod, "target pod should not be empty")
Collaborator

Nit, can we enrich the e2e?

Collaborator Author

Lol, you got me. The test was there because I thought unit testing could not be done for the SLO policy. However, the current slo_test.go in the algorithms package covers most of the e2e tests. I will remove this part of the tests.

continue // Skip to the next key
}

c.UpdateModelProfile(key, &updated, false)
Collaborator

A question on this: I wonder who/where adds the GPU profile cache entries to Redis — who is the producer?

Collaborator Author

The submission of profiles is a part of the previous GPU optimizer PR. See step 4 of https://aibrix.readthedocs.io/latest/features/heterogeneous-gpu.html for details.

Collaborator

Thanks!

Jingyuan Zhang added 2 commits June 26, 2025 14:20
…load_aware_routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	pkg/plugins/gateway/algorithms/router.go
@zhangjyr zhangjyr requested a review from Xunzhuo June 26, 2025 21:35
…ewForTest will not change global store and work as stateless as expected.

Signed-off-by: Jingyuan Zhang <[email protected]>
Collaborator

@Xunzhuo Xunzhuo left a comment

Thanks! This is massive; assuming there are no more conflicts, I will prioritize merging this PR.


Successfully merging this pull request may close these issues.

[RFC]: Load-aware pattern-based routing policy with profile support
4 participants