[Misc] SLO-aware router with profile support #1192
base: main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
I notice there are some refactoring changes (e.g., internal interface changes). Technically, those affect other aspects; could they be split into separate changes? I mean splitting into common parts (which stakeholders need to review) and SLO-specific changes (where review can be loose and the feature can be protected by a feature gate). If the splitting is too complicated, we can do a first round of review and decide how to move forward.
@Jeffwan, I think the only internal interface change is Select(). The function is called in only one place, and if you find it inappropriate, we can restore it.
// Parameters:
//   deploymentName: Name of the deployment
//   modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
TODO: we may use other objects to orchestrate pods in the future; in that case, the deployment assumption might change. This looks good for the moment.
One more problem: a deployment name without a namespace cannot uniquely identify a deployment. We need to append the namespace field.
If other objects are used instead of deployments, the GPU optimizer would have to change as well (it monitors deployments only). For Ray cluster support, let me keep a note, leave this comment open, and file an issue after merging.
Can you explain the cases where "a deployment without a namespace cannot be used to identify a deployment"?
I mean using namespace/deployment_name as the key.
The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]; the name is unique across namespaces given that:
- model names are unique across namespaces, and
- deployment names share the same namespace as their model.
Can't we deploy the same model name in different namespaces?
pkg/cache/cache_init.go (Outdated)
metaModels utils.SyncMap[string, *Model] // model_name -> *Model

// Deployment related storage
deploymentProfiles utils.SyncMap[string, *ModelGPUProfile] // deployment_name -> *ModelGPUProfile
Same here: we can use namespace/deployment as the key.
The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]; the name is unique across namespaces given that:
- model names are unique across namespaces, and
- deployment names share the same namespace as their model.
I've updated the comment here for clarification.
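For illustration, a minimal sketch of how such a cache key could be composed (the aibrix:profile_ prefix is from this thread; the helper name is hypothetical, not the PR's actual code):

```go
import "fmt"

// profileKey builds the cache/Redis key for a deployment's GPU profile.
// Uniqueness relies on model names being unique across namespaces and each
// deployment sharing its model's namespace, as noted above.
func profileKey(modelName, deploymentName string) string {
	return fmt.Sprintf("aibrix:profile_%s_%s", modelName, deploymentName)
}
```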
}
}
}
q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
Could it be a problem if another goroutine invokes physicalPosRLocked()? Can we introduce something like the following and use it in physicalPosRLocked() and setBaseCursor() in expand()?

func (q *SimpleQueue[V]) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}
baseCursor is updated only under Lock(), and physicalPosRLocked() runs under RLock(), so it cannot execute while Lock() is held unless it is called in the wrong context. The -RLocked() naming suffix asks developers to call this function only in RLock() contexts.
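For illustration, a minimal sketch of the locking convention described above (sync.RWMutex assumed; type and field names simplified from the PR):

```go
import "sync"

type SimpleQueue[V any] struct {
	mu         sync.RWMutex
	baseCursor int64
}

// physicalPosRLocked must be called with q.mu read-locked; the -RLocked
// suffix signals this contract. Under RLock(), expand() cannot hold the
// write lock, so reading baseCursor without atomics is race-free.
func (q *SimpleQueue[V]) physicalPosRLocked(virtualPos int64) int64 {
	return virtualPos - q.baseCursor
}

func (q *SimpleQueue[V]) expand(dequeuePos int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.baseCursor += dequeuePos // exclusive with all -RLocked readers
}
```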
pkg/types/router_context.go (Outdated)
debugDelay time.Duration
tokens     []int
predictor  OutputPredictor
One of my concerns is which fields can be used by routing algorithms when profiles are disabled. As a routing algorithm developer, which fields should I expect to be available when I enable/disable certain features?
The current SLO-family routers are profile-based; if the profile is disabled, we fall back to least-request. The profile is essential in the current implementation because it provides the model-based SLO target and the server capacity in terms of load. For a non-model-specific general SLO (e.g., 120 s latency), if the SLO can be set using environment variables, the SLO queue could also be combined with other stateless routers to provide some degree of improvement. OutputPredictor itself is profile-decoupled and can work independently.
queueOverallSLO          bool = false
monogenousGPURouting     bool = true
monogenousGPURoutingOnly bool = monogenousGPURouting && false
initialTotalSubQueues    int  = 8 // Expect no more than 8 subqueues
Are these magic numbers constants, or should they be adjusted based on the available resources?
They are constants and exist only for evaluation purposes. They are feature switches inherited from my router simulator: I started from the router simulator to find the best configuration and synced the switches for convenience.
@zhangjyr please rebase the branch one more time. I tried to merge it but noticed some rebase conflicts.
pkg/utils/pod.go (Outdated)
@@ -36,6 +37,17 @@ const (
	defaultPodMetricPort = 8000
)

var (
	ReplicaSetDeploymentFinder = regexp.MustCompile(`^(.*)-\w+$`)     // Deployment-[random name]
	RayClusterFleatFinder      = regexp.MustCompile(`^(.*)-\w+-\w+$`) // RayClusterFleat-[random name]-[random name]
Spelling: fleat -> fleet.
targetPodSet chan struct{}
targetPod    atomic.Pointer[v1.Pod]
debugDelay   time.Duration
lastError    atomic.Pointer[error]
tokens       []int // Cache of tokenized prompts
Can you elaborate on the comment for tokens?
Do you mean how it is used or how it is created? For use, this is just a private cache, because promptLength might be called multiple times during routing.
Both: what tokens are stored and when they are used.
Given that the routing context now carries a lot of per-request metadata, a README can be added in a follow-up PR.
The first time PromptTokens() or PromptLength() is called, the tokens are cached in the context. I primarily use the prompt length for output prediction, so the prefix-cache method of tokenization may not be suitable. Although a model-specific tokenizer would be preferred, the current output prediction only predicts on coarse-grained log2 buckets, so I think tiktoken fits.
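As an aside, coarse-grained log2 bucketing of the prompt length might look like this sketch (hypothetical helper, not the PR's exact code):

```go
import "math/bits"

// lengthBucket maps a token count to a log2 bucket so that nearby prompt
// lengths share one prediction bucket, e.g. 100 tokens -> bucket 6.
func lengthBucket(numTokens int) int {
	if numTokens <= 0 {
		return 0
	}
	return bits.Len(uint(numTokens)) - 1 // floor(log2(numTokens))
}
```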
func (r *RoutingContext) PromptTokens() ([]int, error) {
	if r.tokens == nil {
		var err error
		r.tokens, err = utils.TokenizeInputText(r.Message)
It will depend on the tokenizer used.
utils.TokenizeInputText specifically uses tiktoken; I optimized the function by reusing the tiktoken object.
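For reference, reusing a single tiktoken encoder could look like this sketch (assuming the github.com/pkoukk/tiktoken-go package and a lazily shared encoder; not necessarily the PR's exact code):

```go
import (
	"sync"

	tiktoken "github.com/pkoukk/tiktoken-go"
)

var (
	encOnce sync.Once
	enc     *tiktoken.Tiktoken
	encErr  error
)

// TokenizeInputText tokenizes text with a lazily initialized, shared encoder
// instead of constructing a new tiktoken object on every request.
func TokenizeInputText(text string) ([]int, error) {
	encOnce.Do(func() {
		enc, encErr = tiktoken.GetEncoding("cl100k_base")
	})
	if encErr != nil {
		return nil, encErr
	}
	return enc.Encode(text, nil, nil), nil
}
```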
util, err := r.provider.GetUtilization(ctx, pod)
if err != nil {
	lastErr = r.updateError(lastErr, err)
	klog.ErrorS(err, "Skipped pod due to fail to get utilization in leastLoadRouter", "pod", pod.Name)
Log it with the request_id; also, this has the potential to flood the logs.
Yes, I've added a new feature to avoid multiple error logs for the same request. In particular, a missing or unsupported profile can be a common problem across all pods; only one error log is needed if that happens.
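A minimal sketch of such per-request deduplication (hypothetical; the actual updateError in the PR may differ):

```go
import "k8s.io/klog/v2"

// updateError records err for the request and suppresses repeated logs when
// every pod fails for the same reason (e.g., a missing profile).
func updateError(lastErr, err error) error {
	if lastErr != nil && lastErr.Error() == err.Error() {
		return lastErr // identical failure already logged for this request
	}
	klog.ErrorS(err, "Skipped pod")
	return err
}
```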
pod.Name, pod.Status.PodIP, util)

var consumption float64
if r.pulling {
What does pulling mean here?
"pulling" means the pull mode in the PR description:
"Below is a comparison of pulling mode and default push mode:
Push mode: The router dispatches requests to the server, possibly overloading the server.
Pull mode: The server pulls requests from the router based on the server's capacity.
With profile support, the gateway now has server capacity knowledge and can achieve pull mode within the gateway."
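To make the distinction concrete, here is a self-contained sketch of pull mode (all names hypothetical): each server drains a shared queue only when it has a free capacity slot, so the router never pushes beyond capacity.

```go
package main

import (
	"fmt"
	"time"
)

type Request struct{ ID int }

// pullLoop lets a server pull from the shared queue at its own pace;
// the buffered slots channel models the server's profiled concurrency.
func pullLoop(name string, capacity int, queue <-chan Request) {
	slots := make(chan struct{}, capacity)
	for req := range queue {
		slots <- struct{}{} // blocks while the server is at capacity
		go func(r Request) {
			defer func() { <-slots }()
			time.Sleep(100 * time.Millisecond) // simulate inference
			fmt.Printf("%s served request %d\n", name, r.ID)
		}(req)
	}
}

func main() {
	queue := make(chan Request, 64) // the router's queue
	go pullLoop("pod-a", 2, queue)
	go pullLoop("pod-b", 4, queue)
	for i := 0; i < 12; i++ {
		queue <- Request{ID: i}
	}
	time.Sleep(time.Second)
}
```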
)

const (
	RouterSLO types.RoutingAlgorithm = "slo"
Can you add a README for these algorithms?
Thanks for the work, left some small comments!
@@ -113,5 +113,17 @@ test-gateway2:
	"max_tokens": 512 \
	}'

test-router:
Can this make target be renamed to test-slo-router?
In fact, test-router is just for showcasing. I can change the strategy to least-request.
func NewTestCacheWithPods(pods []*v1.Pod, model string) *Store {
	c := &Store{}

// NewForTest initializes the cache store for testing purposes; it can be called repeatedly to reset.
func NewForTest() *Store {
Can this be moved to cache_test.go?
This makes sense. However, this set of functions is designed to be called by tests in other packages (e.g., the routing algorithms); moving NewForTest() to cache_test.go would make them inaccessible from there.
	return
}

for _, key := range keys {
Thoughts on doing it concurrently?
The profile update is off the request path, and there is no urgency to keep the profile real-time. On the other hand, concurrent updates can cause unnecessary network bursts (Redis GETs), memory footprint (stored bytes), and CPU footprint (unmarshaling).
import "errors" | ||
|
||
var ( | ||
ErrorTypeMetricNotFound = &CacheError{error: errors.New("metric not found")} |
Besides these two errors, can any other errors in Cache be abstracted?
There are many. But this PR is not really about error standardization. I created this abstraction for my part of the code, and hopefully other collaborators will see the necessity and adopt this approach.
pkg/cache/model_gpu_profile.go (Outdated)
Cost    float64     `json:"cost"`
Tputs   [][]float64 `json:"tputs"`   // Max RPS per corresponding index.
Indexes [][]float64 `json:"indexes"` // [output tokens, input tokens]
Created float64     `json:"created"`
What does created mean here? It looks like a boolean variable name.
It is a timestamp in Unix sec.sub-sec format, used to test whether the profile has been updated. I will add a comment for this field.
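For reference, such a sec.sub-sec timestamp can be produced in Go as follows (sketch):

```go
import "time"

// unixSeconds returns the current time as Unix seconds with a fractional
// part, matching the profile's created field (e.g., 1748888478.258241).
func unixSeconds() float64 {
	return float64(time.Now().UnixNano()) / 1e9
}
```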
Thoughts on CreatedAt?
)

const (
profile1 = "{\"gpu\": \"simulator-llama2-7b-a100\", \"cost\": 1.0, \"tputs\": [[62.4770592037908, 32.3132063155413, 16.794872306236712, 8.506609663869995, 4.25505318548248], [60.316466916176516, 31.743326005665654, 16.648846618248662, 8.456487777894496, 4.232097897906285], [31.390153356402557, 31.44077673744191, 16.35548438044074, 8.319937915785582, 4.195196435104038], [29.659845691502795, 29.443006402919238, 15.737020124678372, 8.09950347977996, 4.097482292889981], [15.195016931044663, 15.141240615344003, 8.084202765314256, 7.635497267355755, 3.9453497193741334], [7.6749238663857575, 7.646377471511056, 7.469959534415053, 4.021491925138638, 2.0825276518486944], [3.8569367968050043, 3.8474025810402654, 3.788259676585335, 2.0288591107923666, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]], \"indexes\": [[4, 8, 16, 32, 64, 128, 256, 512], [128, 256, 512, 1024, 2048]], \"created\": 1748888478.258241, \"e2e\": [[0.09952875833841972, 0.12239365121233277, 0.21428067207685672, 0.31731165578588844, 0.5534239103773143], [0.17259948405786418, 0.19343522920506076, 0.3083386933861766, 0.43137065838091077, 0.6822485379409045], [0.2401115512580145, 0.34489422958809884, 0.47929574372596107, 0.6583689095533919, 0.940295041234931], [0.48604107543011194, 0.6922839175269473, 0.827185874566203, 1.0668736191641073, 1.4869349837116896], [0.8164577254245523, 0.9879068507999181, 1.041013903748244, 2.1203144049376714, 2.666910813357681], [1.51887807746185, 1.7140269558737053, 2.342019975812873, 2.50050834288937, 2.8703157891728917], [2.9778688216709996, 3.245595489592524, 3.9633901528839486, 4.021366707490524, 4.36835270334268], [5.479355860430514, 5.721908029631013, 6.244554732975084, 7.508401151361177, 11.322325259115313]], \"slos\": {\"percentile\": 99, \"e2e\": 5.0}}" |
Suggest separating this from the code into cache/testdata/xx-xxx-xx-profile.json, so the test cases can be extended with different scenarios.
Good idea, but let's make the change if more test cases are added here.
// PendingLoad = 1 / PendingRequests = 1 / (Throughput * Latency),
// where PendingRequests = Throughput * Latency follows Little's Law,
// and Throughput (RPS) and Latency come from the loaded profile per feature (input tokens, output tokens).
type PendingLoadProvider struct {
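For reference, a minimal sketch of the Little's Law arithmetic behind PendingLoadProvider (example numbers taken from the profile constant above; the real provider interpolates over the profile's token-bucket indexes):

```go
package main

import "fmt"

// pendingLoad returns one request's share of a server's capacity:
// PendingLoad = 1 / PendingRequests = 1 / (Throughput * Latency).
func pendingLoad(rps, latencySec float64) float64 {
	return 1.0 / (rps * latencySec)
}

func main() {
	// One (output, input) bucket from the profile: ~16.79 RPS at ~0.214 s e2e.
	rps, latency := 16.794872306236712, 0.21428067207685672
	fmt.Printf("pending requests (capacity): %.2f\n", rps*latency) // ~3.60
	fmt.Printf("load per request: %.3f\n", pendingLoad(rps, latency))
}
```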
Can you add a test case for the pending load provider?
What test cases would you like to be added?
"github.com/vllm-project/aibrix/pkg/types" | ||
) | ||
|
||
const DefaultFallbackAlgorithm types.RoutingAlgorithm = RouterRandom |
So if the algorithm does not explicitly set the fallback strategy, we use random, right?
Can we remove SelectRandomPodAsFallback in pkg/plugins/gateway/algorithms/util.go in favor of the current, more elegant way?
So if the algorithm does not explicitly set the fallback strategy, we use random, right?
If the algorithm does not explicitly set the fallback, random will be used. Yes.
Can we remove SelectRandomPodAsFallback in pkg/plugins/gateway/algorithms/util.go in favor of the current, more elegant way?
Let's open a new code-refactoring issue after this PR is merged. This PR is already too large; I'm refraining from further refactoring.
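For context, the default-fallback resolution discussed above could look like this hypothetical sketch (the interface assertion and function name are assumed, not the PR's actual code):

```go
// fallbackFor resolves a router's fallback algorithm, defaulting to random
// when the router does not declare one explicitly.
func fallbackFor(r types.Router) types.RoutingAlgorithm {
	if f, ok := r.(interface{ Fallback() types.RoutingAlgorithm }); ok {
		return f.Fallback()
	}
	return DefaultFallbackAlgorithm // RouterRandom
}
```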
test/e2e/routing_strategy_test.go (Outdated)
for i := 0; i < iterration; i++ {
	req := "hello test"
	targetPod := getTargetPodFromChatCompletion(t, req, "slo")
	assert.NotEmpty(t, targetPod, "target pod should not be empty")
Nit: can we enrich the e2e tests?
Lol, you got me. The test was there because I thought unit testing could not be done for the SLO policy. However, the current slo_test.go in the algorithms package covers most of these e2e tests. I will remove this part of the tests.
	continue // Skip to the next key
}

c.UpdateModelProfile(key, &updated, false)
A question on this: I wonder who/where adds the GPU profile cache to Redis; who is the producer?
The submission of profiles is part of the previous GPU optimizer PR. See step 4 of https://aibrix.readthedocs.io/latest/features/heterogeneous-gpu.html for details.
Thanks!
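For context, the producer side (the GPU optimizer profiling step) publishes each profile to Redis under the key format discussed earlier; a hedged sketch with go-redis (client setup and function name assumed, not the actual optimizer code):

```go
import (
	"context"
	"encoding/json"

	"github.com/redis/go-redis/v9"
)

// publishProfile marshals a profile and stores it without expiry, so the
// gateway's periodic update loop can pick it up.
func publishProfile(ctx context.Context, rdb *redis.Client, key string, profile any) error {
	payload, err := json.Marshal(profile)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, key, payload, 0).Err()
}
```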
Thanks! This is massive; assuming no more conflicts, I will prioritize the PR merge.
Pull Request Description
This PR introduces an SLO-aware router with profile support, adding three new SLO-aware routing policies.
All three routing policies prioritize requests against a profiled SLO target.
In addition to the SLO-family routing policies, this PR adds built-in queues to support request reordering and future delay scheduling. In particular, QueueRouter enables the pull mode within the gateway. Below is a comparison of pull mode and the default push mode:
- Push mode: The router dispatches requests to the server, possibly overloading the server.
- Pull mode: The server pulls requests from the router based on the server's capacity.
With profile support, the gateway now has server capacity knowledge and can achieve pull mode within the gateway.
Additional features were also added in this PR.
Preliminary results show the SLO policy can achieve the SLO target for a composite workload on heterogeneous GPUs:

Workload: mixed ShareGPT and BIRD workloads at a ratio of 7:4
GPU: 1x A10, 4x L20
SLO: latency per token 0.05 s
Related Issues
Resolves: #642 #606