[Misc] SLO-aware router with profile support #1192

Open
wants to merge 9 commits into
base: main
3 changes: 2 additions & 1 deletion cmd/plugins/main.go
@@ -33,6 +33,7 @@ import (
extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
"github.com/vllm-project/aibrix/pkg/cache"
"github.com/vllm-project/aibrix/pkg/plugins/gateway"
routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
"github.com/vllm-project/aibrix/pkg/utils"
"google.golang.org/grpc/health"
healthPb "google.golang.org/grpc/health/grpc_health_v1"
@@ -77,7 +78,7 @@ func main() {
panic(err)
}

cache.InitForGateway(config, stopCh, redisClient)
cache.InitForGateway(config, stopCh, redisClient, routing.ModelRouterFactory)

k8sClient, err := kubernetes.NewForConfig(config)
if err != nil {
12 changes: 6 additions & 6 deletions config/gateway/gateway-plugin/gateway-plugin.yaml
@@ -2,7 +2,7 @@ apiVersion: v1
kind: Service
metadata:
name: gateway-plugins
namespace: aibrix-system
namespace: system
labels:
app: gateway-plugins
annotations:
@@ -30,7 +30,7 @@ apiVersion: apps/v1
kind: Deployment
metadata:
name: gateway-plugins
namespace: aibrix-system
namespace: system
spec:
strategy:
type: RollingUpdate
@@ -133,7 +133,7 @@ apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: reserved-router-models-endpoint
namespace: aibrix-system
namespace: system
spec:
parentRefs:
- name: aibrix-eg
@@ -150,7 +150,7 @@ apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
name: skip-ext-proc
namespace: aibrix-system
namespace: system
spec:
targetRef:
group: gateway.networking.k8s.io
@@ -165,7 +165,7 @@ apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: reserved-router
namespace: aibrix-system
namespace: system
spec:
parentRefs:
- name: aibrix-eg
@@ -185,7 +185,7 @@ apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
name: gateway-plugins-extension-policy
namespace: aibrix-system
namespace: system
spec:
targetRef:
group: gateway.networking.k8s.io
38 changes: 38 additions & 0 deletions config/overlays/dev/gateway-plugin/kustomization.yaml
@@ -0,0 +1,38 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: aibrix-system

namePrefix: aibrix-

resources:
- ../../../gateway/gateway-plugin

images:
- name: gateway-plugins
newName: aibrix/gateway-plugins
newTag: nightly

patches:
- patch: |- # Use the '|' and '-' for inline patching
apiVersion: apps/v1
kind: Deployment
metadata:
name: gateway-plugins
spec:
template:
spec:
containers:
- name: gateway-plugin
args:
- -v=5
env:
- name: AIBRIX_POD_METRIC_REFRESH_INTERVAL_MS
value: "60000"
- name: AIBRIX_GPU_OPTIMIZER_TRACING_FLAG
value: "true"
target:
kind: Deployment
name: gateway-plugins
namespace: system
version: v1
4 changes: 1 addition & 3 deletions config/overlays/dev/manager/kustomization.yaml
@@ -28,6 +28,4 @@ patches:
kind: Deployment
name: controller-manager
namespace: system
version: v1

apiVersion: kustomize.config.k8s.io/v1beta1
version: v1
39 changes: 39 additions & 0 deletions config/overlays/vke-dev/gateway-plugin/gateway_plugins_patch.yaml
@@ -0,0 +1,39 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: gateway-plugins
namespace: aibrix-system
spec:
replicas: 1
template:
spec:
affinity:
nodeAffinity: # prefer non-GPU nodes so the gateway pod is kept off GPU nodes where possible.
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: vke.node.gpu.schedule
operator: NotIn
values:
- nvidia
containers:
- name: gateway-plugin
resources:
limits:
cpu: "2"
memory: 8Gi
requests:
cpu: "2"
memory: 8Gi
env:
- name: AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE
value: "character"
- name: AIBRIX_PREFIX_CACHE_BLOCK_SIZE
value: "128"
- name: AIBRIX_PREFIX_CACHE_BLOCK_NUMBER
value: "200000"
- name: AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT
value: "16"
- name: AIBRIX_PREFIX_CACHE_STANDARD_DEVIATION_FACTOR
value: "2"
11 changes: 5 additions & 6 deletions config/overlays/vke-dev/gateway-plugin/kustomization.yaml
@@ -1,17 +1,16 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: aibrix-system

namePrefix: aibrix-

resources:
- ../../../gateway/gateway-plugin
- ../../dev/gateway-plugin

patches:
- path: gateway_plugins_patch.yaml

images:
- name: busybox
newName: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/busybox
newTag: stable
- name: gateway-plugins
- name: aibrix/gateway-plugins
newName: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/gateway-plugins
newTag: nightly
12 changes: 12 additions & 0 deletions development/app/Makefile
@@ -113,5 +113,17 @@ test-gateway2:
"max_tokens": 512 \
}'

test-router:
Collaborator

Can this make target be renamed to test-slo-router?

Collaborator Author

In fact, test-router is only meant as a showcase. I can change the strategy to least-request.

curl -v http://localhost:8888/v1/chat/completions \
-H "model: llama2-7b" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
-H "routing-strategy: least-request" \
-d '{ \
"model": "llama2-7b", \
"messages": [{"role": "user", "content": "Say this is a test!"}], \
"temperature": 0.7 \
}'

metrics:
curl http://localhost:8000/metrics
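
For readers who would rather drive the test-router request from code than from curl, a minimal Go sketch follows. It builds the same request with the same headers, including routing-strategy, and only prints it rather than sending it. The helper name newRoutedRequest and the request structs are assumptions made for illustration; they are not part of this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage and chatRequest mirror the JSON body used by the
// test-router make target. They are illustrative types, not AIBrix types.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model       string        `json:"model"`
	Messages    []chatMessage `json:"messages"`
	Temperature float64       `json:"temperature"`
}

// newRoutedRequest (hypothetical helper) builds a chat completion request
// that selects a routing strategy through the routing-strategy header.
func newRoutedRequest(baseURL, model, strategy string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:       model,
		Messages:    []chatMessage{{Role: "user", Content: "Say this is a test!"}},
		Temperature: 0.7,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("model", model)
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer any_key")
	req.Header.Set("routing-strategy", strategy)
	return req, nil
}

func main() {
	req, err := newRoutedRequest("http://localhost:8888", "llama2-7b", "least-request")
	if err != nil {
		panic(err)
	}
	// Print instead of sending, so the sketch runs without a gateway.
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("routing-strategy:", req.Header.Get("routing-strategy"))
}
```

Passing the request to http.DefaultClient.Do would send it to a gateway listening on port 8888, as the make target assumes.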
31 changes: 28 additions & 3 deletions pkg/cache/cache_api.go
@@ -28,6 +28,9 @@ type Cache interface {
ModelCache
MetricCache
RequestTracker
ProfileCache
types.OutputPredictorProvider
types.RouterProvider
}

// PodCache defines operations for pod information caching
@@ -106,7 +109,10 @@ type MetricCache interface {

// RequestTracker defines operations for tracking workload statistics
type RequestTracker interface {
// AddRequestCount starts tracking request count
// AddRequestCount tracks the start of a request after routing.
// To support real-time statistics updates and access, AddRequestCount can be called multiple times for a request.
// As a result, implementations should ensure thread-safe access to the counter and idempotency.
//
// Parameters:
// ctx: Routing context
// requestID: Unique request identifier
@@ -115,14 +121,18 @@ type RequestTracker interface {
// int64: Trace term identifier
AddRequestCount(ctx *types.RoutingContext, requestID string, modelName string) (traceTerm int64)

// DoneRequestCount completes request count tracking, only one DoneRequestXXX should be called for a request
// DoneRequestCount tracks the completion of a request without usage information like inputTokens and outputTokens.
// Only one DoneRequestXXX should be called for a request. Idempotency is not required.
//
// Parameters:
// requestID: Unique request identifier
// modelName: Name of the model
// traceTerm: Trace term identifier
DoneRequestCount(ctx *types.RoutingContext, requestID string, modelName string, traceTerm int64)

// DoneRequestTrace completes request tracing, only one DoneRequestXXX should be called for a request
// DoneRequestTrace tracks the completion of a request with usage information like inputTokens and outputTokens.
// Only one DoneRequestXXX should be called for a request. Idempotency is not required.
//
// Parameters:
// ctx: Routing context
// requestID: Unique request identifier
@@ -132,3 +142,18 @@ type RequestTracker interface {
// traceTerm: Trace term identifier
DoneRequestTrace(ctx *types.RoutingContext, requestID string, modelName string, inputTokens, outputTokens, traceTerm int64)
}
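
The RequestTracker comments above ask for an AddRequestCount that is safe to call concurrently and more than once per request. One way that contract could be met is sketched below, assuming a mutex-guarded set of in-flight request IDs. The requestCounter type and its simplified method signatures are hypothetical; the real interface also takes a routing context and returns a trace term, both omitted here.

```go
package main

import (
	"fmt"
	"sync"
)

// requestCounter is a hypothetical, simplified tracker. It is not the
// AIBrix implementation and omits the routing context and trace terms.
type requestCounter struct {
	mu       sync.Mutex
	inFlight map[string]string // requestID -> modelName
	perModel map[string]int    // modelName -> running request count
}

func newRequestCounter() *requestCounter {
	return &requestCounter{
		inFlight: make(map[string]string),
		perModel: make(map[string]int),
	}
}

// Add counts a request once, no matter how many times it is called for
// the same requestID. It reports whether this call was the first one.
func (c *requestCounter) Add(requestID, modelName string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, seen := c.inFlight[requestID]; seen {
		return false // repeated call: leave the counter untouched
	}
	c.inFlight[requestID] = modelName
	c.perModel[modelName]++
	return true
}

// Done releases a request. The interface expects exactly one Done call
// per request, so the lookup here is only a defensive measure.
func (c *requestCounter) Done(requestID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if model, ok := c.inFlight[requestID]; ok {
		delete(c.inFlight, requestID)
		c.perModel[model]--
	}
}

// Running returns the number of in-flight requests for a model.
func (c *requestCounter) Running(modelName string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.perModel[modelName]
}

func main() {
	c := newRequestCounter()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Add("req-1", "llama2-7b") // same request reported 8 times
		}()
	}
	wg.Wait()
	fmt.Println(c.Running("llama2-7b")) // prints 1
	c.Done("req-1")
	fmt.Println(c.Running("llama2-7b")) // prints 0
}
```

A single mutex keeps the sketch short; an implementation on a hot routing path might prefer sharded locks or atomics, which this example does not attempt.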

// ProfileCache defines operations for model profiles
type ProfileCache interface {
// GetModelProfileByPod gets model profile for a pod
// Parameters:
// pod: Pod object
// modelName: Name of the model
GetModelProfileByPod(pod *v1.Pod, modelName string) (*ModelGPUProfile, error)

// GetModelProfileByDeploymentName gets model profile for a deployment
// Parameters:
// deploymentName: Name of the deployment
// modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Collaborator

TODO: we may use other objects to orchestrate pods in the future, in which case the deployment-based lookup may have to change. This looks good for now.

One more problem: a deployment name without a namespace cannot uniquely identify a deployment. We need to add the namespace field.

Collaborator Author

If pods were orchestrated by other objects, the GPU optimizer would have to change as well (it monitors deployments only). For Ray cluster support, let me keep a note, leave this comment open, and file an issue after merging.

Can you explain the cases where "deployment without namespace can not be used to identify a deployment"?

Collaborator

I mean using namespace/deployment_name as the key.

Collaborator Author

The key is in fact in the format aibrix:profile_[model_name]_[deployment_name]. The name is unique across namespaces given that:

  1. Model names are unique across namespaces.
  2. Each deployment_name is in the same namespace as its model name.

Collaborator

Can't we deploy the same model name in different namespaces?

}
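
The open question in the thread above is whether a key of the form aibrix:profile_[model_name]_[deployment_name] stays unique when the same model name is deployed in two namespaces. The sketch below illustrates the concern. profileKey follows the format quoted in the discussion; namespacedProfileKey is a hypothetical variant based on the reviewer's namespace/deployment_name suggestion and is not part of this PR. The model, deployment, and namespace names are made up.

```go
package main

import "fmt"

// profileKey follows the key format quoted in the discussion:
// aibrix:profile_[model_name]_[deployment_name].
func profileKey(modelName, deploymentName string) string {
	return fmt.Sprintf("aibrix:profile_%s_%s", modelName, deploymentName)
}

// namespacedProfileKey is a hypothetical variant that qualifies the
// deployment with its namespace, as the reviewer suggested.
func namespacedProfileKey(namespace, modelName, deploymentName string) string {
	return fmt.Sprintf("aibrix:profile_%s_%s/%s", modelName, namespace, deploymentName)
}

func main() {
	// Two deployments with identical model and deployment names,
	// imagined to live in namespaces "team-a" and "team-b".
	a := profileKey("llama2-7b", "llama2-7b-a100")
	b := profileKey("llama2-7b", "llama2-7b-a100")
	fmt.Println(a == b) // prints true: both deployments map to one key

	na := namespacedProfileKey("team-a", "llama2-7b", "llama2-7b-a100")
	nb := namespacedProfileKey("team-b", "llama2-7b", "llama2-7b-a100")
	fmt.Println(na == nb) // prints false: the namespace separates them
}
```

If model names really are unique across namespaces, as the author states, the first case never arises and the shorter key is sufficient; the variant only matters if that assumption is relaxed.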