Model Loading Requests Contention #469

Open
GolanLevy opened this issue Jan 3, 2024 · 0 comments
Labels: bug (Something isn't working)

Describe the bug

The model loading requests are not balanced evenly across predictors.
At any given moment, the system can receive many requests for different (mostly unloaded) models.
Instead of the loading requests being balanced across all predictors, we see that one predictor can receive ~30 requests (out of ~50) while the other predictors are completely idle (both in terms of model loading and inference processing).
This creates temporary hotspots. The hotspots are not static: the popular predictor changes over time, resulting in "waves" of model loading requests per predictor (see the image of 3 different predictors over time below).

We suspect that every model loading request is routed to the same mm instance X because X sits at the top of the priority queue from the perspective of each of the mm instances.
Since it takes a few seconds for the system to notice that X is concurrently receiving many requests and should be considered "busy", X receives all the requests for a short period of time (the toy simulation below illustrates this).
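
As a sanity check of the hypothesis, here is a toy simulation (not ModelMesh code; the capacities, refresh period, and request rate are made-up numbers) of routers that all share a stale view of per-instance free capacity and always pick the top of the same priority queue:

```python
# Toy model of the suspected behavior, NOT actual ModelMesh logic.
# Assumption: every router ranks instances by a cached free-capacity view
# that only refreshes every few ticks, so between refreshes they all pick
# the same "best" instance.

NUM_PREDICTORS = 30
CAPACITY = 20            # model slots per predictor (made up)
REFRESH_PERIOD = 5       # ticks until busyness info propagates (made up)
REQUESTS_PER_TICK = 50

true_free = {p: CAPACITY for p in range(NUM_PREDICTORS)}
cached_free = dict(true_free)   # stale view shared by all routers

for tick in range(1, 21):
    placed = {p: 0 for p in range(NUM_PREDICTORS)}
    for _ in range(REQUESTS_PER_TICK):
        # Every router independently picks the top of the same queue.
        target = max(cached_free, key=cached_free.get)
        placed[target] += 1
        true_free[target] -= 1          # real load rises immediately...
    if tick % REFRESH_PERIOD == 0:
        cached_free = dict(true_free)   # ...but the shared view lags
    hot = max(placed, key=placed.get)
    print(f"tick {tick:2d}: predictor {hot:2d} took "
          f"{placed[hot]}/{REQUESTS_PER_TICK} load requests")
```

Running this prints a single predictor absorbing all 50 requests per tick until the cached view refreshes, at which point the hotspot jumps to the next predictor — the same "waves" pattern we observe.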

Is this hypothesis correct? If not, how can we debug this?

[Image: model loading request rate for 3 different predictors over time, showing the hotspot "waves"]

Configuration

We have a high-workload environment with thousands of registered models, each of which is requested every couple of minutes, resulting in a very high model-swap rate (many unloading and loading events in a short period of time).
We have a few dozen predictors, each of which can hold ~20 models.
The modelmesh containers are load balanced (round robin) over gRPC, with the mm-balanced header set to true on every request (see the client sketch below).
Modelmesh is configured to use rpm-based decisions (busyness, scaling, ...) rather than the experimental latency-based ones (is it worth trying?).
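
For completeness, this is roughly how our clients attach the header — a minimal sketch assuming the KServe v2 gRPC predict API; the generated module names, channel target, and model name are placeholders for our environment:

```python
import grpc

# Generated from KServe's grpc_predict_v2.proto; the module names may
# differ depending on how the stubs were generated.
from grpc_predict_v2_pb2 import ModelInferRequest
from grpc_predict_v2_pb2_grpc import GRPCInferenceServiceStub

channel = grpc.insecure_channel("modelmesh-serving:8033")  # placeholder target
stub = GRPCInferenceServiceStub(channel)

request = ModelInferRequest(model_name="example-model")    # placeholder model
response = stub.ModelInfer(
    request,
    # The header from our configuration above, set on every request.
    metadata=(("mm-balanced", "true"),),
)
```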

GolanLevy added the bug label on Jan 3, 2024