Describe the bug
The model loading requests are not balanced evenly across predictors.
At any given moment, the system can receive many requests for different (mostly unloaded) models.
Instead of balancing the loading requests across all predictors, we see that one predictor can receive ~30 requests (out of ~50) while other predictors are completely idle (both in terms of model loading and inference processing).
This obviously creates temporary hotspots. These hotspots are not static, as the popular predictor changes over time, resulting in "waves" of model loading requests per predictor (see the image of 3 different predictors over time).
We suspect that each model loading request is routed to the same mm instance X since X is at the top of the priority queue from the perspective of each of the mm instances.
Since it takes a few seconds for the system to understand that X is concurrently receiving many requests and should be considered "busy", X receives all the requests for a short period of time.
Is this hypothesis correct? If not, how can we debug this?
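To make the suspected behaviour concrete, here is a toy Python simulation of the hypothesis (not modelmesh's actual placement logic; the predictor count, burst size, and slot numbers are our assumptions): every instance routes placements using the same cached/stale capacity ranking, so during one burst all load requests land on the same predictor X.

```python
import random
from collections import Counter

NUM_PREDICTORS = 30        # "a few dozen predictors" (assumed for the sketch)
BURST_SIZE = 50            # ~50 concurrent model-load requests in one burst

# Every mm instance routes placements using a *cached* view of peer capacity,
# which (per our hypothesis) is only refreshed every few seconds.
cached_free_slots = {p: random.randint(1, 20) for p in range(NUM_PREDICTORS)}

placements = Counter()
for _ in range(BURST_SIZE):
    # All instances see the same stale priority queue, so they all pick the
    # same "least busy" predictor X for the whole burst.
    target = max(cached_free_slots, key=cached_free_slots.get)
    placements[target] += 1
    # The real load on X grows, but the cached ranking does not change within
    # the burst, so X keeps winning until the next refresh.

print(placements)   # -> a single predictor absorbs all ~50 placements
```

If placement instead reacted to the live load (or added jitter / a power-of-two-choices step), the burst would spread across predictors, which is what we expected to see.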
Configuration
We have a high-workload environment containing thousands of registered models, each of which is requested every couple of minutes, resulting in a very high model-swap rate (many unloading and loading events in a short period of time).
We have a few dozen predictors, each of which can hold ~20 models.
The modelmesh containers are load balanced (round robin) using gRPC, with the mm-balanced header set to true.
Modelmesh is configured to use rpm-based decisions (busyness, scaling...) and not the experimental latency-based ones (is it worth trying?).
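For reference, this is roughly how we attach the header on the client side (a minimal sketch; the address, model name, and the elided stub/request are placeholders for our actual KServe gRPC inference client, and the comment reflects our understanding of the header's meaning):

```python
import grpc

# Placeholder channel: in our setup this points at the round-robin
# load-balanced modelmesh-serving Service (default gRPC port 8033).
channel = grpc.insecure_channel("modelmesh-serving:8033")

# As we understand it, mm-balanced=true tells modelmesh that incoming requests
# are already evenly balanced across its pods.
metadata = (
    ("mm-vmodel-id", "example-model"),   # hypothetical model name
    ("mm-balanced", "true"),
)

# stub.ModelInfer(request, metadata=metadata)   # actual stub/request elided
```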