I was just looking into this to understand how ping and SageMaker UpdateEndpoint would work together with MMS. As far as I can tell, the ping handler can already return a non-200 response (a 503, to be exact). The problem is how the running check is determined here. If you trace the code, you'd see that the check is essentially equivalent to the first WorkerThread.getState() == WorkerState.WORKER_MODEL_LOADED.
The problem is that BOTH the server thread (the main thread/process running model_service_worker.py) and the worker threads are implemented as WorkerThread. For the server thread's WorkerThread, its state is set to WORKER_MODEL_LOADED right when the WorkerThread starts and remains that way. This corresponds with the first log line you showed.
Since the model is considered loaded as soon as the first WorkerThread is running, rather than when all of them are, the ping handler quickly responds with 200 even if your model hasn't actually been loaded.
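
To make the difference concrete, here is a toy Python model of the check described above. It is not MMS code: the function names and the STILL_LOADING state are made up for illustration, and only WORKER_MODEL_LOADED and the 200/503 responses come from the discussion.

```python
from enum import Enum, auto


class WorkerState(Enum):
    STILL_LOADING = auto()        # hypothetical placeholder state, not taken from MMS
    WORKER_MODEL_LOADED = auto()  # state name mentioned in the comment above


def first_thread_check(states):
    """Mimic the described behaviour: only the first thread's state is inspected."""
    return bool(states) and states[0] is WorkerState.WORKER_MODEL_LOADED


def all_threads_check(states):
    """Stricter variant: every thread must report a loaded model."""
    return bool(states) and all(s is WorkerState.WORKER_MODEL_LOADED for s in states)


def ping_status(is_running):
    """Translate the running check into the HTTP status a ping would return."""
    return 200 if is_running else 503


# The server thread is marked as loaded immediately; two workers are still loading.
states = [
    WorkerState.WORKER_MODEL_LOADED,
    WorkerState.STILL_LOADING,
    WorkerState.STILL_LOADING,
]

print(ping_status(first_thread_check(states)))  # 200
print(ping_status(all_threads_check(states)))   # 503
```

With the first-thread check, the ping reports healthy while both workers are still loading. Requiring every thread to be loaded would keep it at 503 until loading finishes.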
I think the quickest fix on your end without a code change on MMS is to preload the model, in which case MMS will attempt to load the model when the Python backend, model_service_worker.py, initially starts.
Good morning,
I'm using the MMS server to host a single model in AWS SageMaker. The model is loaded on MMS startup with sagemaker_inference.model_server.start_model_server with a custom handler_service.
The problem is that loading the model can take a significant amount of time (up to 20 minutes). While the model is being loaded, the MMS server /ping endpoint returns 200 OK responses, causing the SageMaker endpoint to assume the model is already loaded.
Furthermore, I tested MMS locally to see how the server behaves in the model loading stage, and it looks like the model details endpoint returns the model worker status as READY before the worker has finished loading the model.
In the logs below you can see that workers start and immediately there's:
after that there's
and only finally:
In the meantime /ping returns 200 OK and localhost:8081/models/test returns:
All logs:
Questions:
How can the /ping endpoint be made to return something other than 200 OK while workers are still loading the models?
Thank you! :)
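
For anyone who wants to reproduce the local observation, the check described above can be sketched as a small polling script. This is only a sketch: fetch_status and watch are names made up here, and the localhost:8080 address for /ping in the usage comment is an assumption. Only localhost:8081/models/test comes from the report.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses such as 503 end up here
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, and similar failures


def watch(urls, rounds, interval=1.0, fetch=fetch_status, sleep=time.sleep):
    """Poll every URL `rounds` times and return (round, url, status) tuples."""
    observations = []
    for i in range(rounds):
        for url in urls:
            observations.append((i, url, fetch(url)))
        if i < rounds - 1:
            sleep(interval)  # wait between rounds, but not after the last one
    return observations


# Example usage (the 8080 port for /ping is an assumption):
# for row in watch(
#     ["http://localhost:8080/ping", "http://localhost:8081/models/test"],
#     rounds=60,
#     interval=10,
# ):
#     print(row)
```

Running it while the workers are still loading and comparing the recorded statuses with the worker log timestamps shows when each endpoint starts reporting success.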