Make healthcheck dependent on models loaded #1027

wojciechrauk-plutoflume · 2024-04-23T11:09:46Z

Good morning,

I'm using the MMS server to host a single model in AWS Sagemaker. The model is loaded on MMS startup with sagemaker_inference.model_server.start_model_server with a custom handler_service.

The problem is that loading the model can take significant amount of time (up to 20 minutes). In the time that model is being loaded MMS server /ping endpoint returns 200 OK responses, causing sagemaker endpoint to assume the model is already loaded.

Furthermore, I tested MMS locally to see how the server behaves in the model loading stage and it looks like the model details endpoint returns model worker status as READY before the worker is finished loading the model.

In the logs below you can see that workers start and immediately there's:

2024-04-23T11:01:33,313 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_MODEL_LOADED

after that there's

2024-04-23T11:01:33,317 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
...
2024-04-23T11:01:33,322 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED

and only finally:

2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
...
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED

In the meantime /ping returns 200 OK and localhost:8081/models/test returns:

{
  "modelName": "test",
  "modelUrl": "test.mar",
  "runtime": "python3",
  "minWorkers": 2,
  "maxWorkers": 2,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "loadedAtStartup": true,
  "workers": [
    {
      "id": "9000-57e7cf49",
      "startTime": "2024-04-23T11:01:33.316Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 13099008
    },
    {
      "id": "9000-be19b2af",
      "startTime": "2024-04-23T11:01:33.317Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 13434880
    }
  ]
}

All logs:


Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-04-23T11:01:33,126 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
MMS Home: /home/ec2-user/SageMaker/notebooks/environment/user/user-env-base-3-10/user-env-base-3-10/pex_venv/lib/python3.10/site-packages
Current directory: /tmp/models/test
Temp directory: /tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 978 M
Python executable: /home/ec2-user/SageMaker/notebooks/environment/user/user-env-base-3-10/user-env-base-3-10/pex_venv/bin/python3.10
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: /tmp/models/test/export
Initial Models: test.mar
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2024-04-23T11:01:33,133 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: test.mar preload_model: false
2024-04-23T11:01:33,189 [INFO ] main com.amazonaws.ml.mms.archive.ModelArchive - model folder already exists: f05f8ac8472b74dad6310273915eb6a5de447740
2024-04-23T11:01:33,217 [WARN ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-test
2024-04-23T11:01:33,309 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /tmp/.mms.sock.9000 --handler /tmp/models/test/model_handler.py:handle --model-path /tmp/models/f05f8ac8472b74dad6310273915eb6a5de447740 --model-name test --preload-model false --tmp-dir /tmp
2024-04-23T11:01:33,311 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /tmp/.mms.sock.9000
2024-04-23T11:01:33,311 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 30549
2024-04-23T11:01:33,312 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2024-04-23T11:01:33,313 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.10.12
2024-04-23T11:01:33,313 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_MODEL_LOADED
2024-04-23T11:01:33,314 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:106|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,314 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model test loaded.
2024-04-23T11:01:33,315 [DEBUG] main com.amazonaws.ml.mms.wlm.ModelManager - updateModel: test, count: 2
2024-04-23T11:01:33,317 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
2024-04-23T11:01:33,320 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-04-23T11:01:33,322 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
2024-04-23T11:01:33,334 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9000
2024-04-23T11:01:33,333 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9000
2024-04-23T11:01:33,415 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-04-23T11:01:33,416 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-04-23T11:01:33,426 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9000.
2024-04-23T11:01:33,429 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9000.
2024-04-23T11:01:33,441 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
Model server started.
2024-04-23T11:01:33,480 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:01:33,553 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model initialising, waiting for 2 minutes
2024-04-23T11:01:33,554 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model initialising, waiting for 2 minutes
2024-04-23T11:01:33,592 [INFO ] pool-3-thread-1 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,598 [INFO ] pool-3-thread-1 MMS_METRICS - DiskAvailable.Gigabytes:48.51470947265625|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,599 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUsage.Gigabytes:86.47355270385742|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryAvailable.Megabytes:2041.51171875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUsed.Megabytes:1586.08984375|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,602 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUtilization.Percent:47.8|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:02:33,459 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:02:33,528 [INFO ] pool-3-thread-1 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,529 [INFO ] pool-3-thread-1 MMS_METRICS - DiskAvailable.Gigabytes:48.51460266113281|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,529 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUsage.Gigabytes:86.47365951538086|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryAvailable.Megabytes:2014.546875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUsed.Megabytes:1613.0546875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,531 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUtilization.Percent:48.5|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:03:33,445 [WARN ] pool-3-thread-2 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:03:33,515 [INFO ] pool-3-thread-2 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,518 [INFO ] pool-3-thread-2 MMS_METRICS - DiskAvailable.Gigabytes:48.51456069946289|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,519 [INFO ] pool-3-thread-2 MMS_METRICS - DiskUsage.Gigabytes:86.47370147705078|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,520 [INFO ] pool-3-thread-2 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,521 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryAvailable.Megabytes:2015.05078125|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,523 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryUsed.Megabytes:1612.55078125|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,523 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryUtilization.Percent:48.4|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,653 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model finished waiting
2024-04-23T11:03:33,655 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model test loaded io_fd=0242b6fffe39090c-00007712-00000001-85c1e7725a6e3c25-be19b2af
2024-04-23T11:03:33,656 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model finished waiting
2024-04-23T11:03:33,655 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 120099
2024-04-23T11:03:33,656 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model test loaded io_fd=0242b6fffe39090c-00007712-00000002-b0c9e7725a6e3c25-57e7cf49
2024-04-23T11:03:33,655 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 120104
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-23T11:03:33,658 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:120341|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,658 [WARN ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-test-2
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-23T11:03:33,659 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:120343|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213

Questions:

Is that the intended behaviour?
Is it possible for /ping endpoint to return something else than 200 OK when workers are still loading the models?

thank you! :)

The text was updated successfully, but these errors were encountered:

btruhand · 2024-06-05T19:01:12Z

I was just looking into this to understand what how ping and Sagemaker UpdateEndpoint would work together with MMS. As far as I can tell, the ping handler can already return non-200 (a 503 to be exact). The problem is how the running check is determined here. If you trace the code, you'd see that the check is essentially equivalent to to the first WorkerThread.getState() == WorkerState.WORKER_MODEL_LOADED.

The problem is that BOTH the server thread (the main thread/process running model_service_worker.py) and worker threads are implemented as WorkerThread. For the server thread's WorkerThread, it's state is set to WORKER_MODEL_LOADED right when the WorkerThread starts and remains that way. This corresponds with the first log line you showed

Since the model is thought to be loaded when the first WorkerThread is running instead of all of them, the ping handler quickly responds with 200, even if your model hasn't actually been loaded

I think the quickest fix on your end without a code change on MMS is to preload the model, in which case the model will attempt to be loaded when the Python backend, model_service_worker.py, is initially started

Note: if you're using this in Sagemaker, Sagemaker docs says that /ping should return a 200 within 8 minutes after container startup

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make healthcheck dependent on models loaded #1027

Make healthcheck dependent on models loaded #1027

wojciechrauk-plutoflume commented Apr 23, 2024

btruhand commented Jun 5, 2024

Make healthcheck dependent on models loaded #1027

Make healthcheck dependent on models loaded #1027

Comments

wojciechrauk-plutoflume commented Apr 23, 2024

btruhand commented Jun 5, 2024