Hi there!
We're encountering an issue with MMS when deploying MXNet models. We initially thought it was related to the way we're packaging the model, but after some digging, it seems to be related to MMS with MXNet in CPU mode.
The errors we're seeing come from the metrics collector throwing exceptions, on hosts both with and without GPU devices. Steps to reproduce:
1. docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-cpu-py37-ubuntu16.04
2. docker run -ti --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 8828975689bb (replace 8828975689bb with your own image ID)
3. multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
And here is the full console session:
% docker run -ti --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 8828975689bb
root@eb4f03280c9c:/# multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
root@eb4f03280c9c:/# 2022-02-04T22:35:40,112 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /usr/local/lib/python3.7/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 1547 M
Python executable: /usr/local/bin/python3.7
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: N/A
Initial Models: squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-02-04T22:35:40,125 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar preload_model: false
2022-02-04T22:35:41,145 [WARN ] main com.amazonaws.ml.mms.ModelServer - Failed to load model: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
com.amazonaws.ml.mms.archive.DownloadModelException: Failed to download model from: https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar , code: 403
at com.amazonaws.ml.mms.archive.ModelArchive.download(ModelArchive.java:156) ~[model-server.jar:?]
at com.amazonaws.ml.mms.archive.ModelArchive.downloadModel(ModelArchive.java:72) ~[model-server.jar:?]
at com.amazonaws.ml.mms.wlm.ModelManager.registerModel(ModelManager.java:99) ~[model-server.jar:?]
at com.amazonaws.ml.mms.ModelServer.initModelStore(ModelServer.java:212) [model-server.jar:?]
at com.amazonaws.ml.mms.ModelServer.start(ModelServer.java:315) [model-server.jar:?]
at com.amazonaws.ml.mms.ModelServer.startAndWait(ModelServer.java:103) [model-server.jar:?]
at com.amazonaws.ml.mms.ModelServer.main(ModelServer.java:86) [model-server.jar:?]
2022-02-04T22:35:41,160 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-02-04T22:35:41,449 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-02-04T22:35:41,451 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-02-04T22:35:41,459 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
Model server started.
2022-02-04T22:35:41,477 [ERROR] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector -
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_292]
at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[?:1.8.0_292]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_292]
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_292]
at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_292]
at com.amazonaws.ml.mms.metrics.MetricCollector.run(MetricCollector.java:76) [model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
root@eb4f03280c9c:/#
#### After a while (1-2 minutes):
root@eb4f03280c9c:/# 2022-02-04T22:36:41,413 [ERROR] Thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - /usr/local/bin/python3.7: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
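As a side note, the 403 on the .mar download in the first session looks unrelated to the metrics errors; it can be checked independently of MMS from inside the same container (assuming curl is present in the image), e.g.:

```
# Probe the same URL that was passed to --models; a 403 here would confirm
# the download failure is independent of MMS itself.
curl -sI https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar | head -n 1
```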
Extra info:
I've tried MMS versions 1.1.7 and 1.1.8, with the same effect.
This happens only with CPU Docker containers. With a GPU-enabled container on a GPU host I cannot reproduce the error unless I forget to pass the --gpus all flag, in which case we get similar Java exceptions, but related to CUDA, which makes sense.
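For comparison, this is roughly how the GPU container is started on the GPU host (the image ID below is a placeholder; omitting --gpus all is what triggers the CUDA-related exceptions mentioned above):

```
# <gpu-image-id> is a placeholder; use the ID of your GPU-enabled mxnet-inference image.
docker run -ti --gpus all --entrypoint="/bin/bash" -p 60000:8080 -p 60001:8081 <gpu-image-id>
```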
The libpython3.7m.so.1.0 error hints to me that when the MMS worker sets up its Python execution environment, LD_LIBRARY_PATH is missing or wrongly set. You can reproduce this specific .so error by installing Python without setting LD_LIBRARY_PATH: for example, follow the steps in https://github.com/aws/deep-learning-containers/blob/master/mxnet/inference/docker/1.8/py3/Dockerfile.cpu up to line 91 and then try to run pip. At that point, setting LD_LIBRARY_PATH fixes it. So I tried manually setting LD_LIBRARY_PATH before running multi-model-server --start (...), but with no luck.
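For reference, this is roughly the check and workaround I attempted inside the container; the /usr/local/lib path is my assumption based on the Dockerfile linked above, so adjust it to wherever libpython3.7m.so.1.0 actually lives in your image:

```
# If the interpreter relies on LD_LIBRARY_PATH to find libpython, running it
# with a stripped environment should fail with the same
# "error while loading shared libraries: libpython3.7m.so.1.0" message.
ldd /usr/local/bin/python3.7 | grep libpython
env -i /usr/local/bin/python3.7 --version

# Locate the library, export LD_LIBRARY_PATH, and start MMS again
# (this is what I tried, without success).
find / -name 'libpython3.7m.so.1.0' 2>/dev/null
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
multi-model-server --start --models squeezenet=https://s3.amazonaws.com/model-server/model_archive_1.0/squeezenet_v1.1.mar
```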
I've managed to load our model and serve it successfully, even with the errors reported above appearing in the logs/stdout.
Any help understanding this would be appreciated, thanks!