Configurable startup time (#3262)
* configurable startup time, minimal working example

* remove startuptimeout from async worker for now before I confirm what model_load_timeout is

* doc updates

* remove extra spaces in model manager

* apply formatting

* remove worker command logging

* add tests for long startup timeout

* worker thread: log response timeout if worker state isn't worker_started

* add startupTimeout to registerWorkflow function

* add startupTimeout to the spellchecker wordlist

* working example

* small refactor

* small refactor

* added default value for model status

* Update ts_scripts/spellcheck_conf/wordlist.txt

* Fix java unit tests

* Fix regression test

* cast startup_timeout to int in test

---------

Co-authored-by: Matthias Reso <[email protected]>
Isalia20 and mreso committed Aug 12, 2024
1 parent 30eb13d commit ef196c0
Showing 32 changed files with 322 additions and 24 deletions.
2 changes: 2 additions & 0 deletions docs/configuration.md
@@ -245,6 +245,7 @@ A model's parameters are defined in [model source code](https://github.com/pytor
* `maxWorkers`: the maximum number of workers of a model
* `batchSize`: the batch size of a model
* `maxBatchDelay`: the maximum delay in msec of a batch of a model
* `startupTimeout`: the timeout in sec of a specific model's startup. This setting takes priority over `default_startup_timeout`, which is the default timeout for all models
* `responseTimeout`: the timeout in sec of a specific model's response. This setting takes priority over `default_response_timeout`, which is the default timeout for all models
* `defaultVersion`: the default version of a model
* `marName`: the mar file name of a model
@@ -295,6 +296,7 @@ Most of the following properties are designed for performance tuning. Adjusting
* `job_queue_size`: Number of inference jobs that the frontend will queue before the backend can serve them. Default: 100.
* `async_logging`: Enable asynchronous logging for higher throughput; log output may be delayed if this is enabled. Default: false.
* `default_response_timeout`: Timeout, in seconds, used for all models backend workers before they are deemed unresponsive and rebooted. Default: 120 seconds.
* `default_startup_timeout`: Specifies the maximum time, in seconds, allowed for model backend workers to initialize and become ready. If a worker fails to start within this timeframe, it is considered unresponsive and will be restarted. Default: 120 seconds.
* `unregister_model_timeout`: Timeout, in seconds, used when handling an unregister model request to clean up a process before it is deemed unresponsive and an error response is sent. Default: 120 seconds.
* `decode_input_request`: Configuration to let backend workers decode requests, when the content type is known.
If this is set to "true", backend workers do "Bytearray to JSON object" conversion when the content type is "application/json" and
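For example, a minimal `config.properties` sketch using the two properties documented above (the values are illustrative, not recommendations):

```properties
# Give all models' backend workers up to 10 minutes to load before
# they are deemed unresponsive and restarted.
default_startup_timeout=600
# Inference responses still use the separate response timeout.
default_response_timeout=120
```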
3 changes: 2 additions & 1 deletion docs/large_model_inference.md
@@ -233,7 +233,8 @@ To reduce model latency we recommend:
#### Tune [model config YAML file](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/model-archiver/README.md)

You can tune the model config YAML file to get better performance in the following ways:
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading or inference latency causes response timeout.
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model inference latency causes response timeout.
* Update the [startupTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading latency causes startup timeout.
* Tune the [torchrun parameters](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. This can be modified in the YAML file.
```yaml
#frontend settings
```
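The snippet above is truncated in this view; as a sketch, a model config YAML exercising the new setting might combine the frontend timeouts like this (values are illustrative, not defaults):

```yaml
# model-config.yaml (illustrative values)
minWorkers: 1
maxWorkers: 1
batchSize: 1
maxBatchDelay: 100
responseTimeout: 120   # seconds; guards inference latency
startupTimeout: 600    # seconds; guards model loading, takes priority over default_startup_timeout
```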
1 change: 1 addition & 0 deletions docs/management_api.md
@@ -39,6 +39,7 @@ To use this API after TorchServe starts, model API control has to be enabled. Ad
* `initial_workers` - the number of initial workers to create. The default value is `0`. TorchServe will not run inference until there is at least one worker assigned.
* `synchronous` - whether or not the creation of workers is synchronous. The default value is false. TorchServe will create new workers without waiting for acknowledgement that the previous worker is online.
* `response_timeout` - If the model's backend worker doesn't respond with an inference response within this timeout period, the worker will be deemed unresponsive and rebooted. The unit is seconds. The default value is 120 seconds.
* `startup_timeout` - If the model's backend worker doesn't load the model within this timeout period, the worker will be deemed unresponsive and rebooted. The unit is seconds. The default value is 120 seconds.

```bash
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"
```
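For a model that loads slowly, the same registration call can raise the startup budget via the new parameter; a hypothetical example (URL and value are illustrative):

```bash
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar&startup_timeout=600"
```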
@@ -21,6 +21,8 @@ public class ModelConfig {
private int maxBatchDelay;
/** the timeout in sec of a specific model's response. */
private int responseTimeout = 120; // unit: sec
/** the timeout in sec of a specific model's startup. */
private int startupTimeout = 120; // unit: sec
/**
* the device type where the model is loaded. It can be gpu, cpu. The model is loaded on CPU if
* deviceType: "cpu" is set on a GPU host.
@@ -122,6 +124,13 @@ public static ModelConfig build(Map<String, Object> yamlMap) {
logger.warn("Invalid responseTimeout: {}, should be integer", v);
}
break;
case "startupTimeout":
if (v instanceof Integer) {
modelConfig.setStartupTimeout((int) v);
} else {
logger.warn("Invalid startupTimeout: {}, should be integer", v);
}
break;
case "deviceType":
if (v instanceof String) {
modelConfig.setDeviceType((String) v);
@@ -319,6 +328,18 @@ public void setResponseTimeout(int responseTimeout) {
this.responseTimeout = responseTimeout;
}

public int getStartupTimeout() {
return startupTimeout;
}

public void setStartupTimeout(int startupTimeout) {
if (startupTimeout <= 0) {
logger.warn("Invalid startupTimeout:{}", startupTimeout);
return;
}
this.startupTimeout = startupTimeout;
}

public List<Integer> getDeviceIds() {
return deviceIds;
}
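Per the `build` logic above, only integer YAML values are accepted for the new key; anything else falls through to the warning and keeps the 120-second default, as does a non-positive value passed to `setStartupTimeout`. A sketch of the two YAML cases:

```yaml
startupTimeout: 600     # accepted: setStartupTimeout(600)
# startupTimeout: "600" # rejected: logs "Invalid startupTimeout: 600, should be integer"
```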
@@ -21,6 +21,7 @@ public void testValidYamlConfig() throws InvalidModelException, IOException {
Assert.assertEquals(modelConfig.getBatchSize(), 1);
Assert.assertEquals(modelConfig.getMaxBatchDelay(), 100);
Assert.assertEquals(modelConfig.getResponseTimeout(), 120);
Assert.assertEquals(modelConfig.getStartupTimeout(), 120);
Assert.assertEquals(modelConfig.getDeviceType(), ModelConfig.DeviceType.GPU);
Assert.assertEquals(modelConfig.getParallelLevel(), 4);
Assert.assertEquals(modelConfig.getParallelType(), ModelConfig.ParallelType.PP);
@@ -42,6 +43,7 @@ public void testInvalidYamlConfig() throws InvalidModelException, IOException {
Assert.assertEquals(modelConfig.getBatchSize(), 1);
Assert.assertEquals(modelConfig.getMaxBatchDelay(), 100);
Assert.assertEquals(modelConfig.getResponseTimeout(), 120);
Assert.assertEquals(modelConfig.getStartupTimeout(), 120);
Assert.assertNotEquals(modelConfig.getDeviceType(), ModelConfig.DeviceType.GPU);
Assert.assertEquals(modelConfig.getParallelLevel(), 0);
Assert.assertNotEquals(modelConfig.getParallelType(), ModelConfig.ParallelType.PPTP);
@@ -270,6 +270,7 @@ private void initModelStore() throws InvalidSnapshotException, IOException {
-1 * RegisterModelRequest.DEFAULT_BATCH_SIZE,
-1 * RegisterModelRequest.DEFAULT_MAX_BATCH_DELAY,
configManager.getDefaultResponseTimeout(),
configManager.getDefaultStartupTimeout(),
defaultModelName,
false,
false,
@@ -22,6 +22,7 @@ public class DescribeModelResponse {
private int batchSize;
private int maxBatchDelay;
private int responseTimeout;
private int startupTimeout;
private long maxRetryTimeoutInSec;
private long clientTimeoutInMills;
private String parallelType;
@@ -132,10 +133,18 @@ public int getResponseTimeout() {
return responseTimeout;
}

public int getStartupTimeout() {
return startupTimeout;
}

public void setResponseTimeout(int responseTimeout) {
this.responseTimeout = responseTimeout;
}

public void setStartupTimeout(int startupTimeout) {
this.startupTimeout = startupTimeout;
}

public long getMaxRetryTimeoutInSec() {
return maxRetryTimeoutInSec;
}
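With the field wired into `DescribeModelResponse`, the describe-model API response carries the startup timeout next to the response timeout. An abridged, illustrative JSON sketch, showing only fields visible in this diff:

```json
{
  "batchSize": 1,
  "maxBatchDelay": 100,
  "responseTimeout": 120,
  "startupTimeout": 120
}
```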
@@ -35,6 +35,9 @@ public class RegisterModelRequest {
@SerializedName("response_timeout")
private int responseTimeout;

@SerializedName("startup_timeout")
private int startupTimeout;

@SerializedName("url")
private String modelUrl;

@@ -56,6 +59,7 @@ public RegisterModelRequest(QueryStringDecoder decoder) {
ConfigManager.getInstance().getConfiguredDefaultWorkersPerModel());
synchronous = Boolean.parseBoolean(NettyUtils.getParameter(decoder, "synchronous", "true"));
responseTimeout = NettyUtils.getIntParameter(decoder, "response_timeout", -1);
startupTimeout = NettyUtils.getIntParameter(decoder, "startup_timeout", -1);
modelUrl = NettyUtils.getParameter(decoder, "url", null);
s3SseKms = Boolean.parseBoolean(NettyUtils.getParameter(decoder, "s3_sse_kms", "false"));
}
@@ -74,6 +78,7 @@ public RegisterModelRequest(org.pytorch.serve.grpc.management.RegisterModelReque
ConfigManager.getInstance().getConfiguredDefaultWorkersPerModel());
synchronous = request.getSynchronous();
responseTimeout = GRPCUtils.getRegisterParam(request.getResponseTimeout(), -1);
startupTimeout = GRPCUtils.getRegisterParam(request.getStartupTimeout(), -1);
modelUrl = GRPCUtils.getRegisterParam(request.getUrl(), null);
s3SseKms = request.getS3SseKms();
}
@@ -84,6 +89,7 @@ public RegisterModelRequest() {
synchronous = true;
initialWorkers = ConfigManager.getInstance().getConfiguredDefaultWorkersPerModel();
responseTimeout = -1;
startupTimeout = -1;
s3SseKms = false;
}

@@ -119,6 +125,10 @@ public Integer getResponseTimeout() {
return responseTimeout;
}

public Integer getStartupTimeout() {
return startupTimeout;
}

public String getModelUrl() {
return modelUrl;
}
@@ -294,6 +294,12 @@ private static Operation getRegisterOperation() {
"integer",
"2",
"Maximum time, in seconds, the TorchServe waits for a response from the model inference code, default: 120."));
operation.addParameter(
new QueryParameter(
"startup_timeout",
"integer",
"120",
"Maximum time, in seconds, the TorchServe waits for the model to startup/initialize, default: 120."));
operation.addParameter(
new QueryParameter(
"initial_workers",
@@ -122,10 +122,14 @@ public static StatusResponse registerModel(RegisterModelRequest registerModelReq
int maxBatchDelay = registerModelRequest.getMaxBatchDelay();
int initialWorkers = registerModelRequest.getInitialWorkers();
int responseTimeout = registerModelRequest.getResponseTimeout();
int startupTimeout = registerModelRequest.getStartupTimeout();
boolean s3SseKms = registerModelRequest.getS3SseKms();
if (responseTimeout == -1) {
responseTimeout = ConfigManager.getInstance().getDefaultResponseTimeout();
}
if (startupTimeout == -1) {
startupTimeout = ConfigManager.getInstance().getDefaultStartupTimeout();
}

Manifest.RuntimeType runtimeType = null;
if (runtime != null) {
@@ -144,6 +148,7 @@ public static StatusResponse registerModel(RegisterModelRequest registerModelReq
batchSize,
maxBatchDelay,
responseTimeout,
startupTimeout,
initialWorkers,
registerModelRequest.getSynchronous(),
false,
@@ -158,6 +163,7 @@ public static StatusResponse handleRegister(
int batchSize,
int maxBatchDelay,
int responseTimeout,
int startupTimeout,
int initialWorkers,
boolean isSync,
boolean isWorkflowModel,
@@ -177,6 +183,7 @@ public static StatusResponse handleRegister(
batchSize,
maxBatchDelay,
responseTimeout,
startupTimeout,
null,
false,
isWorkflowModel,
@@ -403,6 +410,7 @@ private static DescribeModelResponse createModelResponse(
resp.setModelVersion(manifest.getModel().getModelVersion());
resp.setRuntime(manifest.getRuntime().getValue());
resp.setResponseTimeout(model.getResponseTimeout());
resp.setStartupTimeout(model.getStartupTimeout());
resp.setMaxRetryTimeoutInSec(model.getMaxRetryTimeoutInMill() / 1000);
resp.setClientTimeoutInMills(model.getClientTimeoutInMills());
resp.setParallelType(model.getParallelType().getParallelType());
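In short, `-1` serves as the "unset" sentinel throughout registration: if `startup_timeout` is omitted from the request, `registerModel` swaps in `default_startup_timeout` from the `ConfigManager` before calling `handleRegister`, so an explicit per-model value always wins over the global default.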
@@ -65,6 +65,7 @@ public final class ConfigManager {
private static final String TS_BLACKLIST_ENV_VARS = "blacklist_env_vars";
private static final String TS_DEFAULT_WORKERS_PER_MODEL = "default_workers_per_model";
private static final String TS_DEFAULT_RESPONSE_TIMEOUT = "default_response_timeout";
private static final String TS_DEFAULT_STARTUP_TIMEOUT = "default_startup_timeout";
private static final String TS_UNREGISTER_MODEL_TIMEOUT = "unregister_model_timeout";
private static final String TS_NUMBER_OF_NETTY_THREADS = "number_of_netty_threads";
private static final String TS_NETTY_CLIENT_THREADS = "netty_client_threads";
@@ -879,6 +880,10 @@ public int getDefaultResponseTimeout() {
return Integer.parseInt(prop.getProperty(TS_DEFAULT_RESPONSE_TIMEOUT, "120"));
}

public int getDefaultStartupTimeout() {
return Integer.parseInt(prop.getProperty(TS_DEFAULT_STARTUP_TIMEOUT, "120"));
}

public int getUnregisterModelTimeout() {
return Integer.parseInt(prop.getProperty(TS_UNREGISTER_MODEL_TIMEOUT, "120"));
}
12 changes: 12 additions & 0 deletions frontend/server/src/main/java/org/pytorch/serve/wlm/Model.java
Original file line number Diff line number Diff line change
@@ -32,6 +32,7 @@ public class Model {
public static final String BATCH_SIZE = "batchSize";
public static final String MAX_BATCH_DELAY = "maxBatchDelay";
public static final String RESPONSE_TIMEOUT = "responseTimeout";
public static final String STARTUP_TIMEOUT = "startupTimeout";
public static final String PARALLEL_LEVEL = "parallelLevel";
public static final String DEFAULT_VERSION = "defaultVersion";
public static final String MAR_NAME = "marName";
@@ -57,6 +58,7 @@ public class Model {
private ReentrantLock lock;
private ReentrantLock jobGroupLock;
private int responseTimeout;
private int startupTimeout;
private long sequenceMaxIdleMSec;
private long sequenceTimeoutMSec;
private int maxNumSequence;
@@ -178,6 +180,7 @@ public JsonObject getModelState(boolean isDefaultVersion) {
modelInfo.addProperty(BATCH_SIZE, getBatchSize());
modelInfo.addProperty(MAX_BATCH_DELAY, getMaxBatchDelay());
modelInfo.addProperty(RESPONSE_TIMEOUT, getResponseTimeout());
modelInfo.addProperty(STARTUP_TIMEOUT, getStartupTimeout());
modelInfo.addProperty(RUNTIME_TYPE, getRuntimeType().getValue());
if (parallelLevel > 0) {
modelInfo.addProperty(PARALLEL_LEVEL, parallelLevel);
@@ -191,6 +194,7 @@ public void setModelState(JsonObject modelInfo) {
maxWorkers = modelInfo.get(MAX_WORKERS).getAsInt();
maxBatchDelay = modelInfo.get(MAX_BATCH_DELAY).getAsInt();
responseTimeout = modelInfo.get(RESPONSE_TIMEOUT).getAsInt();
startupTimeout = modelInfo.get(STARTUP_TIMEOUT).getAsInt();
batchSize = modelInfo.get(BATCH_SIZE).getAsInt();

JsonElement runtime = modelInfo.get(RUNTIME_TYPE);
@@ -537,10 +541,18 @@ public int getResponseTimeout() {
return ConfigManager.getInstance().isDebug() ? Integer.MAX_VALUE : responseTimeout;
}

public int getStartupTimeout() {
return ConfigManager.getInstance().isDebug() ? Integer.MAX_VALUE : startupTimeout;
}

public void setResponseTimeout(int responseTimeout) {
this.responseTimeout = responseTimeout;
}

public void setStartupTimeout(int startupTimeout) {
this.startupTimeout = startupTimeout;
}

public List<Integer> getDeviceIds() {
return this.deviceIds;
}
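Note: as the two getters above show, when TorchServe runs in debug mode both `responseTimeout` and `startupTimeout` are effectively disabled (`Integer.MAX_VALUE` is returned), so workers are never restarted for exceeding them.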