Auto detect backend metrics not defined in metrics configuration (#2769)
* Auto detect backend metrics not defined in metrics configuration

* Add tests for metric auto detection

* Update custom metrics example to show metrics auto detection

* Update metrics documentation

* Fix unit test failure

* Set request ID for load model requests

* Update metrics integration test to use custom metrics example

* Make auto updates to frontend metrics cache thread safe

* Ensure auto detect metrics is always called on metric logs that have type information

* Fix auto detect backend metrics unit test

* Update documentation about performance impact of metrics auto detection

* Fix linter error

* Disable metrics auto detection by default

* Fix integration tests

* Update documentation and custom metrics example about metrics auto detection

* Update metrics auto detection documentation

* Add helper functions to get dimension names and values from Metric object

* Fix java formatting

* Move request id assignment for model load requests from model_service_worker.py to model_loader.py
namannandan committed Nov 21, 2023
1 parent 796808f commit 98c19bc
Showing 18 changed files with 573 additions and 111 deletions.
42 changes: 23 additions & 19 deletions docs/metrics.md
@@ -36,16 +36,25 @@ The location of log files and metric files can be configured in the [log4j2.xml]

**Prometheus Mode**

In `prometheus` mode, metrics defined in the metrics configuration file are made available in prometheus format via the [metrics API endpoint](metrics_api.md).
In `prometheus` mode, metrics are made available in Prometheus format via the [metrics API endpoint](metrics_api.md).

## Getting Started with TorchServe Metrics

TorchServe defines metrics in a [yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, including both frontend metrics (i.e. `ts_metrics`) and backend metrics (i.e. `model_metrics`).
TorchServe defines metrics configuration in a [yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, including both frontend metrics (i.e. `ts_metrics`) and backend metrics (i.e. `model_metrics`).
When TorchServe is started, the metrics definition is loaded in the frontend and backend cache separately.
The backend emits metrics logs as they are updated. The frontend parses these logs and makes the corresponding metrics available either as logs or via the [metrics API endpoint](metrics_api.md) based on the metrics_mode configuration.
The backend emits metrics logs as they are updated. The frontend parses these logs and makes the corresponding metrics available either as logs or via the [metrics API endpoint](metrics_api.md) based on the `metrics_mode` configuration.
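
For reference, a backend metric log line has roughly the following shape (a sketch inferred from the frontend log-parsing pattern updated later in this commit; the metric name, dimension values, hostname, timestamp, and request id are illustrative). The `|#type:` segment carries the type information that auto-detection relies on:

```
HandlerMethodTime.Milliseconds:25.3|#ModelName:mnist,Level:Model|#type:GAUGE|#hostname:example-host,1700000000,example-request-id
```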


Dynamic updates to the metrics configuration file is currently not supported. In order to account for updates made to the metrics configuration file, Torchserve will need to be restarted.
Dynamic updates to the metrics configuration file are not supported. To account for updates made to the metrics configuration file, TorchServe will need to be restarted.

By default, metrics that are not defined in the metrics configuration file are neither logged in the metrics log files nor made available via the Prometheus metrics API endpoint.
Backend model metrics can be auto-detected and registered in the frontend by setting `model_metrics_auto_detect` to `true` in `config.properties`
or by using the `TS_MODEL_METRICS_AUTO_DETECT` environment variable. By default, `model_metrics_auto_detect` is disabled.
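
A minimal sketch of enabling auto-detection in `config.properties` (assuming the rest of the file is already set up for your deployment):

```properties
# Enable auto-detection of backend metrics that are missing from the metrics config file
model_metrics_auto_detect=true
```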

`Warning: Auto-detection of backend metrics has a performance impact in the form of latency overhead, typically at model load and first inference for a given model.
This cold-start behavior occurs because new metrics are typically emitted by the backend during model load and first inference, which is when they are detected and registered by the frontend.
Subsequent inferences can also see a performance impact when a new metric is updated for the first time.
For use cases where multiple models are loaded and unloaded often, the latency overhead can be mitigated by specifying known metrics in the metrics configuration file ahead of time.`


The `metrics.yaml` is formatted with Prometheus metric type terminology:
@@ -87,9 +96,6 @@ model_metrics: # backend metrics
dimensions: [*model_name, *level]
```
Note that **only** the metrics defined in the **metrics configuration file** can be emitted to model_metrics.log or made available via the [metrics API endpoint](metrics_api.md). This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit.
Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can delete or ignore them altogether, because these metrics will not be emitted unless they are updated.\
When adding custom `model_metrics` in the metrics configuration file, be sure to include the `ModelName` and `Level` dimension names towards the end of the list of dimensions, since they are included by default by the following custom metrics APIs:
[add_metric](#function-api-to-add-generic-metrics-with-default-dimensions), [add_counter](#add-counter-based-metrics),
@@ -175,7 +181,7 @@ Metrics collected include:

### Metric Types Enum

TorchServe Metrics is introducing [Metric Types](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_type_enum.py)
TorchServe Metrics use [Metric Types](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_type_enum.py)
that are in line with the [Prometheus API](https://github.com/prometheus/client_python) metric types.

Metric types are an attribute of Metric objects.
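
As a small illustrative sketch, the enum members mirror the Prometheus type names:

```python
from ts.metrics.metric_type_enum import MetricTypes

# The available metric types, in line with Prometheus terminology
for metric_type in (MetricTypes.COUNTER, MetricTypes.GAUGE, MetricTypes.HISTOGRAM):
    print(metric_type)
```
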
@@ -268,14 +274,15 @@ All metrics are collected within the context.

### Specifying Metric Types

When adding any metric via Metrics API, users have the ability to override the metric type by specifying the positional argument
When adding any metric via the Metrics API, users can override the default metric type by specifying the keyword argument
`metric_type=MetricTypes.[COUNTER/GAUGE/HISTOGRAM]`.

```python
metrics.add_metric("GenericMetric", value, unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE)
example_metric = metrics.add_metric_to_cache(name="ExampleMetric", unit="ms", dimension_names=["name1", "name2"], metric_type=MetricTypes.GAUGE)
example_metric.add_or_update(value=1, dimension_values=["value1", "value2"])
# Backwards compatible, combines the above two method calls
metrics.add_counter("CounterMetric", value=1, dimensions=[Dimension("name", "value"), ...])
metrics.add_metric(name="ExampleMetric", value=1, unit="ms", dimensions=[Dimension("name1", "value1"), Dimension("name2", "value2")], metric_type=MetricTypes.GAUGE)
```


@@ -302,14 +309,12 @@ given some criteria:
3. Dimensions should be the same (as well as the same order!)
1. All dimensions have to match, and Metric objects that have been parsed from the yaml file also have dimension names that are parsed from the yaml file
1. Users can [create their own](#create-dimension-objects) `Dimension` objects to match those in the yaml file dimensions
2. if the Metric object has `ModelName` and `Level` dimensions only, it is optional to specify additional dimensions since these are considered [default dimensions](#default-dimensions), so: `add_counter('InferenceTimeInMS', value=2)` or `add_counter('InferenceTimeInMS', value=2, dimensions=["ModelName", "Level"])`
2. If the Metric object has `ModelName` and `Level` dimensions only, it is optional to specify additional dimensions since these are considered [default dimensions](#default-dimensions), so: `add_counter('InferenceTimeInMS', value=2)` or `add_counter('InferenceTimeInMS', value=2, dimensions=["ModelName", "Level"])`


### Default dimensions

Metrics will have a couple of default dimensions if not already specified.

If the metric is a type `Gauge`, `Histogram`, `Counter`, by default it will have:
Metrics will have a couple of default dimensions if not already specified (see the sketch below):
* `ModelName,{name_of_model}`
* `Level,Model`
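
As an illustrative sketch, the following two calls are therefore equivalent for a metric defined with only the default dimensions (assuming a handler `context` and the `metrics` cache are in scope):

```python
from ts.metrics.dimension import Dimension

# Relies on "ModelName" and "Level" being filled in automatically
metrics.add_counter("InferenceTimeInMS", value=2)

# Spells out the same default dimensions explicitly
metrics.add_counter(
    "InferenceTimeInMS",
    value=2,
    dimensions=[
        Dimension(name="ModelName", value=context.model_name),
        Dimension(name="Level", value="Model"),
    ],
)
```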

@@ -555,7 +560,6 @@ Function API
metric_type: MetricTypes
type for defining different operations, defaulted to gauge metric type for Percent metrics
"""
```

**Inferred unit**: `percent`
@@ -599,7 +603,7 @@ Function API

### Getting a metric

Users can get a metric from the cache. The Metric object is returned, so the user can access the methods of the Metric: (i.e. `Metric.update(value)`, `Metric.__str__`)
Users can get a metric from the cache. The `CachingMetric` object is returned, so the user can access its methods (e.g. `CachingMetric.add_or_update(value, dimension_values)`, `CachingMetric.update(value, dimensions)`).
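
For example, a usage sketch (assuming a gauge named `ExampleMetric` is already present in the cache; the dimension values are illustrative):

```python
from ts.metrics.metric_type_enum import MetricTypes

# Look up a previously registered metric and update it
example_metric = metrics.get_metric(
    metric_name="ExampleMetric", metric_type=MetricTypes.GAUGE
)
example_metric.add_or_update(value=1, dimension_values=["value1", "value2"])
```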

```python
def get_metric(self, metric_name: str, metric_type: MetricTypes) -> Metric:
@@ -723,12 +727,12 @@ class CustomHandlerExample:
```
- **[v0.6.1 - v0.8.1] to [> v0.8.1]**\
Replace the call to `add_metric` with `add_metric_to_cache`.
2. Starting [v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0), only metrics that are defined in the metrics config file(default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
2. In versions [[v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0) - [v0.9.0](https://github.com/pytorch/serve/releases/tag/v0.9.0)], only metrics that are defined in the metrics config file (default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
are either all logged to `ts_metrics.log` and `model_metrics.log` or made available via the [metrics API endpoint](metrics_api.md)
based on the `metrics_mode` configuration as described [above](#introduction).\
The default `metrics_mode` is `log` mode.\
This is unlike in previous versions where all metrics were only logged to `ts_metrics.log` and `model_metrics.log` except for `ts_inference_requests_total`, `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds`
which were only available via the metrics API endpoint.\
**Upgrade paths**:
- **[< v0.8.0] to [>= v0.8.0]**\
- **[< v0.8.0] to [v0.8.0 - v0.9.0]**\
Specify all the custom metrics added to the custom handler in the metrics configuration file as shown [above](#getting-started-with-torchserve-metrics).
7 changes: 4 additions & 3 deletions examples/custom_metrics/README.md
@@ -18,9 +18,10 @@ Run the commands given in the following steps from the root directory of the repository
- HandlerMethodTime
- ExamplePercentMetric

The custom metrics configuration file `metrics.yaml` in this example builds on top of the [default metrics configuration file](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) to include the custom metrics listed above.
The `config.properties` file in this example configures torchserve to use the custom metrics configuration file and sets the `metrics_mode` to `prometheus`. The custom handler
`mnist_handler.py` updates the metrics listed above.
The custom metrics configuration file [metrics.yaml](metrics.yaml) in this example builds on top of the [default metrics configuration file](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) to include the custom metrics listed above.
Note that `HandlerMethodTime` and `ExamplePercentMetric` are not defined in the [metrics configuration file](metrics.yaml), to demonstrate auto-detection of backend metrics.
The [config.properties](config.properties) file in this example configures TorchServe to use the custom metrics configuration file, sets the `metrics_mode` to `prometheus`, and enables `model_metrics_auto_detect`. The custom handler
[mnist_handler.py](mnist_handler.py) updates the metrics listed above.
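
Once the example is running with `metrics_mode=prometheus`, the auto-detected metrics should show up in the output of the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md), along these lines (illustrative output, not captured verbatim):

```
# TYPE HandlerMethodTime gauge
HandlerMethodTime{MethodName="preprocess",ModelName="mnist",Level="Model",Hostname="example-host",} 25.3
```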

Refer: [Custom Metrics](https://github.com/pytorch/serve/blob/master/docs/metrics.md#custom-metrics-api)\
Refer: [Custom Handler](https://github.com/pytorch/serve/blob/master/docs/custom_service.md#custom-handlers)
1 change: 1 addition & 0 deletions examples/custom_metrics/config.properties
@@ -1,5 +1,6 @@
metrics_mode=prometheus
metrics_config=examples/custom_metrics/metrics.yaml
model_metrics_auto_detect=true
models={\
"mnist": {\
"1.0": {\
8 changes: 1 addition & 7 deletions examples/custom_metrics/metrics.yaml
@@ -68,6 +68,7 @@ ts_metrics:

model_metrics:
# Dimension "Hostname" is automatically added for model metrics in the backend
# "HandlerMethodTime" and "ExamplePercentMetric" metrics are not defined here to show auto-detection of backend metrics
counter:
- name: InferenceRequestCount
unit: count
@@ -94,10 +95,3 @@ model_metrics:
- name: SizeOfImage
unit: kB
dimensions: [*model_name, *level]
- name: HandlerMethodTime
unit: ms
dimensions: ["MethodName", *model_name, *level]
histogram:
- name: ExamplePercentMetric
unit: percent
dimensions: [*model_name, *level]
6 changes: 2 additions & 4 deletions examples/custom_metrics/mnist_handler.py
@@ -41,10 +41,6 @@ def initialize(self, context):
name="InitializeCallCount",
value=1,
unit="count",
dimensions=[
Dimension(name="ModelName", value=context.model_name),
Dimension(name="Level", value="Model"),
],
metric_type=MetricTypes.COUNTER,
)

@@ -95,6 +91,7 @@ def preprocess(self, data):

# "add_time" will register the metric if not already present in metric cache,
# include the "ModelName" and "Level" dimensions by default and emit it
# Note: "HandlerMethodTime" is not defined in "metrics.yaml" and will be auto-detected
metrics.add_time(
name="HandlerMethodTime",
value=(preprocess_stop - preprocess_start) * 1000,
@@ -122,6 +119,7 @@ def postprocess(self, data):
)
# "add_percent" will register the metric if not already present in metric cache,
# include the "ModelName" and "Level" dimensions by default and emit it
# Note: "ExamplePercentMetric" is not defined in "metrics.yaml" and will be auto-detected
self.context.metrics.add_percent(
name="ExamplePercentMetric",
value=50,
frontend/server/src/main/java/org/pytorch/serve/metrics/Metric.java
@@ -12,7 +12,7 @@ public class Metric {

private static final Pattern PATTERN =
Pattern.compile(
"\\s*([\\w\\s]+)\\.([\\w\\s]+):([0-9\\-,.e]+)\\|#([^|]*)\\|#hostname:([^,]+),([^,]+)(,(.*))?");
"\\s*([\\w\\s]+)\\.([\\w\\s]+):([0-9\\-,.e]+)\\|#([^|]*)(\\|#type:([^|,]+))?\\|#hostname:([^,]+),([^,]+)(,(.*))?");

@SerializedName("MetricName")
private String metricName;
@@ -23,9 +23,18 @@ public class Metric {
@SerializedName("Unit")
private String unit;

@SerializedName("Type")
private String type;

@SerializedName("Dimensions")
private List<Dimension> dimensions;

@SerializedName("DimensionNames")
private List<String> dimensionNames;

@SerializedName("DimensionValues")
private List<String> dimensionValues;

@SerializedName("Timestamp")
private String timestamp;

@@ -41,13 +50,15 @@ public Metric(
String metricName,
String value,
String unit,
String type,
String hostName,
Dimension... dimensions) {
this.metricName = metricName;
this.value = value;
this.unit = unit;
this.type = type;
this.hostName = hostName;
this.dimensions = Arrays.asList(dimensions);
this.setDimensions(Arrays.asList(dimensions));
this.timestamp =
String.valueOf(TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
}
@@ -92,12 +103,34 @@ public void setUnit(String unit) {
this.unit = unit;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public List<Dimension> getDimensions() {
return dimensions;
}

public List<String> getDimensionNames() {
return this.dimensionNames;
}

public List<String> getDimensionValues() {
return this.dimensionValues;
}

public void setDimensions(List<Dimension> dimensions) {
this.dimensions = dimensions;
this.dimensionNames = new ArrayList<String>();
this.dimensionValues = new ArrayList<String>();
for (Dimension dimension : dimensions) {
this.dimensionNames.add(dimension.getName());
this.dimensionValues.add(dimension.getValue());
}
}

public String getTimestamp() {
@@ -120,9 +153,10 @@ public static Metric parse(String line) {
metric.setUnit(matcher.group(2));
metric.setValue(matcher.group(3));
String dimensions = matcher.group(4);
metric.setHostName(matcher.group(5));
metric.setTimestamp(matcher.group(6));
metric.setRequestId(matcher.group(8));
metric.setType(matcher.group(6));
metric.setHostName(matcher.group(7));
metric.setTimestamp(matcher.group(8));
metric.setRequestId(matcher.group(10));

if (dimensions != null) {
String[] dimension = dimensions.split(",");
frontend/server/src/main/java/org/pytorch/serve/metrics/MetricCache.java
@@ -29,11 +29,7 @@ private MetricCache() throws FileNotFoundException {
return;
}

MetricBuilder.MetricMode metricsMode = MetricBuilder.MetricMode.LOG;
String metricsConfigMode = ConfigManager.getInstance().getMetricsMode();
if (metricsConfigMode != null && metricsConfigMode.toLowerCase().contains("prometheus")) {
metricsMode = MetricBuilder.MetricMode.PROMETHEUS;
}
MetricBuilder.MetricMode metricsMode = ConfigManager.getInstance().getMetricsMode();

if (this.config.getTs_metrics() != null) {
addMetrics(
@@ -106,6 +102,24 @@ public static MetricCache getInstance() {
return instance;
}

public IMetric addAutoDetectMetricBackend(Metric parsedMetric) {
// The Hostname dimension is included by default for backend metrics
List<String> dimensionNames = parsedMetric.getDimensionNames();
dimensionNames.add("Hostname");

IMetric metric =
MetricBuilder.build(
ConfigManager.getInstance().getMetricsMode(),
MetricBuilder.MetricType.valueOf(parsedMetric.getType()),
parsedMetric.getMetricName(),
parsedMetric.getUnit(),
dimensionNames);

this.metricsBackend.putIfAbsent(parsedMetric.getMetricName(), metric);

return metric;
}

public IMetric getMetricFrontend(String metricName) {
return metricsFrontend.get(metricName);
}
@@ -81,12 +81,10 @@ public void run() {
} else {
if (this.metricCache.getMetricFrontend(metric.getMetricName()) != null) {
try {
List<String> dimensionValues = new ArrayList<String>();
for (Dimension dimension : metric.getDimensions()) {
dimensionValues.add(dimension.getValue());
}
// Frontend metrics by default have the last dimension as Hostname
List<String> dimensionValues = metric.getDimensionValues();
dimensionValues.add(metric.getHostName());

this.metricCache
.getMetricFrontend(metric.getMetricName())
.addOrUpdate(
frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java
@@ -43,6 +43,7 @@
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.io.IOUtils;
import org.pytorch.serve.metrics.MetricBuilder;
import org.pytorch.serve.servingsdk.snapshot.SnapshotSerializer;
import org.pytorch.serve.snapshot.SnapshotSerializerFactory;
import org.slf4j.Logger;
@@ -68,6 +69,7 @@ public final class ConfigManager {
private static final String TS_NUMBER_OF_GPU = "number_of_gpu";
private static final String TS_METRICS_CONFIG = "metrics_config";
private static final String TS_METRICS_MODE = "metrics_mode";
private static final String TS_MODEL_METRICS_AUTO_DETECT = "model_metrics_auto_detect";
private static final String TS_DISABLE_SYSTEM_METRICS = "disable_system_metrics";

// IPEX config option that can be set at config.properties
@@ -412,8 +414,23 @@ public String getTorchRunLogDir() {
return torchrunLogDir;
}

public String getMetricsMode() {
return getProperty(TS_METRICS_MODE, "log");
public MetricBuilder.MetricMode getMetricsMode() {
String metricsMode = getProperty(TS_METRICS_MODE, "log");
try {
return MetricBuilder.MetricMode.valueOf(
metricsMode.replaceAll("\\s", "").toUpperCase());
} catch (IllegalArgumentException | NullPointerException e) {
logger.error(
"Configured metrics mode \"{}\" not supported. Defaulting to \"{}\" mode: {}",
metricsMode,
MetricBuilder.MetricMode.LOG,
e);
return MetricBuilder.MetricMode.LOG;
}
}

public boolean isModelMetricsAutoDetectEnabled() {
return Boolean.parseBoolean(getProperty(TS_MODEL_METRICS_AUTO_DETECT, "false"));
}

public boolean isSystemMetricsDisabled() {