Auto detect backend metrics not defined in metrics configuration (#2769)
* Auto detect backend metrics not defined in metrics configuration

* Add tests for metric auto detection

* Update custom metrics example to show metrics auto detection

* Update metrics documentation

* Fix unit test failure

* Set request ID for load model requests

* Update metrics integration test to use custom metrics example

* Make auto updates to frontend metrics cache thread safe

* Ensure auto detect metrics is always called on metric logs that have type information

* Fix auto detect backend metrics unit test

* Update documentation about performance impact of metrics auto detection

* Fix linter error

* Disable metrics auto detection by default

* Fix integration tests

* Update documentation and custom metrics example about metrics auto detection

* Update metrics auto detection documentation

* Add helper functions to get dimension names and values from Metric object

* Fix java formatting

* Move request id assignment for model load requests from model_service_worker.py to model_loader.py
namannandan committed Nov 21, 2023
1 parent 796808f commit 98c19bc
Showing 18 changed files with 573 additions and 111 deletions.
42 changes: 23 additions & 19 deletions docs/metrics.md
@@ -36,16 +36,25 @@ The location of log files and metric files can be configured in the [log4j2.xml]

**Prometheus Mode**

In `prometheus` mode, metrics defined in the metrics configuration file are made available in prometheus format via the [metrics API endpoint](metrics_api.md).
In `prometheus` mode, metrics are made available in Prometheus format via the [metrics API endpoint](metrics_api.md).

## Getting Started with TorchServe Metrics

TorchServe defines metrics in a [yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, including both frontend metrics (i.e. `ts_metrics`) and backend metrics (i.e. `model_metrics`).
TorchServe defines metrics configuration in a [yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, including both frontend metrics (i.e. `ts_metrics`) and backend metrics (i.e. `model_metrics`).
When TorchServe is started, the metrics definition is loaded in the frontend and backend cache separately.
The backend emits metrics logs as they are updated. The frontend parses these logs and makes the corresponding metrics available either as logs or via the [metrics API endpoint](metrics_api.md) based on the metrics_mode configuration.
The backend emits metrics logs as they are updated. The frontend parses these logs and makes the corresponding metrics available either as logs or via the [metrics API endpoint](metrics_api.md) based on the `metrics_mode` configuration.
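
For reference, a backend metric log line has roughly the following shape (a sketch inferred from the frontend log-parsing pattern updated later in this commit; the metric name, dimension values, hostname, timestamp, and request id are illustrative). The `|#type:` segment carries the type information that auto-detection relies on:

```
HandlerMethodTime.Milliseconds:25.3|#ModelName:mnist,Level:Model|#type:GAUGE|#hostname:example-host,1700000000,example-request-id
```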


Dynamic updates to the metrics configuration file is currently not supported. In order to account for updates made to the metrics configuration file, Torchserve will need to be restarted.
Dynamic updates to the metrics configuration file are not supported. To account for updates made to the metrics configuration file, TorchServe will need to be restarted.

By default, metrics that are not defined in the metrics configuration file are neither logged in the metrics log files nor made available via the Prometheus metrics API endpoint.
Backend model metrics can be auto-detected and registered in the frontend by setting `model_metrics_auto_detect` to `true` in `config.properties`
or by using the `TS_MODEL_METRICS_AUTO_DETECT` environment variable. By default, `model_metrics_auto_detect` is disabled.
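
A minimal sketch of enabling auto-detection in `config.properties` (assuming the rest of the file is already set up for your deployment):

```properties
# Enable auto-detection of backend metrics that are missing from the metrics config file
model_metrics_auto_detect=true
```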

`Warning: Auto-detection of backend metrics has a performance impact in the form of latency overhead, typically at model load and first inference for a given model.
This cold-start behavior occurs because new metrics are typically emitted by the backend during model load and first inference, which is when they are detected and registered by the frontend.
Subsequent inferences can also see a performance impact when a new metric is updated for the first time.
For use cases where multiple models are loaded and unloaded often, the latency overhead can be mitigated by specifying known metrics in the metrics configuration file ahead of time.`


The `metrics.yaml` is formatted with Prometheus metric type terminology:
@@ -87,9 +96,6 @@ model_metrics: # backend metrics
dimensions: [*model_name, *level]
```
Note that **only** the metrics defined in the **metrics configuration file** can be emitted to model_metrics.log or made available via the [metrics API endpoint](metrics_api.md). This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit.
Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can delete or ignore them altogether, because these metrics will not be emitted unless they are updated.\
When adding custom `model_metrics` in the metrics configuration file, be sure to include the `ModelName` and `Level` dimension names towards the end of the list of dimensions, since they are included by default by the following custom metrics APIs:
[add_metric](#function-api-to-add-generic-metrics-with-default-dimensions), [add_counter](#add-counter-based-metrics),
@@ -175,7 +181,7 @@ Metrics collected include:

### Metric Types Enum

TorchServe Metrics is introducing [Metric Types](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_type_enum.py)
TorchServe Metrics use [Metric Types](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_type_enum.py)
that are in line with the [Prometheus API](https://github.com/prometheus/client_python) metric types.

Metric types are an attribute of Metric objects.
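
As a small illustrative sketch, the enum members mirror the Prometheus type names:

```python
from ts.metrics.metric_type_enum import MetricTypes

# The available metric types, in line with Prometheus terminology
for metric_type in (MetricTypes.COUNTER, MetricTypes.GAUGE, MetricTypes.HISTOGRAM):
    print(metric_type)
```
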
@@ -268,14 +274,15 @@ All metrics are collected within the context.

### Specifying Metric Types

When adding any metric via Metrics API, users have the ability to override the metric type by specifying the positional argument
When adding any metric via the Metrics API, users can override the default metric type by specifying the keyword argument
`metric_type=MetricTypes.[COUNTER/GAUGE/HISTOGRAM]`.

```python
metrics.add_metric("GenericMetric", value, unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE)
example_metric = metrics.add_metric_to_cache(name="ExampleMetric", unit="ms", dimension_names=["name1", "name2"], metric_type=MetricTypes.GAUGE)
example_metric.add_or_update(value=1, dimension_values=["value1", "value2"])
# Backwards compatible, combines the above two method calls
metrics.add_counter("CounterMetric", value=1, dimensions=[Dimension("name", "value"), ...])
metrics.add_metric(name="ExampleMetric", value=1, unit="ms", dimensions=[Dimension("name1", "value1"), Dimension("name2", "value2")], metric_type=MetricTypes.GAUGE)
```


@@ -302,14 +309,12 @@ given some criteria:
3. Dimensions should be the same (as well as the same order!)
1. All dimensions have to match, and Metric objects that have been parsed from the yaml file also have dimension names that are parsed from the yaml file
1. Users can [create their own](#create-dimension-objects) `Dimension` objects to match those in the yaml file dimensions
2. if the Metric object has `ModelName` and `Level` dimensions only, it is optional to specify additional dimensions since these are considered [default dimensions](#default-dimensions), so: `add_counter('InferenceTimeInMS', value=2)` or `add_counter('InferenceTimeInMS', value=2, dimensions=["ModelName", "Level"])`
2. If the Metric object has `ModelName` and `Level` dimensions only, it is optional to specify additional dimensions since these are considered [default dimensions](#default-dimensions), so: `add_counter('InferenceTimeInMS', value=2)` or `add_counter('InferenceTimeInMS', value=2, dimensions=["ModelName", "Level"])`


### Default dimensions

Metrics will have a couple of default dimensions if not already specified.

If the metric is a type `Gauge`, `Histogram`, `Counter`, by default it will have:
Metrics will have a couple of default dimensions if not already specified (see the sketch below):
* `ModelName,{name_of_model}`
* `Level,Model`
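
As an illustrative sketch, the following two calls are therefore equivalent for a metric defined with only the default dimensions (assuming a handler `context` and the `metrics` cache are in scope):

```python
from ts.metrics.dimension import Dimension

# Relies on "ModelName" and "Level" being filled in automatically
metrics.add_counter("InferenceTimeInMS", value=2)

# Spells out the same default dimensions explicitly
metrics.add_counter(
    "InferenceTimeInMS",
    value=2,
    dimensions=[
        Dimension(name="ModelName", value=context.model_name),
        Dimension(name="Level", value="Model"),
    ],
)
```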

@@ -555,7 +560,6 @@ Function API
metric_type: MetricTypes
type for defining different operations, defaulted to gauge metric type for Percent metrics
"""
```

**Inferred unit**: `percent`
@@ -599,7 +603,7 @@ Function API

### Getting a metric

Users can get a metric from the cache. The Metric object is returned, so the user can access the methods of the Metric: (i.e. `Metric.update(value)`, `Metric.__str__`)
Users can get a metric from the cache. The `CachingMetric` object is returned, so the user can access its methods (e.g. `CachingMetric.add_or_update(value, dimension_values)`, `CachingMetric.update(value, dimensions)`).
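
For example, a usage sketch (assuming a gauge named `ExampleMetric` is already present in the cache; the dimension values are illustrative):

```python
from ts.metrics.metric_type_enum import MetricTypes

# Look up a previously registered metric and update it
example_metric = metrics.get_metric(
    metric_name="ExampleMetric", metric_type=MetricTypes.GAUGE
)
example_metric.add_or_update(value=1, dimension_values=["value1", "value2"])
```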

```python
def get_metric(self, metric_name: str, metric_type: MetricTypes) -> Metric:
@@ -723,12 +727,12 @@ class CustomHandlerExample:
```
- **[v0.6.1 - v0.8.1] to [> v0.8.1]**\
Replace the call to `add_metric` with `add_metric_to_cache`.
2. Starting [v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0), only metrics that are defined in the metrics config file(default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
2. In versions [[v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0) - [v0.9.0](https://github.com/pytorch/serve/releases/tag/v0.9.0)], only metrics that are defined in the metrics config file (default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
are either all logged to `ts_metrics.log` and `model_metrics.log` or made available via the [metrics API endpoint](metrics_api.md)
based on the `metrics_mode` configuration as described [above](#introduction).\
The default `metrics_mode` is `log` mode.\
This is unlike in previous versions where all metrics were only logged to `ts_metrics.log` and `model_metrics.log` except for `ts_inference_requests_total`, `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds`
which were only available via the metrics API endpoint.\
**Upgrade paths**:
- **[< v0.8.0] to [>= v0.8.0]**\
- **[< v0.8.0] to [v0.8.0 - v0.9.0]**\
Specify all the custom metrics added to the custom handler in the metrics configuration file as shown [above](#getting-started-with-torchserve-metrics).
7 changes: 4 additions & 3 deletions examples/custom_metrics/README.md
@@ -18,9 +18,10 @@ Run the commands given in the following steps from the root directory of the repository
- HandlerMethodTime
- ExamplePercentMetric

The custom metrics configuration file `metrics.yaml` in this example builds on top of the [default metrics configuration file](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) to include the custom metrics listed above.
The `config.properties` file in this example configures torchserve to use the custom metrics configuration file and sets the `metrics_mode` to `prometheus`. The custom handler
`mnist_handler.py` updates the metrics listed above.
The custom metrics configuration file [metrics.yaml](metrics.yaml) in this example builds on top of the [default metrics configuration file](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) to include the custom metrics listed above.
Note that `HandlerMethodTime` and `ExamplePercentMetric` are not defined in the [metrics configuration file](metrics.yaml), to demonstrate auto-detection of backend metrics.
The [config.properties](config.properties) file in this example configures TorchServe to use the custom metrics configuration file, sets the `metrics_mode` to `prometheus`, and enables `model_metrics_auto_detect`. The custom handler
[mnist_handler.py](mnist_handler.py) updates the metrics listed above.
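
Once the example is running with `metrics_mode=prometheus`, the auto-detected metrics should show up in the output of the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md), along these lines (illustrative output, not captured verbatim):

```
# TYPE HandlerMethodTime gauge
HandlerMethodTime{MethodName="preprocess",ModelName="mnist",Level="Model",Hostname="example-host",} 25.3
```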

Refer: [Custom Metrics](https://github.com/pytorch/serve/blob/master/docs/metrics.md#custom-metrics-api)\
Refer: [Custom Handler](https://github.com/pytorch/serve/blob/master/docs/custom_service.md#custom-handlers)
1 change: 1 addition & 0 deletions examples/custom_metrics/config.properties
@@ -1,5 +1,6 @@
metrics_mode=prometheus
metrics_config=examples/custom_metrics/metrics.yaml
model_metrics_auto_detect=true
models={\
"mnist": {\
"1.0": {\
8 changes: 1 addition & 7 deletions examples/custom_metrics/metrics.yaml
@@ -68,6 +68,7 @@ ts_metrics:

model_metrics:
# Dimension "Hostname" is automatically added for model metrics in the backend
# "HandlerMethodTime" and "ExamplePercentMetric" metrics are not defined here to show auto-detection of backend metrics
counter:
- name: InferenceRequestCount
unit: count
@@ -94,10 +95,3 @@ model_metrics:
- name: SizeOfImage
unit: kB
dimensions: [*model_name, *level]
- name: HandlerMethodTime
unit: ms
dimensions: ["MethodName", *model_name, *level]
histogram:
- name: ExamplePercentMetric
unit: percent
dimensions: [*model_name, *level]
6 changes: 2 additions & 4 deletions examples/custom_metrics/mnist_handler.py
@@ -41,10 +41,6 @@ def initialize(self, context):
name="InitializeCallCount",
value=1,
unit="count",
dimensions=[
Dimension(name="ModelName", value=context.model_name),
Dimension(name="Level", value="Model"),
],
metric_type=MetricTypes.COUNTER,
)

@@ -95,6 +91,7 @@ def preprocess(self, data):

# "add_time" will register the metric if not already present in metric cache,
# include the "ModelName" and "Level" dimensions by default and emit it
# Note: "HandlerMethodTime" is not defined in "metrics.yaml" and will be auto-detected
metrics.add_time(
name="HandlerMethodTime",
value=(preprocess_stop - preprocess_start) * 1000,
@@ -122,6 +119,7 @@ def postprocess(self, data):
)
# "add_percent" will register the metric if not already present in metric cache,
# include the "ModelName" and "Level" dimensions by default and emit it
# Note: "ExamplePercentMetric" is not defined in "metrics.yaml" and will be auto-detected
self.context.metrics.add_percent(
name="ExamplePercentMetric",
value=50,
frontend/server/src/main/java/org/pytorch/serve/metrics/Metric.java
@@ -12,7 +12,7 @@ public class Metric {

private static final Pattern PATTERN =
Pattern.compile(
"\\s*([\\w\\s]+)\\.([\\w\\s]+):([0-9\\-,.e]+)\\|#([^|]*)\\|#hostname:([^,]+),([^,]+)(,(.*))?");
"\\s*([\\w\\s]+)\\.([\\w\\s]+):([0-9\\-,.e]+)\\|#([^|]*)(\\|#type:([^|,]+))?\\|#hostname:([^,]+),([^,]+)(,(.*))?");

@SerializedName("MetricName")
private String metricName;
@@ -23,9 +23,18 @@ public class Metric {
@SerializedName("Unit")
private String unit;

@SerializedName("Type")
private String type;

@SerializedName("Dimensions")
private List<Dimension> dimensions;

@SerializedName("DimensionNames")
private List<String> dimensionNames;

@SerializedName("DimensionValues")
private List<String> dimensionValues;

@SerializedName("Timestamp")
private String timestamp;

@@ -41,13 +50,15 @@ public Metric(
String metricName,
String value,
String unit,
String type,
String hostName,
Dimension... dimensions) {
this.metricName = metricName;
this.value = value;
this.unit = unit;
this.type = type;
this.hostName = hostName;
this.dimensions = Arrays.asList(dimensions);
this.setDimensions(Arrays.asList(dimensions));
this.timestamp =
String.valueOf(TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
}
@@ -92,12 +103,34 @@ public void setUnit(String unit) {
this.unit = unit;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public List<Dimension> getDimensions() {
return dimensions;
}

public List<String> getDimensionNames() {
return this.dimensionNames;
}

public List<String> getDimensionValues() {
return this.dimensionValues;
}

public void setDimensions(List<Dimension> dimensions) {
this.dimensions = dimensions;
this.dimensionNames = new ArrayList<String>();
this.dimensionValues = new ArrayList<String>();
for (Dimension dimension : dimensions) {
this.dimensionNames.add(dimension.getName());
this.dimensionValues.add(dimension.getValue());
}
}

public String getTimestamp() {
@@ -120,9 +153,10 @@ public static Metric parse(String line) {
metric.setUnit(matcher.group(2));
metric.setValue(matcher.group(3));
String dimensions = matcher.group(4);
metric.setHostName(matcher.group(5));
metric.setTimestamp(matcher.group(6));
metric.setRequestId(matcher.group(8));
metric.setType(matcher.group(6));
metric.setHostName(matcher.group(7));
metric.setTimestamp(matcher.group(8));
metric.setRequestId(matcher.group(10));

if (dimensions != null) {
String[] dimension = dimensions.split(",");
frontend/server/src/main/java/org/pytorch/serve/metrics/MetricCache.java
@@ -29,11 +29,7 @@ private MetricCache() throws FileNotFoundException {
return;
}

MetricBuilder.MetricMode metricsMode = MetricBuilder.MetricMode.LOG;
String metricsConfigMode = ConfigManager.getInstance().getMetricsMode();
if (metricsConfigMode != null && metricsConfigMode.toLowerCase().contains("prometheus")) {
metricsMode = MetricBuilder.MetricMode.PROMETHEUS;
}
MetricBuilder.MetricMode metricsMode = ConfigManager.getInstance().getMetricsMode();

if (this.config.getTs_metrics() != null) {
addMetrics(
@@ -106,6 +102,24 @@ public static MetricCache getInstance() {
return instance;
}

public IMetric addAutoDetectMetricBackend(Metric parsedMetric) {
// The Hostname dimension is included by default for backend metrics
List<String> dimensionNames = parsedMetric.getDimensionNames();
dimensionNames.add("Hostname");

IMetric metric =
MetricBuilder.build(
ConfigManager.getInstance().getMetricsMode(),
MetricBuilder.MetricType.valueOf(parsedMetric.getType()),
parsedMetric.getMetricName(),
parsedMetric.getUnit(),
dimensionNames);

this.metricsBackend.putIfAbsent(parsedMetric.getMetricName(), metric);

return metric;
}

public IMetric getMetricFrontend(String metricName) {
return metricsFrontend.get(metricName);
}
@@ -81,12 +81,10 @@ public void run() {
} else {
if (this.metricCache.getMetricFrontend(metric.getMetricName()) != null) {
try {
List<String> dimensionValues = new ArrayList<String>();
for (Dimension dimension : metric.getDimensions()) {
dimensionValues.add(dimension.getValue());
}
// Frontend metrics by default have the last dimension as Hostname
List<String> dimensionValues = metric.getDimensionValues();
dimensionValues.add(metric.getHostName());

this.metricCache
.getMetricFrontend(metric.getMetricName())
.addOrUpdate(
frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java
@@ -43,6 +43,7 @@
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.io.IOUtils;
import org.pytorch.serve.metrics.MetricBuilder;
import org.pytorch.serve.servingsdk.snapshot.SnapshotSerializer;
import org.pytorch.serve.snapshot.SnapshotSerializerFactory;
import org.slf4j.Logger;
@@ -68,6 +69,7 @@ public final class ConfigManager {
private static final String TS_NUMBER_OF_GPU = "number_of_gpu";
private static final String TS_METRICS_CONFIG = "metrics_config";
private static final String TS_METRICS_MODE = "metrics_mode";
private static final String TS_MODEL_METRICS_AUTO_DETECT = "model_metrics_auto_detect";
private static final String TS_DISABLE_SYSTEM_METRICS = "disable_system_metrics";

// IPEX config option that can be set at config.properties
@@ -412,8 +414,23 @@ public String getTorchRunLogDir() {
return torchrunLogDir;
}

public String getMetricsMode() {
return getProperty(TS_METRICS_MODE, "log");
public MetricBuilder.MetricMode getMetricsMode() {
String metricsMode = getProperty(TS_METRICS_MODE, "log");
try {
return MetricBuilder.MetricMode.valueOf(
metricsMode.replaceAll("\\s", "").toUpperCase());
} catch (IllegalArgumentException | NullPointerException e) {
logger.error(
"Configured metrics mode \"{}\" not supported. Defaulting to \"{}\" mode: {}",
metricsMode,
MetricBuilder.MetricMode.LOG,
e);
return MetricBuilder.MetricMode.LOG;
}
}

public boolean isModelMetricsAutoDetectEnabled() {
return Boolean.parseBoolean(getProperty(TS_MODEL_METRICS_AUTO_DETECT, "false"));
}

public boolean isSystemMetricsDisabled() {