We propose a set of guidelines to build consistent and readable names for metrics. The guidelines cover how to build a good hierarchical name, the syntax of elements in a name, the usage of dimensions (attributes), pluralization and suffixes.
This set of “rules” has been built by looking at naming conventions and best practices used by other software (e.g. Prometheus, Datadog) and standards (OpenTelemetry, OpenMetrics).
They follow the OpenTelemetry guidelines most closely, with some Elasticsearch (ES) specifics.
A metric name should be composed of elements, delimited by separators, that organize it into a hierarchy.
Elements should be lower-case and use underscore (`_`) to combine words within the same element (e.g. `blob_cache`).
The separator character is dot (`.`).
The hierarchy should be built by putting "more common" elements at the beginning, in order to facilitate the creation of new metrics under a common namespace. Each element in the metric name specializes or describes the prefix that precedes it. Rule of thumb: you could truncate the name at any segment, and what you're left with is something that makes sense by itself.
Example:
- Prefer `es.indices.docs.deleted.total` to `es.indices.total.deleted.docs`.
- This way you can later add `es.indices.docs.total`, `es.indices.docs.ingested.total`, etc.
Prefix metrics:
- Always use `es` as our root application name: this gives us a separate namespace, avoids any possibility of clashes with other metrics, and allows quick identification of Elasticsearch metrics on a dashboard.
- Follow the root prefix with a simple module name, team, or area of code, e.g. `snapshot`, `repositories`, `indices`, `threadpool`. Notice the mix of singular and plural: this is intentional here, to closely reflect the existing names in the codebase (e.g. `reindex` and `indices`).
- In building a metric name, look for existing prefixes (e.g. module name and/or area of code, such as `blob_cache`) and for existing sub-elements as well (e.g. `error`) to build a good, consistent name. For example, prefer the consistent use of `error.total` rather than introducing `failures`, `failed.total` or `errors`.
- Avoid having sub-metrics under a name that is also a metric (e.g. do not create names like `es.repositories.elements` and `es.repositories.elements.utilization`; use `es.repositories.element.total` and `es.repositories.element.utilization` instead). Such metrics are hard to handle well in Elasticsearch or in some internal structures (e.g. nested maps).
Keep the hierarchy compact: do not add elements if you don’t need to. There is a description field when registering a metric; prefer using that for any explanation.
For example, if emitting existing metrics from node stats, do not use the whole “object path”, but choose the most significant terms.
The metric name can be generated but there should be no dynamic or variable content in the name: that content belongs to a dimension (attributes/labels).
- Node name, node id, cluster id, etc. are all considered dynamic content that belongs to attributes, not to the metric name.
- When there are different "flavors" of a metric (e.g. `s3`, `azure`, etc.), use an attribute rather than inserting it in the metric name (see the sketch after this list).
- Rule of thumb: you should be able to do aggregations (e.g. sum, avg) across a dimension of a given metric (without the need to aggregate over different metric names); on the other hand, any aggregation across any dimension of a given metric should be meaningful.
- There might be exceptions, of course. For example:
  - When similar metrics have significantly different implementations/related metrics:
    - If we have only common metrics like `es.repositories.element.total`, `es.repositories.element.utilization`, `es.repositories.writes.total` for every blob storage implementation, then `s3`/`azure` should be an attribute.
    - If we have specific metrics, e.g. for s3 storage classes, prefer using prefixed metric names for the specific metrics: `es.repositories.s3.deep_archive_access.total` (but keep `es.repositories.elements`).
  - When you have a finite and fixed set of names, it might be OK to have them in the name (e.g. "young" and "old" for GC generations).
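Referring back to the "flavors" rule above, here is a minimal sketch of recording the repository type as an attribute rather than as part of the name. It assumes a `MeterRegistry`/`LongCounter` API along the lines of `org.elasticsearch.telemetry.metric`; the class name, the exact `registerLongCounter` signature, the `repo_type` attribute key and the `"count"` unit string are illustrative assumptions, not a verbatim reference.

```java
import java.util.Map;

import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

// Hypothetical example class, not an actual Elasticsearch component.
public class BlobStoreMetrics {
    private final LongCounter elementTotal;

    public BlobStoreMetrics(MeterRegistry registry) {
        // One metric name shared by all blob store implementations
        // (signature assumed: registerLongCounter(name, description, unit)).
        this.elementTotal = registry.registerLongCounter(
            "es.repositories.element.total",
            "Total number of elements stored in repositories",
            "count"
        );
    }

    public void onElementStored(String repoType) {
        // The implementation ("s3", "azure", ...) is an attribute, so a sum
        // or average across repo_type is still a meaningful aggregation.
        elementTotal.incrementBy(1, Map.of("repo_type", repoType));
    }
}
```

Registered this way, a dashboard can sum `es.repositories.element.total` over all repository types, or break it down by `repo_type`, without having to union per-flavor metric names.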
The metric name should NOT include its unit. Instead, the associated physical quantity should be added as a suffix, possibly following the general semantic names (link). Examples (see the sketch after this list):
- `es.process.jvm.collection.time` instead of `es.process.jvm.collection.seconds`
- `es.process.mem.virtual.size`, `es.indices.storage.size` (instead of `es.process.mem.virtual.bytes`, `es.indices.storage.bytes`)
  - In case `size` has a known upper limit, consider using `usage` (e.g. `es.process.jvm.heap.usage` when there is an `es.process.jvm.heap.limit`)
- `es.indices.storage.write.io`, instead of `es.indices.storage.write.bytes_per_sec`
- These can all be composed with the suffixes below, e.g. `es.process.jvm.collection.time.total` and `es.indices.storage.write.total` to represent the monotonic sum of time spent in GC and the total number of bytes written to indices, respectively.
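As a sketch of where the unit goes instead of the name (again assuming a `registerLongCounter(name, description, unit)` signature and an `incrementBy(long)` method on the returned counter; both are assumptions about the telemetry API, and the class is purely illustrative):

```java
import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

// Hypothetical example class.
class IndicesStorageMetrics {
    private final LongCounter bytesWritten;

    IndicesStorageMetrics(MeterRegistry registry) {
        // The unit ("bytes") is declared at registration time and stays out of the name.
        this.bytesWritten = registry.registerLongCounter(
            "es.indices.storage.write.total",   // not es.indices.storage.write.bytes
            "Total number of bytes written to index storage",
            "bytes"
        );
    }

    void onWrite(long bytes) {
        bytesWritten.incrementBy(bytes);
    }
}
```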
Suffixes:
- Use `total` as a suffix for monotonic metrics (always-increasing counters), e.g. `es.indices.docs.deleted.total`.
  - Note: even though an async counter reports a total cumulative value, it is still monotonic.
- Use `current` to represent non-monotonic metrics (like gauges, upDownCounters).
  - e.g. `current` vs `total`: we can have `es.process.jvm.classes.loaded.current` to express the number of classes currently loaded by the JVM, and the total number of classes loaded since the JVM started as `es.process.jvm.classes.loaded.total` (see the sketch after this list).
- Use `ratio` to represent the ratio of two measures with identical units (or unit-less measures), or measures that represent a fraction in the range [0, 1].
  - Exception: consider using `utilization` when the ratio is between a usage and its limit, e.g. the ratio between `es.process.jvm.heap.usage` and `es.process.jvm.heap.limit` should be `es.process.jvm.heap.utilization`.
- Use `status` to represent enum-like gauges, e.g. `es.health.overall.red.status` has values 1/0 to represent true/false.
- Use `usage` to represent the amount used out of the known resource size.
- Use `size` to represent the overall size of the resource measured.
- Use `utilization` to represent the fraction of usage out of the overall size of a resource measured.
- Use `histogram` to represent instruments of type histogram.
- Use `time` to represent the passage of time.
- If it has a unit of measure, then it should not be plural (and also not include the unit of measure, see above). Examples: `es.process.jvm.collection.time`, `es.process.mem.virtual.usage`, `es.indices.storage.utilization`.
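A sketch contrasting `current` and `total` for the JVM class-loading example above. The observable registration methods (`registerLongGauge` / `registerLongAsyncCounter` taking a supplier of `LongWithAttributes`) and the `"count"` unit string are assumptions about the telemetry API; the metric names follow the guideline.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.util.Map;

import org.elasticsearch.telemetry.metric.LongWithAttributes;
import org.elasticsearch.telemetry.metric.MeterRegistry;

// Hypothetical example class.
class JvmClassesMetrics {
    JvmClassesMetrics(MeterRegistry registry) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();

        // "current": non-monotonic, can go up and down as classes are unloaded.
        registry.registerLongGauge(
            "es.process.jvm.classes.loaded.current",
            "Number of classes currently loaded by the JVM",
            "count",
            () -> new LongWithAttributes(classes.getLoadedClassCount(), Map.of())
        );

        // "total": monotonic cumulative value since JVM start; it keeps the
        // total suffix even though it is reported by an async (observable) counter.
        registry.registerLongAsyncCounter(
            "es.process.jvm.classes.loaded.total",
            "Total number of classes loaded since the JVM started",
            "count",
            () -> new LongWithAttributes(classes.getTotalLoadedClassCount(), Map.of())
        );
    }
}
```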
Attribute names should follow the same rules. In particular, these rules apply to attributes too:
- elements and separators
- hierarchy/namespaces
- units
- pluralization (when an attribute represents a measurement)
For pluralization, when an attribute represents an entity, the attribute name should be singular (e.g. `es.security.realm_type`, not `es.security.realms_type` or `es.security.realm_types`), unless it represents a collection (e.g. `es.rest.request_headers`).
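A tiny sketch of this pluralization rule applied to an attribute key (the counter, the realm value and the `incrementBy(long, Map)` signature are illustrative assumptions):

```java
import java.util.Map;

import org.elasticsearch.telemetry.metric.LongCounter;

// Hypothetical example class.
class AttributeNamingExample {
    static void onAuthentication(LongCounter authenticationTotal, String realmType) {
        // "realm_type" is singular because each measurement refers to a single
        // realm; a plural key would only fit an attribute carrying a collection
        // (e.g. es.rest.request_headers).
        authenticationTotal.incrementBy(1, Map.of("es.security.realm_type", realmType));
    }
}
```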
You can inspect all previously registered metric names with:
`./gradlew run -Dtests.es.logger.org.elasticsearch.telemetry.apm=debug`
This should help you find the already registered group that your metric might fit into.