Propose support for UTF-8 in metric and label names
Signed-off-by: Owen Williams <[email protected]>
ywwg committed Sep 21, 2023
1 parent 6169c90 commit 165f65f
Showing 1 changed file with 261 additions and 0 deletions.
261 changes: 261 additions & 0 deletions proposals/2023-08-21-utf8.md
# UTF-8 Support for Metric and Label names

* **Owners:**
  * `<@author: [email protected]>`

* **Implementation Status:** `Proof-of-concept implemented`

* **Related Issues and PRs:**
  * [GH Issue](https://github.com/prometheus/prometheus/issues/12630)
  * [PR](https://github.com/grafana/mimir-prometheus/pull/476) (TODO: needs to be rebased on upstream prom)

* **Other docs or links:**
  * [Background Discussion / Justification](https://docs.google.com/document/d/1yFj5QSd1AgCYecZ9EJ8f2t4OgF2KBZgJYVde-uzVEtI/edit). Please read this document first for more information on the chosen solution.

> TL;DR: We propose to add support for arbitrary UTF-8 characters in Prometheus metric and label names. While tsdb and the Prometheus code already support arbitrary UTF-8 strings, PromQL and the exposition format need a quoting syntax to make these values distinguishable.

## Why

We wish to use Prometheus as a backing database for other metrics implementations, such as OpenTelemetry, that use Prometheus-illegal characters like '`.`'. Establishing a quoting syntax avoids the need for verbose and unreadable character encoding schemes, as well as non-roundtrippable replacement schemes like using an underscore for every unsupported character.

### Pitfalls of the current solution

The current recommendation for users wanting to handle invalid characters is to replace all those characters with underscores. This substitution cannot be reversed, creating the possibility for name collisions. Furthermore, the change in metric names between original write and eventual read can be confusing for end users.

## Goals

* Allow arbitrarily-named metrics and labels to be submitted to Prometheus.
* Allow as many UTF-8 characters as possible to exist literally inline, with escaping required for characters that are unprintable or cause parsing problems.
* Allow these metric and label names to be easily queried back with PromQL.
* Ensure that Go templating supports quoted label names.
* Provide a graceful backwards-compatible fallback for query clients that do not support UTF-8 metric and label names.

### Audience

The audience for this change is very broad. It would allow all end-users of Prometheus to use richer metric names. Tool and service authors will want to be aware of the change and support the new character set.

## Non-Goals

We do not directly address the issue of adding function-like syntaxes, such as `metricname.sum`. The changes proposed here are forward-compatible with such a new syntax, as long as the new syntax is outside of the quoted metric names.

We do not propose a syntax for selecting metric names other than by basic equality. The changes proposed here are forward-compatible with other possible future selection mechanisms, for instance operator prefixes on the quoted metric name.

## How

### But First: Differences between PromQL and OpenMetrics Exposition

Changes will need to be made to both PromQL and the exposition format. It is important to note that while these two languages have surface similarities, there are important differences between them that need to be accounted for:

* Quotes:
  * **PromQL**: Strings may be quoted with single or double quotes. Strings may also be quoted with backticks to avoid escape sequences (raw strings).
  * **Exposition Format**: Supports only double quotes.
* Equal Sign:
  * **PromQL**: The equal sign (`=`) is a label matching operator, and there are other label matching operators for other operations.
  * **Exposition Format**: The equal sign is an assignment operator that sets a label to a specific value.
* String Escaping:
  * **PromQL**: Strings are unquoted using the Go `strconv.Unquote` function.
  * **Exposition Format**: Characters are escaped with a custom backslash format.
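To make the PromQL side of the escaping difference concrete, here is a minimal sketch that runs a few literals through the Go standard library's `strconv.Unquote`. The helper name `tryUnquote` is ours, not part of any Prometheus package. Note that the stock `strconv.Unquote` treats a single-quoted literal as a Go character literal, so a multi-character single-quoted string is rejected; a PromQL implementation would need its own handling for that quote style.

```go
package main

import (
	"fmt"
	"strconv"
)

// tryUnquote is a hypothetical helper that wraps strconv.Unquote and reports
// whether the literal was accepted.
func tryUnquote(lit string) (string, bool) {
	s, err := strconv.Unquote(lit)
	return s, err == nil
}

func main() {
	literals := []string{
		`"my \"quoted\" metric"`, // double quotes with escaped inner quotes
		"`a\\.regex.*`",          // backticks: raw string, backslash kept as-is
		`"\u263a"`,               // Unicode escape inside double quotes
		`'my "quoted" metric'`,   // single quotes: rejected by stock strconv.Unquote
	}
	for _, lit := range literals {
		s, ok := tryUnquote(lit)
		fmt.Printf("%s -> %q (accepted=%v)\n", lit, s, ok)
	}
}
```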

### What will the new syntax look like?

**tl;dr:**

`{"my.noncompliant.metric", "noncompliant.label"="value"}`

#### Oh no, I hate it

A full discussion of rejected alternatives can be found below, but a common first reaction is, why move the metric name inside the braces? Why not leave it outside, like: `"my.noncompliant.metric"{"label"="value"}`.

There are some specific risks with this alternative approach, but we also wanted to weigh the value of a syntax that is **more familiar to current experts** vs **more approachable for new users**. We take the optimistic position that there are more potential new users of Prometheus than there are existing users, and therefore decided to give more weight to that framing.

The previous syntax makes sense because it resembles a function call, the way that `foo{bar}` is like `foo(bar)`. The metric name is the primary selector, and the braces enclose parameters on that selection. If we were to add the additional quote characters, this frame of reference is lost. For new users, visually parsing the query, with a brace in the middle of a noisy series of quoted values, is likely to be confusing.

Instead, we believe that it is more intuitive and easier to teach new users to think of the brace as **the characters that surround the parameters for selecting data**, where the "metric name" is **a special case** of that kind of selector. This is similar to the approach chosen for [LogQL](https://grafana.com/docs/loki/latest/query/) and [TraceQL](https://grafana.com/docs/tempo/latest/traceql/), which do not have a special case "metric name" equivalent and thus have no items outside the curly braces.

Furthermore, PromQL users will already know that it's possible to put the metric name inside the braces, for the purpose of using an operator to enable richer metric selection, like: `{__name__=~"my\\.metric.*"}`. The chosen syntax serves as a shortening of this operation, and is forward-compatible with future metric matching prefix operators if we decide to implement them.

### Syntax Changes

We propose the following change in syntax to PromQL and the exposition format:

* Label names may be quoted, either double quotes for the exposition format, or singles/doubles/backticks for PromQL.
* A quoted string inside the label set (`{}`) without a label matching operator is default-assigned to the label name `__name__` (the metric name).
* Two or more strings without a label matching operator are a syntax error.
* Style prefers, but does not enforce, that the string without a label matching operator is the first term in the label set.
* Strings in quotes will be unquoted using the mechanism for each language: strconv.Unquote for PromQL, and the OpenMetrics escaping mechanism for the exposition format.
* In the exposition format, the TYPE, HELP, and UNIT lines can have double quotes around the metric name and similarly use the OpenMetrics escaping mechanism.
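The default-assignment rule can be sketched as a small post-parse step. This is an illustration only: the `term` and `matcher` types and the `resolveTerms` function are hypothetical and do not correspond to the real PromQL parser, which works on a generated grammar.

```go
package main

import (
	"errors"
	"fmt"
)

// term is a hypothetical parsed element of a label set. A bare quoted string
// has an empty Op and carries its text in Value.
type term struct {
	Name  string
	Op    string // "=", "!=", "=~", "!~", or "" for a bare quoted string
	Value string
}

// matcher is a hypothetical resolved label matcher.
type matcher struct {
	Name, Op, Value string
}

var errMultipleNames = errors.New("more than one string without a label matching operator")

// resolveTerms assigns the single bare quoted string to __name__ and rejects
// label sets that contain more than one bare string.
func resolveTerms(terms []term) ([]matcher, error) {
	out := make([]matcher, 0, len(terms))
	seenName := false
	for _, t := range terms {
		if t.Op != "" {
			out = append(out, matcher{t.Name, t.Op, t.Value})
			continue
		}
		if seenName {
			return nil, errMultipleNames
		}
		seenName = true
		out = append(out, matcher{"__name__", "=", t.Value})
	}
	return out, nil
}

func main() {
	// Corresponds to {"my.dotted.metric", region="east"}.
	ms, err := resolveTerms([]term{
		{Value: "my.dotted.metric"},
		{Name: "region", Op: "=", Value: "east"},
	})
	fmt.Println(ms, err)
}
```

Because the bare string is detected by the absence of an operator rather than by position, this sketch accepts the metric name anywhere in the label set, matching the "style prefers, but does not enforce" rule above.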

### Syntax Examples:

* Basic quoted metric: `{"my.dotted.metric", region="east"}`

* Quoted label name: `{"my.dotted.metric", "http.status"=~"5.."}`

* Classic metrics do not need to be treated differently:
`sum(rate(regular_metric{region="east"}[5m]))`

* If the user decides to put a classic metric inside the selector, it still needs to be quoted:
`sum(rate({"regular_metric", region="east"}[5m]))`

* Escape syntax if the metric has a quote: `sum(rate({"my \"quoted\" metric", region="east"}[5m]))` or use single quotes in PromQL (not available in the exposition format): `sum(rate({'my "quoted" metric', region="east"}[5m]))`

* Recording Rule Syntax: `sum(rate({"my.dotted.metric:errors:rate5m", region="east"}[5m]))`. Recording rule syntax is a naming convention, not a programmatically-meaningful operator syntax.

* Unicode escaping: `{"\u263a"}`

* Compatible with possible future Native Histogram syntax (not being proposed here):
`{"my.metric.with.periods", region="east"}.sum`

* Compatible with possible future metric name operator syntax via prefixes (not being proposed here):
```{~`a\.regex.*selector.*`, region="east"}``` or `{!"not.this.name", region="east"}`

### Unusual / Complex Characters

Depending on the parsing implementation, there may be UTF-8 characters that, when encountered as native values in quotes, may cause problems with the lexers and parsers. These characters could include whitespace (especially newlines) or complex combiner characters. If such characters exist, we will require that they be escaped with `\u` rather than allow them to exist literally. As development of this proposal progresses, we will discover which characters will fall into this category.

There may also be differences between PromQL and the exposition format as to which characters must be escaped. To reduce confusion, we may require a character to be escaped in both languages even when only one of them strictly needs the escape.

### Templating syntax

Templating is used in alerts, and for dashboards that want to reference metric values. Go templating already supports quoting by way of the `index` keyword. Whereas labels are usually denoted as `{{ $labels.my_label_name }}`, the quote syntax is `{{ index $labels "my.label.name" }}`. No code changes are required.
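The following sketch shows both access styles with the standard `text/template` package. In Prometheus the `$labels` variable is provided by the templating environment; here we define it by hand from a hypothetical `Labels` field so the example is self-contained.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderLabels renders one legacy label via dot access and one dotted label
// via the built-in index function.
func renderLabels(labels map[string]string) (string, error) {
	const text = `{{ $labels := .Labels }}{{ $labels.region }} {{ index $labels "http.status" }}`
	tmpl, err := template.New("alert").Parse(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	err = tmpl.Execute(&sb, struct{ Labels map[string]string }{labels})
	return sb.String(), err
}

func main() {
	out, err := renderLabels(map[string]string{"region": "east", "http.status": "500"})
	fmt.Println(out, err)
}
```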

## Backwards-compatibility

There are a few situations where we need to consider backwards compatibility:

1. Old metrics provider, new backend: Old providers will not serve UTF-8 metric names, so nothing needs to be done here.
2. New metrics provider, old backend: Old versions of Prometheus will throw an error if they see the new syntax. Therefore we must provide a way for the scraped target to serve previously-invalid metrics in a form that the old database can process.
3. New backend, old query client: Errors will occur if an old client is fed UTF-8 metric names. Therefore we must provide a way for Prometheus to escape metric names such that old clients can query for and read them.
4. Old backend, new query client: as with the first case, nothing needs to be done.

### Text escaping

We propose an escaping syntax for metric names that contain UTF-8 characters for systems that do not support UTF-8. This system can be used in queries and responses to provide compatibility between the systems.

* Prefix the string with `U__`.
* All non-valid characters (i.e. characters other than letters, numbers, and underscores) will be encoded as the hexadecimal Unicode code point surrounded by underscores, like `_1F60A_`.
* All pre-existing underscores will become doubled: `__`.
* If a string already starts with `U__`, its leading `U` must be escaped as well, so the name `U__` becomes `U___55_____`. (That's the `U__` prefix + `_55_` (for `U`) + `__` + `__`.)
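The rules above can be sketched as follows. This is our reading of the scheme, not the proof-of-concept code: `escapeName` and `isLegacyChar` are hypothetical names, "letters" is assumed to mean ASCII letters, hex digits are emitted in upper case, and the function escapes unconditionally rather than first checking whether the name needs it.

```go
package main

import (
	"fmt"
	"strings"
)

// isLegacyChar reports whether r may appear unescaped: ASCII letters, digits,
// and underscore (an assumption based on the wording of the proposal).
func isLegacyChar(r rune) bool {
	return r == '_' ||
		(r >= 'a' && r <= 'z') ||
		(r >= 'A' && r <= 'Z') ||
		(r >= '0' && r <= '9')
}

// escapeName applies the proposed U__ escaping to a metric or label name.
func escapeName(name string) string {
	var b strings.Builder
	b.WriteString("U__")
	startsWithPrefix := strings.HasPrefix(name, "U__")
	for i, r := range name {
		switch {
		case i == 0 && startsWithPrefix:
			// A leading "U__" in the input is disambiguated by encoding the U.
			fmt.Fprintf(&b, "_%X_", r)
		case r == '_':
			b.WriteString("__") // pre-existing underscores are doubled
		case isLegacyChar(r):
			b.WriteRune(r)
		default:
			fmt.Fprintf(&b, "_%X_", r) // code point in hex between underscores
		}
	}
	return b.String()
}

func main() {
	for _, n := range []string{"my.dotted_metric", "U__", "😊"} {
		fmt.Printf("%q -> %q\n", n, escapeName(n))
	}
}
```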

#### Mixed-Block Scenario

We must consider an edge case in which a newer client persists metrics to disk in an older database that does not support UTF-8. Those metrics will be written to disk with the U__ escaping format. If, later, the user upgrades their database software, new metrics will be written with the native UTF-8 characters. At query time, there will be a problem: newer blocks were written with UTF-8 and older blocks were written with the escaping format. The query code will not know which encoding to look for.

To avoid this confusion we propose to bump the version number in the tsdb meta.json file. On a per-block basis the query code can check the version number and look for UTF-8 metric names in native encoding for bumped version number 2, or in U__-escaped naming for earlier version 1.
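A per-block lookup along these lines might look like the sketch below. The `blockMeta` struct is a hypothetical minimal view of meta.json that reads only a `version` field, and `nameForBlock` takes the escaping function as a parameter so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blockMeta is a hypothetical minimal view of a block's meta.json.
type blockMeta struct {
	Version int `json:"version"`
}

// nameForBlock returns the form of name to look up in a block: the native
// UTF-8 name for version 2 and later, the escaped name for earlier versions.
func nameForBlock(metaJSON []byte, name string, escape func(string) string) (string, error) {
	var m blockMeta
	if err := json.Unmarshal(metaJSON, &m); err != nil {
		return "", err
	}
	if m.Version >= 2 {
		return name, nil
	}
	return escape(name), nil
}

func main() {
	// Stand-in for the real escaping function; for illustration only.
	stub := func(s string) string { return "escaped(" + s + ")" }
	for _, meta := range []string{`{"version":1}`, `{"version":2}`} {
		got, err := nameForBlock([]byte(meta), "my.metric", stub)
		fmt.Println(meta, got, err)
	}
}
```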

Additionally, this means that an old metric or label name containing a valid escape sequence by chance would be parsed differently after enabling UTF-8 support. For example, a metric called `U___263a_` would appear as `☺` after the upgrade (if read from an old storage chunk, scraped from an old target, or received from an old remote-write client). This is an extremely unlikely scenario in practice so we believe a theoretically possible implicit name change is acceptable.

Lastly, if a metric name begins with `U__` but does not pass the unescaping algorithm, then we assume it was meant to be read as-is and let it pass through unchanged.
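A decoder with this pass-through behavior could be sketched as below. As with any illustration of the scheme, `unescapeName` is a hypothetical name, and the sketch accepts hex digits in either case. Any input that lacks the prefix or contains a malformed escape is returned untouched.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// unescapeName reverses the proposed U__ escaping. Input that does not start
// with "U__", or that contains a malformed escape, is returned unchanged.
func unescapeName(s string) string {
	if !strings.HasPrefix(s, "U__") {
		return s
	}
	rest := s[len("U__"):]
	var b strings.Builder
	for i := 0; i < len(rest); {
		if rest[i] != '_' {
			b.WriteByte(rest[i]) // ordinary character, copied as-is
			i++
			continue
		}
		if i+1 < len(rest) && rest[i+1] == '_' {
			b.WriteByte('_') // doubled underscore decodes to one underscore
			i += 2
			continue
		}
		// Otherwise expect hex digits terminated by a closing underscore.
		end := strings.IndexByte(rest[i+1:], '_')
		if end <= 0 {
			return s
		}
		cp, err := strconv.ParseUint(rest[i+1:i+1+end], 16, 32)
		if err != nil || !utf8.ValidRune(rune(cp)) {
			return s
		}
		b.WriteRune(rune(cp))
		i += end + 2
	}
	return b.String()
}

func main() {
	for _, n := range []string{"U___263a_", "U___55_____", "U__bad_zz_", "plain_name"} {
		fmt.Printf("%q -> %q\n", n, unescapeName(n))
	}
}
```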

### Scrape Time

Prometheus can scrape metrics either via HTTP text or protobuf.

In the case of text scraping, we propose a version bump in the [TextVersion and OpenMetricsVersion numbers](https://github.com/prometheus/common/blob/main/expfmt/expfmt.go) so that a Prometheus scraper can signal its support of the new syntaxes. For TextVersion, this would be version 0.0.5. For OpenMetrics this would either be version 1.1.0 or 2.0.0, depending on whether this is considered a breaking change.

For protobuf scraping, we would add a new parameter to the content-type tuple list, perhaps `validchars=UTF-8`.

Replies to clients not providing these values would be escaped with the Text Escaping method.
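The server-side check could be sketched as follows, using the standard `mime.ParseMediaType` on a single media-type entry. The `validchars` parameter and the `0.0.5` text version are the values floated above and are not final; `supportsUTF8` is a hypothetical helper, and the OpenMetrics case is omitted because its version number is still undecided. A real implementation would also have to handle a full `Accept` header with several comma-separated entries.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// supportsUTF8 reports whether one media-type entry signals UTF-8 name
// support under the negotiation proposed in this section.
func supportsUTF8(entry string) bool {
	mediaType, params, err := mime.ParseMediaType(entry)
	if err != nil {
		return false
	}
	switch mediaType {
	case "application/vnd.google.protobuf":
		// Proposed (not final) extra tuple term for protobuf scraping.
		return strings.EqualFold(params["validchars"], "UTF-8")
	case "text/plain":
		// Proposed (not final) bumped text format version.
		return params["version"] == "0.0.5"
	}
	return false
}

func main() {
	entries := []string{
		"application/vnd.google.protobuf; proto=io.prometheus.client.MetricFamily; encoding=delimited; validchars=UTF-8",
		"text/plain; version=0.0.5",
		"text/plain; version=0.0.4",
	}
	for _, e := range entries {
		fmt.Println(supportsUTF8(e), e)
	}
}
```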

### Query Time

Prometheus will also need to support queries and responses from old query clients that do not support the new syntax. We will create a new boolean HTTP header called AllowUTF8Names. When this value is true in the request, and the feature is enabled in Prometheus, UTF-8 support will be enabled. For any client that does not pass the AllowUTF8Names header, any HTTP request that can return metric names (including simple queries) will have the UTF-8 metric names escaped.

### HTTP quoting

The HTTP endpoints will need to support HTTP-unquoting metric names in the URL parameters and POST data of HTTP API calls that take metric names as an argument.

## Implementation

Implementing this feature will require updates to several Prometheus libraries. These updates will be written so that functionality does not change for existing users and builds, and will only be enabled when certain flags are flipped or options turned on. There will be minimal read-only backwards compatibility for clients that do not support the newly supported character set and syntax.

This implementation summary is based on the partial proof-of-concept work and may need to expand to accommodate any additional edge cases that are encountered. We do not expect the scope of work to expand significantly.

Main tasks:

* Add flags / options to libraries and binaries to enable the new behavior.
* Change / update code locations that check for a valid metric name, based on the flag settings and header values.
* Bump the version number in the tsdb code, and at query time select the metric name encoding (native or escaped) depending on the version number in each meta.json file.
* Update text generators to generate the new syntax when a metric must be quoted.
* Update PromQL parsers and exposition format parsers to support the new syntax.
* Add an HTTP header so that clients can indicate their support for UTF-8.
* Add a tuple term to the protobuf content type so clients can indicate their support for UTF-8.
* Add a way to escape metric names for clients that do not support the full character set.
* Update endpoint code to support HTTP-unquoting metric names in the URL parameters and POST data of HTTP API calls that take metric names as an argument.

### prometheus/common

We will add a boolean argument to model.IsValidMetricName that switches validation logic to allow UTF-8 names (and therefore will only check for 0 length). Because this is a public function, all users of this library will need to add an argument to their call sites.
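As a rough illustration of the switched validation, the sketch below uses a local function rather than the real `model.IsValidMetricName`, whose final signature is not settled here. The legacy branch uses the documented metric name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`; the UTF-8 branch checks only for a non-empty name, as described above.

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyNameRE is the classic Prometheus metric name pattern.
var legacyNameRE = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// isValidMetricName is a hypothetical stand-in for the proposed two-argument
// validation function.
func isValidMetricName(name string, allowUTF8 bool) bool {
	if allowUTF8 {
		// Per the proposal, only the zero-length check remains.
		return len(name) > 0
	}
	return legacyNameRE.MatchString(name)
}

func main() {
	for _, n := range []string{"regular_metric", "my.dotted.metric", ""} {
		fmt.Printf("%q legacy=%v utf8=%v\n", n, isValidMetricName(n, false), isValidMetricName(n, true))
	}
}
```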

openmetrics_create.go and text_create.go will detect if the metric names require quoting, and if so, use the new syntax when creating the formatted text.
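The detection step might be sketched like this. It is not the code in openmetrics_create.go or text_create.go: `formatSeries` and `quote` are hypothetical, and the escaper covers only backslash, double quote, and newline. Names that match the legacy patterns keep the classic form; anything else is quoted and, for the metric name, moved inside the braces.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var (
	legacyName  = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)
	legacyLabel = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)
	escaper     = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)
)

// quote wraps s in double quotes using backslash escaping.
func quote(s string) string { return `"` + escaper.Replace(s) + `"` }

// formatSeries renders a series identifier, quoting only names that need it.
func formatSeries(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order

	prefix := name
	var terms []string
	if !legacyName.MatchString(name) {
		prefix = ""
		terms = append(terms, quote(name)) // quoted name goes first in the braces
	}
	for _, k := range keys {
		key := k
		if !legacyLabel.MatchString(k) {
			key = quote(k)
		}
		terms = append(terms, key+"="+quote(labels[k]))
	}
	if len(terms) == 0 {
		return prefix
	}
	return prefix + "{" + strings.Join(terms, ",") + "}"
}

func main() {
	fmt.Println(formatSeries("my.dotted.metric", map[string]string{"region": "east"}))
	fmt.Println(formatSeries("regular_metric", map[string]string{"http.status": "500"}))
}
```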

Other repositories that use this check to filter out "bad" metric names will have to make that filtering optional via a flag or argument.

### prometheus/client_golang

The golang library checks that metrics have a valid name when NewDesc is called. We can create a new parallel function, NewDescUTF8 (or similar) that will bypass the validity check.

### prometheus/prometheus

All support for UTF-8 will be behind a default-off configuration flag. Behavior will not be changed unless this option is on.

Prometheus itself needs updates to its lexers and parsers to support reading the new exposition syntax and PromQL syntax:

* openmetricslex (openmetrics exposition format)
* promlex (prometheus exposition format)
* generated_parser (PromQL parsing)
* parser.ts (validation for text input fields)

We will also have to bump the version number of tsdb and add code to handle the various querying patterns in the Mixed Block Scenario described above.

## Alternatives Considered

### Non-quoted metric names

`my.dotted.metric{label="value"}`

While this syntax could be made to work with simple additional characters like dots, it does not fulfill the goal of full UTF-8 support, which would include `{` or other special characters without a verbose and less-readable escaping mechanism, e.g.

`my.dotted.metric\{with\ braces\ and\ spaces\}{foo="bar"}`.

Versus our selected approach:

`{"my.dotted.metric{with braces and spaces}", foo="bar"}`

Selecting this approach would introduce a convenient syntax for only a small new subset of UTF-8, so a quoting mechanism would still be needed for the full character set. Rather than implement this half-improvement we decided to only support a full quoting approach.

Furthermore, the dot has been reserved for over a decade as a potential operator for PromQL with a likely forthcoming application in Native Histograms. Adding `.` as a valid character would solve some of the problems we face in compatibility at the cost of losing it as an operator.

### Quoted metric names outside curly braces

`"my.dotted.metric"{"label"="value"}`

While quoted strings are literals in PromQL, the combination of string+"{}" is currently an invalid syntax, so we could feasibly use this approach. There was concern that this syntax posed a greater threat of issues with existing queries in the wild because string literals are suddenly also instant vector selectors, so we decided to go with the safer approach of moving everything inside the curly braces.

See the [Discussion Doc](https://docs.google.com/document/d/1yFj5QSd1AgCYecZ9EJ8f2t4OgF2KBZgJYVde-uzVEtI/edit#heading=h.bpkufz5d0juo) for more exploration of this approach and why it was not selected.

### Backticked metric names outside the curly braces

``` `my.dotted.metric`{"label"="value"}```

Single, double, and backtick quotes are nearly equivalent in PromQL so this would just be the same as the above approach, or require changing the meaning of a particular quote character in a way that would break backward-compatibility. (Backticks prevent processing of escape sequences, which is otherwise enabled, see [PromQL documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/#string-literals) for details.)

### VictoriaMetrics approach

VictoriaMetrics has extended PromQL for UTF-8 Support, calling the updated language [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html):

> * Metric names and label names may contain any unicode letter. For example температура{город="Киев"} is a valid MetricsQL expression.
> * Metric names and labels names may contain escaped chars. For example, foo\-bar{baz\=aa="b"} is valid expression. It returns time series with name foo-bar containing label baz=aa with value b. Additionally, the following escape sequences are supported:
>   * \xXX, where XX is hexadecimal representation of the escaped ascii char.
>   * \uXXXX, where XXXX is a hexadecimal representation of the escaped unicode char.

By allowing non-escaped characters, this approach risks colliding with any new feature in PromQL that depends on a special character. Other characters will have to be escaped, resulting in many backslashes as in our own non-quoted alternative approach. Note that the \u approach also works in our proposal.

## Action Plan

1. Make necessary updates to common and client_golang, with flags / parameters as needed to preserve existing functionality.
2. Make updates to Prometheus and Grafana Agent.
3. Launch as experimental feature, default off.
4. Once stable and debugged, make UTF-8 the default (with fallback escaping for old clients always supported).
