Proposal Amendment: UTF-8 migration (commit f437109, ywwg, Nov 15, 2023)

This proposal expands upon the existing UTF-8 proposal, providing more detail on how we plan to handle the migration of metrics data, ensuring that it remains queryable during the transition period.

proposals/2023-11-13-utf8-migration.md
# UTF-8 Support for Metric and Label names

* **Owners:**
* `<@author: [email protected]>`

* **Implementation Status:** N/A

* **Related Issues and PRs:**
* [GH Issue](https://github.com/prometheus/prometheus/issues/12630)
* [PR](https://github.com/grafana/mimir-prometheus/pull/476) (TODO: needs to be rebased on upstream prom)

* **Other docs or links:**
* [Primary Proposal](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md)
* [Background Discussion / Justification](https://docs.google.com/document/d/1yFj5QSd1AgCYecZ9EJ8f2t4OgF2KBZgJYVde-uzVEtI/edit). Please read this document first for more information on the chosen solution.

> TL;DR: This is an amendment to the existing UTF-8 proposal that provides more detail on the backwards-compatibility and migration scenarios.

## Why

## Goals

* Allow queries to transparently read data from blocks generated by combinations of old and new versions of tsdb and scraping clients.
* Minimize edge cases where behavior is undefined or suboptimal or risks bad results.

### Audience

The audience for this amendment is users who are planning to migrate existing Prometheus deployments to add support for UTF-8 metric names and who want to ensure continuity in query behavior through the upgrade process.

## Non-Goals

We do not promise smooth accommodation of every edge case, especially pathological ones (see Name Collisions below).
In those instances, users may not be able to turn on UTF-8 support, or may need to rename metrics.

## How

Given a query for a UTF-8 metric name, the tsdb will look for that metric in on-disk blocks, whether those blocks were written in native UTF-8 or in either of two supported name-munging formats.
Those series will be located even when a single block has one metric written in more than one form.
The tsdb will know which versions of clients wrote those blocks based on a new entry in meta.json.

### Mixed-Format Scenarios

We must consider edge cases in which the database has persisted metrics written by different client versions. There are multiple ways this can (and will) happen:

* A newer client persists metrics to an older database version. In this case, metrics would be escaped with the U__ syntax. If the database is upgraded, newer blocks will be written in UTF-8.
* A newer database receives metrics from an older client, which is later upgraded. In this case, older metrics might be escaped using the replace-with-underscores method, and newer metrics will be UTF-8.
* A newer database receives metrics from a mix of new and old clients, in which case the same block could contain munged and UTF-8 data representing the same intended metric name.

At query time, there is a problem: some data may have been written in UTF-8 while other data was written with an escaping format.
The query code will not know which encoding to look for.
To ensure consistent querying, the backwards-compatibility design must account for these scenarios, making trade-offs when needed.

All of these situations can be summarized as follows:

1. Data written with old database code: all metric names are guaranteed not to be UTF-8.
2. Data written with new database code by new clients: all metric names guaranteed to be UTF-8-compatible.
3. Data written with new database code by one or more old clients (and possibly new clients as well): No guarantees, some names could be escaped, others not.

### Proposed Solution

To resolve this ambiguity, we first propose to bump the version number in the tsdb meta.json file. On a per-block basis, the query code can check the version number and know whether the data was written with an old version of the database code. This distinguishes the first case.

Secondly, we will add a new flag to meta.json that indicates the oldest client protocol version that was used to write data to this block.
This is useful for distinguishing the second case from the third.
If the oldest client version supports UTF-8, then all data in the block is UTF-8 compatible.
But if an older client contributed to the block, then data could be mixed.
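
As a sketch, the resulting meta.json additions might look like this (the field names and version value here are illustrative assumptions, not final):

```json
{
  "version": 2,
  "oldestClientSupportsUTF8": false
}
```

Under these assumed fields, an old-format `version` means the block was written entirely by old database code (case 1); the new `version` with the flag `true` means all-UTF-8 data (case 2); the new `version` with the flag `false` means the block may contain mixed names (case 3).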

Thankfully, during content negotiation, the write path knows whether the client doing the writing is capable of sending UTF-8 data.
If it is not, then we can mark that block as having an old client and the querying code will know the block falls into the third case and can look for mixed names.
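
The resulting per-block classification can be sketched as a small helper (hypothetical code, not actual Prometheus source; the version constant and flag mirror the assumed meta.json fields above):

```python
def block_name_format(tsdb_version: int, oldest_client_is_utf8: bool) -> str:
    """Classify a block by how metric names may be stored in it.

    tsdb_version: the version field from the block's meta.json.
    oldest_client_is_utf8: hypothetical meta.json flag recording whether
    every client that wrote to this block supported UTF-8 names.
    """
    NEW_TSDB_VERSION = 2  # assumed value of the bumped meta.json version
    if tsdb_version < NEW_TSDB_VERSION:
        return "escaped-only"  # case 1: no UTF-8 names possible
    if oldest_client_is_utf8:
        return "utf8-only"     # case 2: all names are native UTF-8
    return "mixed"             # case 3: names may be escaped or UTF-8
```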

### Query time

For the mixed-format scenarios, at query time, we will look for **all possible** escapings of a name in order to locate the correct data. We propose to do this by expanding a lookup for a UTF-8 metric name into a reasonable set of escapings:

1. UTF-8 (only if the tsdb version is newer)
2. underscore-replaced: All unsupported characters are converted to underscores.
3. U__ escaping: As described in the UTF-8 proposal, strings with invalid characters can be escaped by prepending `U__` and replacing all invalid characters with `_[UTF8 value]_`.
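
As an illustration, the two escapings could be implemented as follows (a minimal sketch based on the descriptions above; the authoritative rules, including corner cases such as literal underscores in the original name, are defined in the primary proposal):

```python
import re

# Characters legal in a classic (pre-UTF-8) metric name.
LEGAL = re.compile(r"[a-zA-Z0-9_:]")

def escape_underscores(name: str) -> str:
    """Replace every character not legal in a classic metric name with '_'."""
    return "".join(c if LEGAL.match(c) else "_" for c in name)

def escape_u(name: str) -> str:
    """U__ escaping: prefix 'U__' and replace each illegal character with
    its hex code point wrapped in underscores, e.g. '.' -> '_2E_'."""
    out = []
    for c in name:
        out.append(c if LEGAL.match(c) else f"_{ord(c):X}_")
    return "U__" + "".join(out)
```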

In PromQL, this would look something like:

User-generated query:

`{"my.utf8.metric", label="value"}`

Expanded query:

`{"__name__"=~"^my.utf8.metric$|^my_utf8_metric$|^U__my_2E_utf8_2E_metric$", label="value"}`

Alternatively, the expansion could be performed as three separate lookups whose results are combined in code:

`{"my.utf8.metric", label="value"}`
`{"my_utf8_metric", label="value"}`
`{"U__my_2E_utf8_2E_metric", label="value"}`

Either approach is cheap: because metric names are stored in the index, the additional name lookups are fast.
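
Building the expanded matcher can be sketched like this (a hypothetical helper; production code would also need to regex-escape the name variants, which the informal example above omits):

```python
def expand_name_matcher(name: str) -> str:
    """Expand a UTF-8 metric name into an anchored __name__ regex that
    matches the native name plus its two possible escapings."""
    underscored = "".join(c if c.isalnum() or c in "_:" else "_" for c in name)
    u_escaped = "U__" + "".join(
        c if c.isalnum() or c in "_:" else f"_{ord(c):X}_" for c in name
    )
    # Anchor each alternative, as in the expanded-query example above.
    return "|".join(f"^{v}$" for v in (name, underscored, u_escaped))
```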

### Regex lookups

If the user is querying for metrics using a regex lookup for the `__name__` label, attempting to rewrite that query to account for other name encodings would be overly complex and error-prone.
Therefore we will not try to rewrite the regex to account for multiple escaping methods and the regex will be passed through as-is.
Users will need to write custom regex queries to account for metric name changes during the transition period in this case.
Since regex queries on metrics names are relatively rare and the domain of advanced users, we feel this is an acceptable approach.

### Name Collisions

In most cases, we do not anticipate bad query results due to name collisions in the case where names are munged by an old client using the underscore method.
This is because collisions would occur at write time, when the colliding names are written to the database.
Any problems with collisions will occur well before a migration to UTF-8 support takes place.
Therefore, behavior in the presence of name collisions caused by underscore replacement is undefined.

Hypothetically, there could be collisions in the following situation:

1. A database has incoming metrics generated by an old client that munges names with underscores.
2. That database also has incoming metrics written in UTF-8 by a new client.
3. There is a UTF-8 metric name that collides with a similar metric sent by the old client.

For example, an old client is sending "service.name", and that is getting munged to "service_name" by that client at write time.
And then, a newer client is sending "service/name" as native UTF-8.
The error occurs when the user queries for "service/name": because an old client was writing to the same blocks as the new one, the query is expanded to also look for "service_name", and it accidentally matches the metrics meant for "service.name".
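
The collision can be seen directly from the underscore munging (sketched here with a hypothetical helper):

```python
def underscore(name: str) -> str:
    # Old-client munging: characters illegal in classic names become '_'.
    return "".join(c if c.isalnum() or c in "_:" else "_" for c in name)

# Both names munge to the same classic name, so an expanded query for
# "service/name" would also match old data written for "service.name".
assert underscore("service.name") == underscore("service/name") == "service_name"
```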

The short answer to avoiding this scenario is **don't do that**. Specifically: If possible, if there are any old clients present, do not construct metrics which could cause collisions; and if that is unavoidable, don't mix old and new clients together.

As long as all the clients are new, users do not need to worry about collisions -- "service.name" and "service/name" will be stored separately and the queries will never have to be expanded to include the munged "service_name" possibility.

This situation seems contrived enough that we are comfortable not supporting it.

## Discarded Approaches

### Rewrite Old Data

We could have required that users rewrite their tsdb blocks to "upgrade" them to UTF-8 and undo the munging.
This approach seems tedious, difficult, and dangerous -- what if something goes wrong during rewriting?
Requiring massive data rewrites is not a reasonable ask of users.

### Lookup Table / Per-Metric Config

We considered recording a lookup table or per-metric configuration that would describe how UTF-8 metrics might be stored in old data blocks.
This approach would be faster than doing query expansion, but would create extra operational overhead -- lookup tables would have to be correct and exhaustive.

Because metric names are stored in the index, query expansion is not expensive enough to justify the extra operational overhead.

### No Migration -- Write Both Versions

We very briefly considered having the tsdb write all forms of a metric name (native and munged), if the user configured it to do so.
That way queries for both the native UTF-8 name and the munged name would succeed.
When the migration was complete, users could turn off double-writing and only write UTF-8.

This approach would cause an explosion of on-disk usage.
As disk is one of the most expensive resources, this approach was quickly discarded.
