Histogram prometheus metrics unusable in production due to cardinality explosion #64962

kevinkokomani · 2021-05-10T20:14:05Z

Describe the problem

There was an explosion of metrics from CRDB, causing a user to disable all histogram metrics as a result. Because histograms are not composable, every one of these metrics results in a new metric being created. And over time the binning changes, too, which further compounds the problem.

To Reproduce

Example of the issue:

# HELP exec_latency Latency of batch KV requests executed on this node
# TYPE exec_latency histogram
exec_latency_bucket{le="6143"} 1
exec_latency_bucket{le="6911"} 2
exec_latency_bucket{le="7167"} 20
exec_latency_bucket{le="7423"} 85
exec_latency_bucket{le="7679"} 247
exec_latency_bucket{le="7935"} 549
exec_latency_bucket{le="8191"} 1089
exec_latency_bucket{le="8703"} 3069
exec_latency_bucket{le="9215"} 6347
exec_latency_bucket{le="9727"} 10810
exec_latency_bucket{le="10239"} 15774
exec_latency_bucket{le="10751"} 20843
exec_latency_bucket{le="11263"} 25490
exec_latency_bucket{le="11775"} 29693

This represents the bucket distribution for a single node. Refreshing the stats page results in a different node with different binning for metrics.

# HELP exec_latency Latency of batch KV requests executed on this node
# TYPE exec_latency histogram
exec_latency_bucket{le="6655"} 1
exec_latency_bucket{le="6911"} 7
exec_latency_bucket{le="7167"} 41
exec_latency_bucket{le="7423"} 121
exec_latency_bucket{le="7679"} 304
exec_latency_bucket{le="7935"} 632
exec_latency_bucket{le="8191"} 1233
exec_latency_bucket{le="8703"} 3246
exec_latency_bucket{le="9215"} 6443
exec_latency_bucket{le="9727"} 10377
exec_latency_bucket{le="10239"} 14718
exec_latency_bucket{le="10751"} 19293
exec_latency_bucket{le="11263"} 23635
exec_latency_bucket{le="11775"} 27739
exec_latency_bucket{le="12287"} 31652
exec_latency_bucket{le="12799"} 35280

Expected behavior

Not a new metric to be created every time, and instead using log-linear histograms so that buckets are composable. Suggested links: https://github.com/openhistogram/circonusllhist & https://github.com/circonus-labs/circllhist-paper/blob/master/circllhist.pdf

CRDB versions: 20.2.x

Impact

Unable to monitor or alert on crdb with any metric that derives from a histogram, which is a large blindspot.

gz#8409

Jira issue: CRDB-7362

The text was updated successfully, but these errors were encountered:

dhartunian · 2021-05-10T22:02:51Z

Collecting code trail of where the histogram is defined:

Metric defn: https://github.com/cockroachdb/cockroach/blob/master/pkg/server/node.go#L72-L77
Metrics impl: https://github.com/cockroachdb/cockroach/blob/master/pkg/server/node.go#L125
Histogram impl: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/metric.go#L199-L212

I think maybe a quick prototype of replacing the hdrhistogram implementation with the circonus one mentioned in the issue description should give us a sense of whether this will work as a drop-in replacement.

ajwerner · 2021-05-10T22:17:51Z

I'm pretty certain HDR histograms indeed are log-linear and are composable. The prometheus code just omits buckets which have the same value as the previous one and also empty buckets below the min. Encoding histograms this way is the prometheus norm. I feel like I'm also unsure of what you mean in terms of composability. See https://prometheus.io/docs/concepts/metric_types/#histogram.

I feel like maybe you're complaining about the configuration of our histograms, which, most of the time, is pretty dumb. It affords 1 sigFig all the way down to nanoseconds.

ajwerner · 2021-05-10T22:34:07Z

To clarify, I believe our histograms may have as many as 496 buckets, which is totally crazy!

https://play.golang.org/p/A3tUzhT_4cW

Given https://github.com/HdrHistogram/hdrhistogram-go/blob/master/hdr.go#L166

Note that our choice of max value and sig figs seems stupid in retrospect for many sql level things (min value of 0, and max value of 10s).

See also #36543.

FWIW I absolutely love this topic. Doing anything at all here is going to cause real migration pains and is going to be hard. I do not think this is a solved problem in the industry. I think we could consider migrating to better settings on an HDR histogram but we'll need to think very carefully about it.

In the much longer term, I dream of something like tdunning/t-digest#127 (comment)

Our current histogram is based on `hdrhistogram`. This tends to create lots of buckets and is inflexible w.r.t the bucket layout. In hindsight, it was a poor choice of default implementation (I can say that since I introduced it) and its cost is disincentivizing the introduction of histograms that would be useful. This commit introduces a histogram that is based on a completely vanilla `prometheus.Histogram`. The only reason we need to wrap it is because we want to export quantiles to CockraochDB's internal timeseries (it does not support histograms) and this requires maintaining an internal windowed histogram (on top of the cumulative histogram). With this done, we can now introduce metrics with any kind of buckets we want. Helpfully, we introduce two common kinds of buckets, suitable for IO-type and RPC-type latencies. These are defined in a human-readable format by explicitly listing out the buckets. We can move existing metrics to HistogramV2 easily, assuming we are not concerned with existing prometheus scrapers getting tripped up by the changes in bucket boundaries. I assume this is not a big deal in practice as long as it doesn't happen "all the time". In fact, I would strongly suggest we move all metrics wholesale and remove the hdrhistogram-based implementation. If this is not acceptable for some reason, we ought to at least deprecated it. We also slightly improve the existing `Histogram` code by unifying how the windowed histograms are ticked and by making explicit where their quantiles are recorded (this dependency was previously hidden via a local interface assertion). Resolves cockroachdb#10015. Resolves cockroachdb#64962. Alternative to dhartunian@eac3d06 TODO - export quantiles (how to do this? Not clear, don't see the code laying around in prometheus, might have to hand-roll it but should be easy enough) Release note: None

tbg · 2022-06-02T12:17:45Z

@lunevalex I am not working on this - #81181 is something I have handed off to obs-inf for picking up.

@nkodali with #82200 we're going to add a bunch of histograms so we will likely have to get #81181 in before the end of the release, or our customers are going to see a ton more timeseries than before which may cause issues.

tbg · 2022-06-03T07:37:15Z

Another thing, I was now also personally hit by the high cardinality here. I wanted to have nice latency heatmaps as per this article but it actually crashes my browser (!) when I try to put it in Grafana. Plotting quantiles works though.

Our current histogram is based on `hdrhistogram`. This tends to create lots of buckets and is inflexible w.r.t the bucket layout. In hindsight, it was a poor choice of default implementation (I can say that since I introduced it) and its cost is disincentivizing the introduction of histograms that would be useful. This commit introduces a histogram that is based on a completely vanilla `prometheus.Histogram`. The only reason we need to wrap it is because we want to export quantiles to CockraochDB's internal timeseries (it does not support histograms) and this requires maintaining an internal windowed histogram (on top of the cumulative histogram). With this done, we can now introduce metrics with any kind of buckets we want. Helpfully, we introduce two common kinds of buckets, suitable for IO-type and RPC-type latencies. These are defined in a human-readable format by explicitly listing out the buckets. We can move existing metrics to HistogramV2 easily, assuming we are not concerned with existing prometheus scrapers getting tripped up by the changes in bucket boundaries. I assume this is not a big deal in practice as long as it doesn't happen "all the time". In fact, I would strongly suggest we move all metrics wholesale and remove the hdrhistogram-based implementation. If this is not acceptable for some reason, we ought to at least deprecated it. We also slightly improve the existing `Histogram` code by unifying how the windowed histograms are ticked and by making explicit where their quantiles are recorded (this dependency was previously hidden via a local interface assertion). Resolves cockroachdb#10015. Resolves cockroachdb#64962. Alternative to dhartunian@eac3d06 TODO - export quantiles (how to do this? Not clear, don't see the code laying around in prometheus, might have to hand-roll it but should be easy enough) Release note: None

85990: metric: add prometheus-based histogram r=aadityasondhi a=aadityasondhi From #81181: Our current histogram is based on `hdrhistogram`. This tends to create lots of buckets and is inflexible w.r.t the bucket layout. In hindsight, it was a poor choice of default implementation (I can say that since I introduced it) and its cost is disincentivizing the introduction of histograms that would be useful. This commit introduces a histogram that is based on a completely vanilla `prometheus.Histogram`. The only reason we need to wrap it is because we want to export quantiles to CockraochDB's internal timeseries (it does not support histograms) and this requires maintaining an internal windowed histogram (on top of the cumulative histogram). With this done, we can now introduce metrics with any kind of buckets we want. Helpfully, we introduce two common kinds of buckets, suitable for IO-type and RPC-type latencies. These are defined in a human-readable format by explicitly listing out the buckets. We can move existing metrics to HistogramV2 easily, assuming we are not concerned with existing prometheus scrapers getting tripped up by the changes in bucket boundaries. I assume this is not a big deal in practice as long as it doesn't happen "all the time". In fact, I would strongly suggest we move all metrics wholesale and remove the hdrhistogram-based implementation. If this is not acceptable for some reason, we ought to at least deprecated it. We also slightly improve the existing `Histogram` code by unifying how the windowed histograms are ticked and by making explicit where their quantiles are recorded (this dependency was previously hidden via a local interface assertion). Resolves #10015. Resolves #64962. Alternative to dhartunian@eac3d06 Release justification: low risk, high benefit changes 86007: kvserver: log traces from replicate queue on errors or slow processing r=andreimatei a=AlexTalks While we previously had some logging from the replicate queue as a result of the standard queue logging, this change adds logging to the replicate queue when there are errors in processing a replica, or when processing a replica exceeds a 50% of the timeout duration. When there are errors or the duration threshold is exceeded, any error messages are logged along with the collected tracing spans from the operation. Release note (ops change): Added logging on replicate queue processing in the presence of errors or when the duration exceeds 50% of the timeout. Release justification: Low risk observability change. 86255: upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references r=Xiang-Gu a=Xiang-Gu commit 1: a small refactoring of existing function commit 2: added a field to TenantDeps commit 3: added a cluster version upgrade that attempts to update invalid column ID in seq's back references due to bugs in prior versions. See below for details commit 4: wrote a test for this newly added upgrade Previously, in version 21.1 and prior, `ADD COLUMN DEFAULT nextval('s')` will incorrectly store a 0-valued column id in the sequence 's' back reference `dependedOnBy` because we added this dependency before allocating an ID to the newly added column. Customers ran into issues when upgrading to v22.1 so we relaxed the validation logic as a quick short-term fix, as detailed in #82859. Now we want to give this issue a more proper treatment by adding a cluster upgrade (to v22.2) that attempts to detect such invalid column ID issues and update them with the correct column ID. This PR does exactly this. Fixes: #83124 Release note: None Release justification: fixed a release blocker that will resolve invalid column ID appearance in sequence's back referenced, caused by bugs in older binaries. Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: Alex Sarkesian <[email protected]> Co-authored-by: Xiang Gu <[email protected]>

Our current histogram is based on `hdrhistogram`. This tends to create lots of buckets and is inflexible w.r.t the bucket layout. In hindsight, it was a poor choice of default implementation (I can say that since I introduced it) and its cost is disincentivizing the introduction of histograms that would be useful. This commit introduces a histogram that is based on a completely vanilla `prometheus.Histogram`. The only reason we need to wrap it is because we want to export quantiles to CockraochDB's internal timeseries (it does not support histograms) and this requires maintaining an internal windowed histogram (on top of the cumulative histogram). With this done, we can now introduce metrics with any kind of buckets we want. Helpfully, we introduce two common kinds of buckets, suitable for IO-type and RPC-type latencies. These are defined in a human-readable format by explicitly listing out the buckets. We can move existing metrics to HistogramV2 easily, assuming we are not concerned with existing prometheus scrapers getting tripped up by the changes in bucket boundaries. I assume this is not a big deal in practice as long as it doesn't happen "all the time". In fact, I would strongly suggest we move all metrics wholesale and remove the hdrhistogram-based implementation. If this is not acceptable for some reason, we ought to at least deprecated it. We also slightly improve the existing `Histogram` code by unifying how the windowed histograms are ticked and by making explicit where their quantiles are recorded (this dependency was previously hidden via a local interface assertion). Resolves cockroachdb#10015. Resolves cockroachdb#64962. Alternative to dhartunian@eac3d06 TODO - export quantiles (how to do this? Not clear, don't see the code laying around in prometheus, might have to hand-roll it but should be easy enough) Release note: None

kevinkokomani added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels May 10, 2021

kevinkokomani assigned dhartunian May 10, 2021

exalate-issue-sync bot removed the T-observability-inf label May 10, 2021

ajwerner mentioned this issue May 11, 2021

metrics: wanted: histograms for skewed distributions #10015

Closed

jlinder added the T-observability-inf label Jun 16, 2021

tbg mentioned this issue May 11, 2022

metric: add prometheus-based histogram #81181

Closed

lunevalex assigned tbg and unassigned dhartunian May 31, 2022

lunevalex added T-kv-replication and removed T-observability-inf labels May 31, 2022

tbg removed their assignment Jun 2, 2022

exalate-issue-sync bot assigned tbg and unassigned tbg Jun 2, 2022

tbg removed their assignment Jun 3, 2022

exalate-issue-sync bot added T-observability-inf and removed T-kv-replication labels Jun 3, 2022

exalate-issue-sync bot assigned dhartunian Jun 21, 2022

aadityasondhi mentioned this issue Aug 11, 2022

metric: add prometheus-based histogram #85990

Merged

aadityasondhi self-assigned this Aug 11, 2022

exalate-issue-sync bot unassigned dhartunian Aug 11, 2022

craig bot closed this as completed in eb82a43 Aug 26, 2022

tbg mentioned this issue Oct 10, 2022

metrics: expose fsync latency histogram #89485

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Histogram prometheus metrics unusable in production due to cardinality explosion #64962

Histogram prometheus metrics unusable in production due to cardinality explosion #64962

kevinkokomani commented May 10, 2021 •

edited by cockroach-jira-scripts

Loading

dhartunian commented May 10, 2021

ajwerner commented May 10, 2021

ajwerner commented May 10, 2021

tbg commented Jun 2, 2022 •

edited

Loading

tbg commented Jun 3, 2022 •

edited

Loading

Histogram prometheus metrics unusable in production due to cardinality explosion #64962

Histogram prometheus metrics unusable in production due to cardinality explosion #64962

Comments

kevinkokomani commented May 10, 2021 • edited by cockroach-jira-scripts Loading

dhartunian commented May 10, 2021

ajwerner commented May 10, 2021

ajwerner commented May 10, 2021

tbg commented Jun 2, 2022 • edited Loading

tbg commented Jun 3, 2022 • edited Loading

kevinkokomani commented May 10, 2021 •

edited by cockroach-jira-scripts

Loading

tbg commented Jun 2, 2022 •

edited

Loading

tbg commented Jun 3, 2022 •

edited

Loading