
Histogram Prometheus metrics unusable in production due to cardinality explosion #64962

Closed
kevinkokomani opened this issue May 10, 2021 · 5 comments · Fixed by #85990
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@kevinkokomani
Contributor

kevinkokomani commented May 10, 2021

Describe the problem

There was an explosion of metrics from CRDB, causing a user to disable all histogram metrics as a result. Because the histograms are not composable, every new bucket boundary results in a new metric series being created, and the binning also changes over time, which further compounds the problem.

To Reproduce

Example of the issue:

# HELP exec_latency Latency of batch KV requests executed on this node
# TYPE exec_latency histogram
exec_latency_bucket{le="6143"} 1
exec_latency_bucket{le="6911"} 2
exec_latency_bucket{le="7167"} 20
exec_latency_bucket{le="7423"} 85
exec_latency_bucket{le="7679"} 247
exec_latency_bucket{le="7935"} 549
exec_latency_bucket{le="8191"} 1089
exec_latency_bucket{le="8703"} 3069
exec_latency_bucket{le="9215"} 6347
exec_latency_bucket{le="9727"} 10810
exec_latency_bucket{le="10239"} 15774
exec_latency_bucket{le="10751"} 20843
exec_latency_bucket{le="11263"} 25490
exec_latency_bucket{le="11775"} 29693

This shows the bucket distribution for a single node. Refreshing the stats page returns a different node with different binning for the same metrics.

# HELP exec_latency Latency of batch KV requests executed on this node
# TYPE exec_latency histogram
exec_latency_bucket{le="6655"} 1
exec_latency_bucket{le="6911"} 7
exec_latency_bucket{le="7167"} 41
exec_latency_bucket{le="7423"} 121
exec_latency_bucket{le="7679"} 304
exec_latency_bucket{le="7935"} 632
exec_latency_bucket{le="8191"} 1233
exec_latency_bucket{le="8703"} 3246
exec_latency_bucket{le="9215"} 6443
exec_latency_bucket{le="9727"} 10377
exec_latency_bucket{le="10239"} 14718
exec_latency_bucket{le="10751"} 19293
exec_latency_bucket{le="11263"} 23635
exec_latency_bucket{le="11775"} 27739
exec_latency_bucket{le="12287"} 31652
exec_latency_bucket{le="12799"} 35280

Expected behavior

A new metric should not be created every time; instead, log-linear histograms should be used so that buckets are composable. Suggested links: https://github.com/openhistogram/circonusllhist & https://github.com/circonus-labs/circllhist-paper/blob/master/circllhist.pdf
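
To illustrate why log-linear buckets compose: the bucket a value falls into depends only on the value itself, never on what else has been observed, so every node produces the same boundaries and histograms merge by simply adding counts. Below is a minimal Go sketch of circllhist-style bucketing (values reduced to two significant decimal digits times a power of ten); it illustrates the scheme from the paper linked above and is not the library's actual code.

package main

import (
	"fmt"
	"math"
)

// llBucket returns the lower bound of the circllhist-style log-linear
// bucket containing v: the value is reduced to two significant decimal
// digits times a power of ten, so boundaries are fixed and identical
// on every node regardless of the data observed.
func llBucket(v float64) float64 {
	if v == 0 {
		return 0
	}
	exp := math.Floor(math.Log10(math.Abs(v)))
	mant := math.Trunc(v / math.Pow(10, exp-1)) // two significant digits
	return mant * math.Pow(10, exp-1)
}

func main() {
	// The same values land in the same buckets on any node, so merging
	// two histograms is just summing the counts per bucket.
	a := map[float64]uint64{}
	b := map[float64]uint64{}
	for _, v := range []float64{6143, 6911, 7167} {
		a[llBucket(v)]++
	}
	for _, v := range []float64{6655, 7423, 7167} {
		b[llBucket(v)]++
	}
	merged := map[float64]uint64{}
	for k, c := range a {
		merged[k] += c
	}
	for k, c := range b {
		merged[k] += c
	}
	fmt.Println(merged) // map[6100:1 6600:1 6900:1 7100:2 7400:1]
}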

CRDB versions: 20.2.x

Impact

Unable to monitor or alert on CRDB with any metric that derives from a histogram, which is a large blind spot.

gz#8409

Jira issue: CRDB-7362

@kevinkokomani kevinkokomani added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels May 10, 2021
@dhartunian
Collaborator

Collecting the code trail of where the histogram is defined:

Metric defn: https://github.com/cockroachdb/cockroach/blob/master/pkg/server/node.go#L72-L77
Metrics impl: https://github.com/cockroachdb/cockroach/blob/master/pkg/server/node.go#L125
Histogram impl: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/metric.go#L199-L212

I think a quick prototype that replaces the hdrhistogram implementation with the Circonus one mentioned in the issue description should give us a sense of whether it will work as a drop-in replacement.
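
For what a first pass might look like, here is a rough sketch of recording and merging with circonusllhist; the New/RecordValue/Merge/ValueAtQuantile calls are my recollection of that library's API and should be treated as assumptions to verify against its README.

package main

import (
	"fmt"

	"github.com/openhistogram/circonusllhist"
)

func main() {
	// Record latencies (in seconds) on two "nodes" independently.
	h1 := circonusllhist.New()
	h2 := circonusllhist.New()
	for _, v := range []float64{0.0061, 0.0071, 0.0079} {
		_ = h1.RecordValue(v)
	}
	for _, v := range []float64{0.0066, 0.0074} {
		_ = h2.RecordValue(v)
	}

	// Buckets are fixed log-linear, so merging across nodes is lossless.
	h1.Merge(h2)
	fmt.Println(h1.ValueAtQuantile(0.99))
}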

@ajwerner
Contributor

I'm pretty certain HDR histograms are indeed log-linear and composable. The prometheus code just omits buckets which have the same value as the previous one, as well as empty buckets below the min; encoding histograms this way is the prometheus norm. I'm also unsure what you mean in terms of composability. See https://prometheus.io/docs/concepts/metric_types/#histogram.
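
For reference, a small sketch decoding the cumulative `le` encoding used in the samples above: per-bucket counts are the deltas between cumulative counts, and a bucket whose cumulative count equals its predecessor's is empty, which is why the exporter can omit it without losing information.

package main

import "fmt"

func main() {
	// Cumulative counts from the first exec_latency sample in this issue.
	les := []float64{6143, 6911, 7167, 7423, 7679}
	cum := []uint64{1, 2, 20, 85, 247}

	// Per-bucket counts are the deltas between cumulative counts.
	var prev uint64
	for i, le := range les {
		fmt.Printf("le=%g: +%d\n", le, cum[i]-prev)
		prev = cum[i]
	}
}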

I feel like maybe you're really complaining about the configuration of our histograms, which, most of the time, is pretty dumb: it affords 1 sigFig all the way down to nanoseconds.

@ajwerner
Contributor

To clarify, I believe our histograms may have as many as 496 buckets, which is totally crazy!

https://play.golang.org/p/A3tUzhT_4cW

Given https://github.com/HdrHistogram/hdrhistogram-go/blob/master/hdr.go#L166

Note that our choice of max value and sig figs seems stupid in retrospect for many SQL-level things (min value of 0, and max value of 10s).
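
For concreteness, a back-of-the-envelope sketch that mirrors the sizing logic at the hdr.go line linked above and reproduces the 496 figure; the 1 sig fig and 10s-in-nanoseconds max come from this discussion, and the rest is an approximation of the library's computation, not a copy of it.

package main

import (
	"fmt"
	"math"
)

func main() {
	// Mirrors hdrhistogram-go's sizing logic: with sigfigs significant
	// figures, each power-of-two range is split into subBucketCount
	// linear sub-buckets, and we need enough doublings to cover max.
	sigfigs := 1
	max := int64(10 * 1e9) // 10s in nanoseconds

	largest := 2 * math.Pow10(sigfigs)
	subBucketCount := int64(math.Pow(2, math.Ceil(math.Log2(largest))))

	buckets := int64(1)
	for smallest := subBucketCount; smallest < max; smallest <<= 1 {
		buckets++
	}
	fmt.Println((buckets + 1) * subBucketCount / 2) // 496
}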

See also #36543.

FWIW I absolutely love this topic. Doing anything at all here is going to cause real migration pains and is going to be hard. I do not think this is a solved problem in the industry. I think we could consider migrating to better settings on an HDR histogram but we'll need to think very carefully about it.

In the much longer term, I dream of something like tdunning/t-digest#127 (comment)

tbg added a commit to tbg/cockroach that referenced this issue May 11, 2022
Our current histogram is based on `hdrhistogram`. This tends to create
lots of buckets and is inflexible w.r.t. the bucket layout. In hindsight,
it was a poor choice of default implementation (I can say that since I
introduced it), and its cost disincentivizes the introduction of
histograms that would be useful.

This commit introduces a histogram that is based on a completely vanilla
`prometheus.Histogram`. The only reason we need to wrap it is that we
want to export quantiles to CockroachDB's internal timeseries (which does
not support histograms), and this requires maintaining an internal windowed
histogram (on top of the cumulative histogram).

With this done, we can now introduce metrics with any kind of buckets we
want. Helpfully, we introduce two common kinds of buckets, suitable for
IO-type and RPC-type latencies. These are defined in a human-readable
format by explicitly listing out the buckets.

We can move existing metrics to HistogramV2 easily, assuming we are not
concerned with existing prometheus scrapers getting tripped up by the
changes in bucket boundaries. I assume this is not a big deal in
practice as long as it doesn't happen "all the time". In fact, I would
strongly suggest we move all metrics wholesale and remove the
hdrhistogram-based implementation. If this is not acceptable for some
reason, we ought to at least deprecate it.

We also slightly improve the existing `Histogram` code by unifying how
the windowed histograms are ticked and by making explicit where their
quantiles are recorded (this dependency was previously hidden via a
local interface assertion).

Resolves cockroachdb#10015.
Resolves cockroachdb#64962.
Alternative to dhartunian@eac3d06

TODO
- export quantiles (how to do this? Not clear; I don't see the code
  lying around in prometheus, might have to hand-roll it, but it should
  be easy enough)

Release note: None
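
To illustrate the explicit-bucket approach this commit message describes, here is a minimal sketch using a vanilla prometheus.Histogram; the metric name and boundaries are made up for the example and are not the ones the commit actually adds.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// A vanilla prometheus.Histogram with a fixed, human-readable
	// bucket layout. Every node exports exactly these boundaries, so
	// scrapes are stable and aggregation across nodes is safe.
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "exec_latency_seconds",
		Help: "Latency of batch KV requests executed on this node",
		Buckets: []float64{
			0.0005, 0.001, 0.0025, 0.005, 0.01,
			0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
		},
	})
	h.Observe(0.0123)
	fmt.Println(h.Desc())
}

With fixed boundaries like these, every scrape exports the same `le` labels, so the per-metric series count stays bounded regardless of the observed data.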
@lunevalex lunevalex assigned tbg and unassigned dhartunian May 31, 2022
@tbg tbg removed their assignment Jun 2, 2022
@tbg
Member

tbg commented Jun 2, 2022

@lunevalex I am not working on this - #81181 is something I have handed off to obs-inf for picking up.

@nkodali with #82200 we're going to add a bunch of histograms, so we will likely have to get #81181 in before the end of the release, or our customers are going to see a ton more timeseries than before, which may cause issues.

@exalate-issue-sync exalate-issue-sync bot assigned tbg and unassigned tbg Jun 2, 2022
@tbg tbg removed their assignment Jun 3, 2022
@tbg
Member

tbg commented Jun 3, 2022

Another thing: I was now also personally hit by the high cardinality here. I wanted to have nice latency heatmaps as per this article, but it actually crashes my browser (!) when I try to put it in Grafana. Plotting quantiles works, though.


aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Aug 11, 2022
@aadityasondhi aadityasondhi self-assigned this Aug 11, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Aug 14, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Aug 16, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Aug 19, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Aug 24, 2022
craig bot pushed a commit that referenced this issue Aug 26, 2022
85990: metric: add prometheus-based histogram r=aadityasondhi a=aadityasondhi

From #81181 (the commit message is quoted in full above).

Release justification: low risk, high benefit changes

86007: kvserver: log traces from replicate queue on errors or slow processing r=andreimatei a=AlexTalks

While we previously had some logging from the replicate queue as a
result of the standard queue logging, this change adds logging to the
replicate queue when there are errors in processing a replica, or when
processing a replica exceeds 50% of the timeout duration.
When there are errors or the duration threshold is exceeded,
any error messages are logged along with the collected tracing spans
from the operation.

Release note (ops change): Added logging on replicate queue processing
in the presence of errors or when the duration exceeds 50% of the
timeout.

Release justification: Low risk observability change.

86255: upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references r=Xiang-Gu a=Xiang-Gu

commit 1: a small refactoring of an existing function
commit 2: added a field to TenantDeps
commit 3: added a cluster version upgrade that attempts to update
invalid column IDs in sequences' back references caused by bugs in prior
versions. See below for details.
commit 4: wrote a test for this newly added upgrade

Previously, in version 21.1 and prior, `ADD COLUMN DEFAULT nextval('s')`
would incorrectly store a 0-valued column ID in the sequence 's' back
reference `dependedOnBy`, because we added this dependency before
allocating an ID to the newly added column. Customers ran into issues
when upgrading to v22.1, so we relaxed the validation logic as a quick
short-term fix, as detailed in #82859.

Now we want to give this issue a more proper treatment by adding a
cluster upgrade (to v22.2) that attempts to detect such invalid column
IDs and update them to the correct column ID. This PR does exactly that.

Fixes: #83124

Release note: None

Release justification: fixes a release blocker by resolving invalid
column IDs appearing in sequences' back references, caused by bugs in
older binaries.

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
@craig craig bot closed this as completed in eb82a43 Aug 26, 2022
aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue Aug 26, 2022