
metric: changes in histogram buckets affect the connection latency graph in db console #95833

Closed
ZhouXing19 opened this issue Jan 25, 2023 · 6 comments · Fixed by #96029
Labels
A-observability-inf, branch-release-22.2, C-bug, C-technical-advisory

Comments

@ZhouXing19
Collaborator

ZhouXing19 commented Jan 25, 2023

While investigating #95627, we noticed that the p90 connection latency from a local client to a server on roachprod jumped from ~230ms to ~500ms since a82aa82.

Before:
Screenshot 2023-01-24 at 18 42 26

After:
Screenshot 2023-01-24 at 18 43 14

Meanwhile, the client latency (i.e. the time for the query to finish, measured from the client side) is the same in both cases, ~500ms.

How to repro

  1. Build the Linux binaries before and after commit a82aa82 (call them cockroach-pre and cockroach-post).
  2. Create 2 roachprod clusters (i.e. $USER-test-pre and $USER-test-post).
  3. Use this script for the testing (i.e. run ./prepare.sh pre or ./prepare.sh post). Note that you'll need to create the add_user.sql file as well, as specified here.
  4. Once the connections start being established (i.e. you can see ----- round 1 ----- in the terminal stdout), go to the db console and check the connection latency graph under the dist SQL section.

Jira issue: CRDB-23769

@ZhouXing19 ZhouXing19 added the C-bug label Jan 25, 2023
@thtruo
Contributor

thtruo commented Jan 25, 2023

cc @abarganier, assigning this to you since you're currently investigating it.

@abarganier
Contributor

Okay, here's the scoop.

This is caused by switching from the older hdrhistogram, which dynamically created its own buckets based on maxVal and significant-figures parameters, to Prometheus-based histograms, where the histogram buckets are explicitly defined, such as: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/histogram_buckets.go#L17-L34

These buckets are not granular enough, and their value ranges don't make sense for measuring SQL latencies. This is the cause of the misleading latency values: we have less precision, so these ~200ms latencies end up in the ~517ms bucket.
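To make the precision difference concrete, here is a small standalone Go sketch (illustrative only, not CockroachDB code; the max-value and significant-figures parameters are assumptions) contrasting how an HdrHistogram resolves ~230ms samples with what a fixed-bucket histogram can report when its nearest bucket boundaries are ~193ms and ~518ms:

```go
package main

import (
	"fmt"
	"time"

	hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

func main() {
	// HdrHistogram derives its buckets from the max trackable value and the
	// requested number of significant figures (both values here are
	// illustrative assumptions, not CockroachDB's actual parameters).
	h := hdrhistogram.New(1, time.Minute.Nanoseconds(), 2 /* significant figures */)
	for i := 0; i < 1000; i++ {
		_ = h.RecordValue((230 * time.Millisecond).Nanoseconds())
	}
	fmt.Println("hdr p90:", time.Duration(h.ValueAtQuantile(90))) // ~230ms

	// A fixed-bucket histogram only records that every sample fell into the
	// single bucket (193.069772ms, 517.947467ms], so any percentile it
	// reports has to be reconstructed from those two boundaries rather than
	// from the underlying ~230ms values.
	fmt.Println("fixed-bucket p90: somewhere in (193ms, 518ms]")
}
```

With the fixed buckets, the histogram only knows how many samples fell between the two boundaries, so any percentile it reports is reconstructed from those boundaries rather than from the actual values observed.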

We're going to backport a revert of #86671 into 22.2. This allows us to maintain the code for the newer histogram while reverting the metrics themselves to use the older histograms.

This will allow us to fix the issue as fast as possible for customers without derailing the team's other initiatives. Once we finish delivering on our outstanding epics for 23.1, we can do a proper fix. A proper fix involves A) defining more appropriate & granular histogram buckets, and B) creating a cluster setting that allows users to optionally enable the new histograms (disabled by default).

We'll also be testing out the impact on continuity here. If customers have been using Prometheus on v22.2 for the past month (for example), what's the impact on Prometheus of the buckets suddenly changing back to the older form? We seek to define that alongside our fix.

We'll also provide a technical advisory and documentation for TSE on this issue.

@irfansharif
Contributor

> This allows us to maintain the code for the newer histogram while reverting the metrics themselves to use the older histograms.

Will the revert undo the benefits of #85990, which drastically reduced the bucket count for most histograms? The wholesale revert feels very heavy-handed to me, especially for a backport, given what we've spelled out for proper fixes (defining more appropriate bucket boundaries, perhaps even narrowly targeting the connection latency graph).

Above you linked to the IO latency buckets, but for this connection latency histogram, aren't we using the network latency buckets defined here?

```go
var NetworkLatencyBuckets = []float64{
	// Generated via TestHistogramBuckets/NetworkLatencyBuckets.
	500000.000000,    // 500µs
	860513.842995,    // 860.513µs
	1480968.147973,   // 1.480968ms
	2548787.184731,   // 2.548787ms
	4386533.310619,   // 4.386533ms
	7549345.273094,   // 7.549345ms
	12992632.226094,  // 12.992632ms
	22360679.774998,  // 22.360679ms
	38483348.970335,  // 38.483348ms
	66230909.027573,  // 66.230909ms
	113985228.104760, // 113.985228ms
	196171733.362212, // 196.171733ms
	337616984.325077, // 337.616984ms
	581048177.284016, // 581.048177ms
	999999999.999999, // 999.999999ms
}
```

So we have bucket boundaries [196.171733ms, 337.616984ms, 581.048177ms], which we interpolate within when computing various percentiles:

```go
// Calculate the linearly interpolated value within the bucket.
if b > 0 {
	bucketStart = *buckets[b-1].UpperBound
	count -= *buckets[b-1].CumulativeCount
	rank -= *buckets[b-1].CumulativeCount
}
val := bucketStart + (bucketEnd-bucketStart)*(float64(rank)/float64(count))
if math.IsNaN(val) || math.IsInf(val, -1) {
	return 0
}
```

So I'm a bit confused about how, with those bucket boundaries and that linear interpolation code, we could have "~200ms latencies [..] ending up in the ~517ms bucket".

@irfansharif
Contributor

Oh I was looking at the wrong thing, sorry for the noise. We are indeed using the IOLatencyBuckets here.

```go
ConnLatency: metric.NewHistogram(
	MetaConnLatency, histogramWindow, metric.IOLatencyBuckets,
),
```

where the bucket boundaries are further apart around ~200ms ([193.069772ms, 517.947467ms]).
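Plugging those boundaries into the interpolation formula quoted above explains the jump. A minimal standalone sketch (illustrative numbers, not CockroachDB code): if every observed latency is ~230ms, all samples land in the single (193.069772ms, 517.947467ms] bucket and the interpolated p90 is determined by the bucket boundaries alone:

```go
package main

import "fmt"

func main() {
	// Suppose all 1,000 observed connection latencies are ~230ms, so every
	// sample falls into the single IOLatencyBuckets bucket
	// (193.069772ms, 517.947467ms].
	const (
		bucketStart = 193.069772e6 // bucket lower bound, in nanoseconds
		bucketEnd   = 517.947467e6 // bucket upper bound, in nanoseconds
		count       = 1000.0       // samples in this bucket
		rank        = 900.0        // rank of the p90 sample within the bucket
	)
	// Same linear interpolation as the snippet quoted above.
	p90 := bucketStart + (bucketEnd-bucketStart)*(rank/count)
	fmt.Printf("estimated p90: %.1fms (actual latencies: ~230ms)\n", p90/1e6)
	// Prints ~485.5ms: the estimate is determined by the bucket boundaries,
	// not by where the samples actually sit inside the bucket.
}
```

So the graph reports roughly 485-500ms for p90 even though the underlying latencies never moved off ~230ms.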

@abarganier
Contributor

I don't disagree that it's a bit of a heavy-handed approach. However, at the same time, I don't feel comfortable defining new sets of buckets without an accompanying cluster setting to disable the new histograms altogether. The benefit of a reduced number of buckets is negated if our histograms are misleading and no longer useful.

Unfortunately, the team does not have capacity right now to put our other initiatives on the shelf to introduce a new cluster setting and the ability to toggle back and forth between the old and new histograms. Given the prevalence of customer complaints, the priority for now is to make histograms useful again, so the team has opted to take the path of least resistance to get that done as soon as possible. We could attempt to fix the buckets by defining new ones, but what if those are also found to have issues? Then we're back to square one, in the same situation, which is really bad optics for the customers who have widely noticed this issue. This is why I don't feel comfortable with this solution unless it's accompanied by a cluster setting to toggle between the new and old histograms, and that's a non-trivial change AFAIK.

Reintroducing the new histograms with an accompanying cluster setting will be fairly high priority for the team, but we have bigger commitments on our plate right now. However, I expect we'll be able to get it in before the 23.1 branch cut (along with a backport of the fix to 22.2), so keep in mind that this is only temporary.

@abarganier
Contributor

Hey all - the team has discussed this at length and we've decided against the revert. We believe that it will create toil for customers, and that we can provide a better solution in time for the upcoming v22.2.4 patch release.

Here's the revised plan:

  • Provide a cluster setting that enables customers to disable the new histogram buckets in favor of the legacy hdr histograms. This will require a node restart to take effect, but provides a way to revert back to the older histograms if issues arise with the newer histograms and/or their pre-defined buckets.

  • For the new histograms, update all pre-defined bucket types with a significant increase in granularity. Right now, each pre-defined bucket type only has 15 buckets. We will be increasing this to 60 for each of them, which should increase the fidelity significantly. If folks feel that a bucket should be even higher fidelity, we can work that out in the PR review. We are open to feedback on how to best serve different use cases.
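For a rough sense of what 60 buckets per type could look like, here is an illustrative sketch (not the generator or the exact boundaries the team will ship) that spaces 60 boundaries exponentially over the same 500µs to 1s range as the NetworkLatencyBuckets quoted earlier:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// exponentialBuckets returns count boundaries spaced exponentially between
// min and max (inclusive), mirroring the spacing of the existing 15-bucket
// sets but at a higher resolution.
func exponentialBuckets(min, max float64, count int) []float64 {
	buckets := make([]float64, count)
	ratio := math.Pow(max/min, 1/float64(count-1))
	for i := range buckets {
		buckets[i] = min * math.Pow(ratio, float64(i))
	}
	return buckets
}

func main() {
	// Same 500µs-1s range as NetworkLatencyBuckets above, but with 60
	// boundaries instead of 15 (values are in nanoseconds).
	for _, b := range exponentialBuckets(500e3, 1e9, 60) {
		fmt.Printf("%18.6f, // %v\n", b, time.Duration(int64(b)))
	}
}
```

With 60 boundaries, adjacent buckets differ by roughly 14% instead of ~72%, so an interpolated percentile can be off by at most a few percent rather than by a factor of two.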

@dhartunian and I have already made good progress on this approach. We will continue posting updates here as we have them.

abarganier pushed a commit to dhartunian/cockroach that referenced this issue Feb 1, 2023
Addresses cockroachdb#95833

This patch reintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against this
happening again with no real mitigation; ideally, this
environment variable should never have to be used).

Note: some histograms were introduced *after* the new
Prometheus histograms were added to CockroachDB. In this
case, we use the `ForceUsePrometheus` option in the
`HistogramOptions` struct to ignore the value of the env
var, since there never was a time where these specific
histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes.
**Note that this is not recommended** unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.
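As a rough illustration of the gating described in that commit message (the constructor and helper below are hypothetical stand-ins, not CockroachDB's actual HistogramOptions plumbing), the fallback logic has roughly this shape:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// useHdrHistograms reports whether the operator asked to fall back to the
// legacy HdrHistogram model via COCKROACH_ENABLE_HDR_HISTOGRAMS.
func useHdrHistograms() bool {
	v, err := strconv.ParseBool(os.Getenv("COCKROACH_ENABLE_HDR_HISTOGRAMS"))
	return err == nil && v
}

// newLatencyHistogram is a hypothetical constructor: it only picks the legacy
// model when the env var is set and the metric is not forced onto the
// Prometheus model (the equivalent of ForceUsePrometheus, for histograms that
// never existed in the HdrHistogram era).
func newLatencyHistogram(name string, forceUsePrometheus bool) string {
	if useHdrHistograms() && !forceUsePrometheus {
		return name + " -> legacy HdrHistogram model"
	}
	return name + " -> Prometheus model with static buckets"
}

func main() {
	fmt.Println(newLatencyHistogram("sql.conn.latency", false))
	fmt.Println(newLatencyHistogram("some.newer.metric", true))
}
```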
abarganier pushed a commit to dhartunian/cockroach that referenced this issue Feb 2, 2023
@craig craig bot closed this as completed in dd97d0c Feb 3, 2023
abarganier pushed a commit to abarganier/cockroach that referenced this issue Feb 3, 2023
abarganier pushed a commit to abarganier/cockroach that referenced this issue Feb 6, 2023
craig bot pushed a commit that referenced this issue Feb 6, 2023
96504: Revert "sql: improve stack trace for get-user-timeout timeouts" r=ecwall a=rafiss

This reverts commit 662e19c.

gRPC does not expect the context to be wrapped, so it can get confused.

informs #95794
Release note: None

96516: pkg/util/metric: fix HistogramMode for streamingingest histograms r=dhartunian a=abarganier

Ref: #96514

The above patch migrated all histogram constructors over to
use a generic constructor, which has the ability to specify
a HistogramMode. While preparing a backport of this patch,
I noticed that the following streamingest histograms were
incorrectly tagged with a HistogramMode of
HistogramModePreferHdrLatency:

1. FlushHistNanos
2. CommitLatency
3. AdmitLatency

If you look at the original migration, when HDR was still being
used, these histograms were *not* using the HDR latency defaults:

a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143

This patch fixes these histograms to no longer incorrectly specify
the HistogramModePreferHdrLatency mode in the histogram options.

This was also fixed in the backport PR #96514.

Release note: none

Part of #95833

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
@rytaft rytaft added the C-technical-advisory label Dec 6, 2023
@rytaft rytaft added the branch-release-22.2 label Dec 6, 2023