
metric: changes in histogram buckets affect the connection latency graph in db console #95833

Closed
ZhouXing19 opened this issue Jan 25, 2023 · 6 comments · Fixed by #96029
Labels
A-observability-inf, branch-release-22.2, C-bug, C-technical-advisory

Comments

@ZhouXing19
Collaborator

ZhouXing19 commented Jan 25, 2023

While investigating #95627, we noticed that the p90 connection latency from a local client to a server on roachprod jumped from ~230ms to ~500ms since a82aa82.

Before:
Screenshot 2023-01-24 at 18 42 26

After:
Screenshot 2023-01-24 at 18 43 14

Meanwhile, the client latency (i.e. the time for the query to finish, measured from the client side) is the same in both cases, ~500ms.

How to repro

  1. Build the Linux binaries before and after commit a82aa82 (call them cockroach-pre and cockroach-post).
  2. Create 2 roachprod clusters (i.e. $USER-test-pre and $USER-test-post).
  3. Use this script for the testing (i.e. run ./prepare.sh pre or ./prepare.sh post). Note that you'll need to create the add_user.sql file as well, as specified here.
  4. Once the connections start being established (i.e. you can see ----- round 1 ----- in the terminal stdout), go to the db console and check the connection latency graph under the dist SQL section.

Jira issue: CRDB-23769

@ZhouXing19 ZhouXing19 added the C-bug label Jan 25, 2023
@thtruo
Contributor

thtruo commented Jan 25, 2023

cc @abarganier, assigning this to you since you're currently investigating it.

@abarganier
Contributor

Okay, here's the scoop.

This is caused by switching from the older hdrhistogram, which dynamically created its own buckets based on maxVal and significant-figures parameters, to Prometheus-based histograms, where the histogram buckets are explicitly defined, such as: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/histogram_buckets.go#L17-L34

These buckets are not granular enough, and their value ranges don't make sense for measuring SQL latencies. This is the cause of the misleading latency values: we have less precision, so these ~200ms latencies end up in the ~517ms bucket.
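To make the precision difference concrete, here is a small standalone Go sketch (illustrative only, not CockroachDB code; the max-value and significant-figures parameters are assumptions) contrasting how an HdrHistogram resolves ~230ms samples with what a fixed-bucket histogram can report when its nearest bucket boundaries are ~193ms and ~518ms:

```go
package main

import (
	"fmt"
	"time"

	hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

func main() {
	// HdrHistogram derives its buckets from the max trackable value and the
	// requested number of significant figures (both values here are
	// illustrative assumptions, not CockroachDB's actual parameters).
	h := hdrhistogram.New(1, time.Minute.Nanoseconds(), 2 /* significant figures */)
	for i := 0; i < 1000; i++ {
		_ = h.RecordValue((230 * time.Millisecond).Nanoseconds())
	}
	fmt.Println("hdr p90:", time.Duration(h.ValueAtQuantile(90))) // ~230ms

	// A fixed-bucket histogram only records that every sample fell into the
	// single bucket (193.069772ms, 517.947467ms], so any percentile it
	// reports has to be reconstructed from those two boundaries rather than
	// from the underlying ~230ms values.
	fmt.Println("fixed-bucket p90: somewhere in (193ms, 518ms]")
}
```

With the fixed buckets, the histogram only knows how many samples fell between the two boundaries, so any percentile it reports is reconstructed from those boundaries rather than from the actual values observed.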

We're going to backport a revert of #86671 into 22.2. This allows us to maintain the code for the newer histogram while reverting the metrics themselves to use the older histograms.

This will allow us to fix the issue as fast as possible for customers without derailing the team's other initiatives. Once we finish delivering on our outstanding epics for 23.1, we can do a proper fix. A proper fix involves A) defining more appropriate & granular histogram buckets, and B) creating a cluster setting that allows users to optionally enable the new histograms (disabled by default).

We'll also be testing out the impact on continuity here. If customers have been using Prometheus on v22.2 for the past month (for example), what's the impact on Prometheus of the buckets suddenly changing back to the older form? We seek to define that alongside our fix.

We'll also provide a technical advisory and documentation for TSE on this issue.

@irfansharif
Contributor

> This allows us to maintain the code for the newer histogram while reverting the metrics themselves to use the older histograms.

Will the revert undo the benefits of #85990, which drastically reduced the bucket count for most histograms? The wholesale revert feels very heavy-handed to me, especially for a backport, given what we've spelled out for proper fixes (defining more appropriate bucket boundaries, perhaps even narrowly targeting the connection latency graph).

Above you linked to the IO latency buckets, but for this connection latency histogram, aren't we using the network latency buckets defined here?

```go
var NetworkLatencyBuckets = []float64{
	// Generated via TestHistogramBuckets/NetworkLatencyBuckets.
	500000.000000,    // 500µs
	860513.842995,    // 860.513µs
	1480968.147973,   // 1.480968ms
	2548787.184731,   // 2.548787ms
	4386533.310619,   // 4.386533ms
	7549345.273094,   // 7.549345ms
	12992632.226094,  // 12.992632ms
	22360679.774998,  // 22.360679ms
	38483348.970335,  // 38.483348ms
	66230909.027573,  // 66.230909ms
	113985228.104760, // 113.985228ms
	196171733.362212, // 196.171733ms
	337616984.325077, // 337.616984ms
	581048177.284016, // 581.048177ms
	999999999.999999, // 999.999999ms
}
```

So we have bucket boundaries [196.171733ms, 337.616984ms, 581.048177ms], which we interpolate within when computing various percentiles:

```go
// Calculate the linearly interpolated value within the bucket.
if b > 0 {
	bucketStart = *buckets[b-1].UpperBound
	count -= *buckets[b-1].CumulativeCount
	rank -= *buckets[b-1].CumulativeCount
}
val := bucketStart + (bucketEnd-bucketStart)*(float64(rank)/float64(count))
if math.IsNaN(val) || math.IsInf(val, -1) {
	return 0
}
```

So I'm a bit confused about how, with those bucket boundaries and that linear interpolation code, we could have "~200ms latencies [..] ending up in the ~517ms bucket".

@irfansharif
Contributor

Oh I was looking at the wrong thing, sorry for the noise. We are indeed using the IOLatencyBuckets here.

```go
ConnLatency: metric.NewHistogram(
	MetaConnLatency, histogramWindow, metric.IOLatencyBuckets,
),
```

where the bucket boundaries are further apart around ~200ms ([193.069772ms, 517.947467ms]).
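Plugging those boundaries into the interpolation formula quoted above explains the jump. A minimal standalone sketch (illustrative numbers, not CockroachDB code): if every observed latency is ~230ms, all samples land in the single (193.069772ms, 517.947467ms] bucket and the interpolated p90 is determined by the bucket boundaries alone:

```go
package main

import "fmt"

func main() {
	// Suppose all 1,000 observed connection latencies are ~230ms, so every
	// sample falls into the single IOLatencyBuckets bucket
	// (193.069772ms, 517.947467ms].
	const (
		bucketStart = 193.069772e6 // bucket lower bound, in nanoseconds
		bucketEnd   = 517.947467e6 // bucket upper bound, in nanoseconds
		count       = 1000.0       // samples in this bucket
		rank        = 900.0        // rank of the p90 sample within the bucket
	)
	// Same linear interpolation as the snippet quoted above.
	p90 := bucketStart + (bucketEnd-bucketStart)*(rank/count)
	fmt.Printf("estimated p90: %.1fms (actual latencies: ~230ms)\n", p90/1e6)
	// Prints ~485.5ms: the estimate is determined by the bucket boundaries,
	// not by where the samples actually sit inside the bucket.
}
```

So the graph reports roughly 485-500ms for p90 even though the underlying latencies never moved off ~230ms.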

@abarganier
Contributor

I don't disagree that it's a bit of a heavy-handed approach. However, at the same time, I don't feel comfortable defining new sets of buckets without an accompanying cluster setting to disable the new histograms altogether. The benefit of a reduced number of buckets is negated if our histograms are misleading and no longer useful.

Unfortunately, the team does not have capacity right now to put our other initiatives on the shelf to introduce a new cluster setting and the ability to toggle back and forth between the old and new histograms. Given the prevalence of customer complaints, the priority for now is to make histograms useful again, so the team has opted to take the path of least resistance to get that done as soon as possible. We could attempt to fix the buckets by defining new ones, but what if those are also found to have issues? Then we're back to square one, in the same situation, which is really bad optics for the customers who have widely noticed this issue. This is why I don't feel comfortable with this solution unless it's accompanied by a cluster setting to toggle between the new and old histograms, and that's a non-trivial change AFAIK.

Reintroducing the new histograms with an accompanying cluster setting will be fairly high priority for the team, but we have bigger commitments on our plate right now. However, I expect we'll be able to get it in before the 23.1 branch cut (along with a backport of the fix to 22.2), so keep in mind that this is only temporary.

@abarganier
Contributor

Hey all - the team has discussed this at length and we've decided against the revert. We believe that it will create toil for customers, and that we can provide a better solution in time for the upcoming v22.2.4 patch release.

Here's the revised plan:

  • Provide a cluster setting that enables customers to disable the new histogram buckets in favor of the legacy hdr histograms. This will require a node restart to take effect, but provides a way to revert back to the older histograms if issues arise with the newer histograms and/or their pre-defined buckets.

  • For the new histograms, update all pre-defined bucket types with a significant increase in granularity. Right now, each pre-defined bucket type only has 15 buckets. We will be increasing this to 60 for each of them, which should increase the fidelity significantly. If folks feel that a bucket should be even higher fidelity, we can work that out in the PR review. We are open to feedback on how to best serve different use cases.
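For a rough sense of what 60 buckets per type could look like, here is an illustrative sketch (not the generator or the exact boundaries the team will ship) that spaces 60 boundaries exponentially over the same 500µs to 1s range as the NetworkLatencyBuckets quoted earlier:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// exponentialBuckets returns count boundaries spaced exponentially between
// min and max (inclusive), mirroring the spacing of the existing 15-bucket
// sets but at a higher resolution.
func exponentialBuckets(min, max float64, count int) []float64 {
	buckets := make([]float64, count)
	ratio := math.Pow(max/min, 1/float64(count-1))
	for i := range buckets {
		buckets[i] = min * math.Pow(ratio, float64(i))
	}
	return buckets
}

func main() {
	// Same 500µs-1s range as NetworkLatencyBuckets above, but with 60
	// boundaries instead of 15 (values are in nanoseconds).
	for _, b := range exponentialBuckets(500e3, 1e9, 60) {
		fmt.Printf("%18.6f, // %v\n", b, time.Duration(int64(b)))
	}
}
```

With 60 boundaries, adjacent buckets differ by roughly 14% instead of ~72%, so an interpolated percentile can be off by at most a few percent rather than by a factor of two.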

@dhartunian and I have already made good progress on this approach. We will continue posting updates here as we have them.

abarganier pushed a commit to dhartunian/cockroach that referenced this issue Feb 1, 2023
Addresses cockroachdb#95833

This patch reintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against this
happening again with no real mitigation; ideally, this
environment variable should never have to be used).

Note: some histograms were introduced *after* the new
Prometheus histograms were added to CockroachDB. In this
case, we use the `ForceUsePrometheus` option in the
`HistogramOptions` struct to ignore the value of the env
var, since there never was a time where these specific
histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes.
**Note that this is not recommended** unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.
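As a rough illustration of the gating described in that commit message (the constructor and helper below are hypothetical stand-ins, not CockroachDB's actual HistogramOptions plumbing), the fallback logic has roughly this shape:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// useHdrHistograms reports whether the operator asked to fall back to the
// legacy HdrHistogram model via COCKROACH_ENABLE_HDR_HISTOGRAMS.
func useHdrHistograms() bool {
	v, err := strconv.ParseBool(os.Getenv("COCKROACH_ENABLE_HDR_HISTOGRAMS"))
	return err == nil && v
}

// newLatencyHistogram is a hypothetical constructor: it only picks the legacy
// model when the env var is set and the metric is not forced onto the
// Prometheus model (the equivalent of ForceUsePrometheus, for histograms that
// never existed in the HdrHistogram era).
func newLatencyHistogram(name string, forceUsePrometheus bool) string {
	if useHdrHistograms() && !forceUsePrometheus {
		return name + " -> legacy HdrHistogram model"
	}
	return name + " -> Prometheus model with static buckets"
}

func main() {
	fmt.Println(newLatencyHistogram("sql.conn.latency", false))
	fmt.Println(newLatencyHistogram("some.newer.metric", true))
}
```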
abarganier pushed a commit to dhartunian/cockroach that referenced this issue Feb 2, 2023
@craig craig bot closed this as completed in dd97d0c Feb 3, 2023
abarganier pushed a commit to abarganier/cockroach that referenced this issue Feb 3, 2023
abarganier pushed a commit to abarganier/cockroach that referenced this issue Feb 6, 2023
craig bot pushed a commit that referenced this issue Feb 6, 2023
96504: Revert "sql: improve stack trace for get-user-timeout timeouts" r=ecwall a=rafiss

This reverts commit 662e19c.

gRPC does not expect the context to be wrapped, so it can get confused.

informs #95794
Release note: None

96516: pkg/util/metric: fix HistogramMode for streamingingest histograms r=dhartunian a=abarganier

Ref: #96514

The above patch migrated all histogram constructors over to
use a generic constructor, which has the ability to specify
a HistogramMode. While preparing a backport of this patch,
I noticed that the following streamingest histograms were
incorrectly tagged with a HistogramMode of
HistogramModePreferHdrLatency:

1. FlushHistNanos
2. CommitLatency
3. AdmitLatency

If you look at the original migration, when HDR was still being
used, these histograms were *not* using the HDR latency defaults:

a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143

This patch fixes these histograms to no longer incorrectly specify
the HistogramModePreferHdrLatency mode in the histogram options.

This was also fixed in the backport PR #96514.

Release note: none

Part of #95833

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
@rytaft rytaft added the C-technical-advisory label Dec 6, 2023
@rytaft rytaft added the branch-release-22.2 label Dec 6, 2023