-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metric: changes in histogram buckets affects the connection latency graph in db console #95833
Comments
cc @abarganier assigning to you as you're currently investigating it |
Okay, here's the scoop. This is caused by switching from the older These buckets are not granular enough, and their value ranges don't make sense for measuring SQL latencies. This is the cause of the misleading latency values - we have less precision and therefore these ~200ms latencies are ending up in the ~517ms bucket. We're going to backport a revert of #86671 into 22.2. This allows us to maintain the code for the newer histogram while reverting the metrics themselves to use the older histograms. This will allow us to fix the issue as fast as possible for customers without derailing the team's other initiatives. Once we finish delivering on our outstanding epics for 23.1, we can do a proper fix. A proper fix involves A) defining more appropriate & granular histogram buckets, and B) creating a cluster setting that allows users to optionally enable the new histograms (disabled by default). We'll also be testing out the impact on continuity here. If customers have been using Prometheus on v22.2 for the past month (for example), what's the impact on Prometheus of the buckets suddenly changing back to the older form? We seek to define that alongside our fix. We'll also provide a technical advisory and documentation for TSE on this issue. |
Will the revert undo the benefits of #85990, which drastically reduced the bucket count for most histograms? The wholesale revert, to me, feels very heavy handed, and especially for a backport, given what we've spelled out for proper fixes (defining more appropriate bucket boundaries, perhaps even narrowly targeting the connection latency graph). Above you linked to the IO latency buckets, but for this connection latency histogram, aren't we using the network latency buckets defined here? cockroach/pkg/util/metric/histogram_buckets.go Lines 40 to 57 in a82aa82
So we have bucket boundaries [196.171733ms, 337.616984ms, 581.048177ms], which when computing various percentiles, we interpolate within: cockroach/pkg/util/metric/metric.go Lines 665 to 674 in 4a2d06a
So I'm a bit confused about how, with those bucket boundaries and that linear interpolation code, we could have "~200ms latencies [..] ending up in the ~517ms bucket". |
Oh I was looking at the wrong thing, sorry for the noise. We are indeed using the IOLatencyBuckets here. cockroach/pkg/sql/pgwire/server.go Lines 276 to 278 in 2f81142
|
I don't disagree that it's a bit heavy handed of an approach. However, at the same time, I don't feel comfortable defining new sets of buckets to use without an accompanying cluster setting to disable using the new histograms altogether. The benefits of reduced #'s of buckets is negated if our histograms are misleading and no longer useful. Unfortunately, the team does not have capacity right now to put our other initiatives on the shelf to introduce a new cluster setting and the ability to toggle back and forth between the old and new histograms. Given the prevalence of customer complaints, the priority for now is to make histograms useful again, so the team has opted to take the path of least resistance to get that done as soon as possible. We could attempt to fix the buckets by defining new ones, but what if those are also found to have issues? Then we're back to square one and we find ourselves in the same situation, which is really bad optics when it comes to our customers who have widely noticed this issue. This is why I don't feel comfortable with this solution unless it's accompanied by a cluster setting to toggle between the new & old histograms, and that's a non-trivial change AFAIK. Reintroducing the new histograms with an accompanying cluster setting will be fairly high priority for the team, but we have bigger commitments on our plate right now. However, I expect we'll be able to get it in before the 23.1 branch cut (along with a backport of the fix to 22.2), so keep in mind that this is only temporary. |
Hey all - the team has discussed this at length and we've decided against the revert. We believe that it will create toil for customers, and that we can provide a better solution in time for the upcoming v22.2.4 patch release. Here's the revised plan:
@dhartunian and I have already made good progress on this approach. We will continue posting updates here as we have them. |
Addresses cockroachdb#95833 This patch reeintroduces the old HdrHistogram model to optionally be enabled in favor of the new Prometheus model, gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, allowing users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this from happening again with no real mitigation - ideally, this environment variable should never have to be used). Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model. Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
Addresses cockroachdb#95833 This patch reeintroduces the old HdrHistogram model to optionally be enabled in favor of the new Prometheus model, gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, allowing users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this from happening again with no real mitigation - ideally, this environment variable should never have to be used). Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model. Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
Addresses cockroachdb#95833 This patch reeintroduces the old HdrHistogram model to optionally be enabled in favor of the new Prometheus model, gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, allowing users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this from happening again with no real mitigation - ideally, this environment variable should never have to be used). Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model. Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
Addresses cockroachdb#95833 This patch reeintroduces the old HdrHistogram model to optionally be enabled in favor of the new Prometheus model, gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, allowing users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this from happening again with no real mitigation - ideally, this environment variable should never have to be used). Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model. Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
Addresses cockroachdb#95833 This patch reeintroduces the old HdrHistogram model to optionally be enabled in favor of the new Prometheus model, gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, allowing users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this from happening again with no real mitigation - ideally, this environment variable should never have to be used). Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model. Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
96504: Revert "sql: improve stack trace for get-user-timeout timeouts" r=ecwall a=rafiss This reverts commit 662e19c. gRPC does not expect the context to be wrapped, so it can get confused. informs #95794 Release note: None 96516: pkg/util/metric: fix HistogramMode for streamingingest histograms r=dhartunian a=abarganier Ref: #96514 The above patch migrated all histogram constructors over to use a generic constructor, which has the ability to specify a HistogramMode. While preparing a backport of this patch, I noticed that the following streamingest histograms were incorrectly tagged with a HistogramMode of HistogramModePreferHdrLatency: 1. FlushHistNanos 2. CommitLatency 3. AdmitLatency If you look at the original migration, when HDR was still being used, these histograms were *not* using the HDR latency defaults: a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143 This patch fixes these histograms to no longer incorrectly specify the HistogramModePreferHdrLatency mode in the histogram options. This was also fixed in the backport PR #96514. Release note: none Part of #95833 Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Alex Barganier <[email protected]>
While investigating #95627, we noticed that the p90 connection latency from a local client to a server on roachprod jumped from ~230ms to ~500ms since a82aa82.
Before:
After:
While the client latency (i.e. the time for the query to finish from the client side) are the same, ~500ms.
How to repro
cockroach-pre
andcockroach-post
).$USER-test-pre
and$USER-test-post
)./prepare.sh pre
or./prepare.sh post
). Note that you'll need to create theadd_user.sql
as well, as specified here.----- round 1 -----
from the terminal stdout), go to the db console and check theconnection latency
under thedist SQL
section.Jira issue: CRDB-23769
The text was updated successfully, but these errors were encountered: