-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: histogram calculation results in reporting service latency spikes #103814
Comments
xref #98266. |
A summary of some offline discussions I have had today: The root cause of this issue and #98266 is that we removed the logic to merge views between the histogram windows without realizing it. This ensured the histogram always had enough values for sufficient granularity in its percentile and average calculations. The fix would be to implement a merge function and use it for all derived windowed stats. After spending some time thinking about this issue, I think #98621 is moot since it will make our problem even worse. It will reduce the "window" to the polling duration, which we do not want. Instead, we want to keep windowing to 60s (as it is set currently) and have two 30s windows for prev and current histograms. When we do calculations, we merge the previous with the current one. This logic will make it so that our windowing goes back to how it was before the refactoring in #85990. TLDR:
|
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
103729: log: add headers, compression config to http servers r=healthy-pod a=dhartunian This commit adds support for custom headers and gzip compression on requests made to `http-server` outputs in CRDB's logging configuration. Custom headers enable the inclusion of API keys for 3rd party sinks and gzip compression reduces network resource consumption. Resolves #103477 Release note (ops change): `http-defaults` and `http-servers` sections of the log config will now accept a `headers` field containing a map of key value string pairs which will comprise custom HTTP headers appended to every request. Additionally a `compression` value is support which can be set to `gzip` or `none` to select a compression method for the HTTP requesst body. By default `gzip` is selected. This is a change from previous functionality that did not compress by default. 103963: kvclient: add x-region, x-zone metrics to DistSender r=healthy-pod a=wenyihu6 Previously, there were no metrics to observe cross-region, cross-zone traffic in batch requests / responses at DistSender. To improve this issue, this commit introduces six new distsender metrics - ``` "distsender.batch_requests.replica_addressed.bytes" "distsender.batch_responses.replica_addressed.bytes" "distsender.batch_requests.cross_region.bytes" "distsender.batch_responses.cross_region.bytes" "distsender.batch_requests.cross_zone.bytes" "distsender.batch_responses.cross_zone.bytes" ``` The first two metrics track the total byte count of batch requests processed and batch responses received at DistSender. Additionally, there are four metrics to track the aggregate counts processed and received across different regions and zones. Note that these metrics only track the sender node and not the receiver node as DistSender resides on the gateway node receiving SQL queries. Part of: #103983 Release note (ops change): Six new metrics - "distsender.batch_requests.replica_addressed.bytes", "distsender.batch_responses.replica_addressed.bytes", "distsender.batch_requests.cross_region.bytes", "distsender.batch_responses.cross_region.bytes", "distsender.batch_requests.cross_zone.bytes", "distsender.batch_responses.cross_zone.bytes"- are now added to DistSender metrics. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. 104088: metrics: fix windowed histogram merging approach r=ericharmeling a=ericharmeling Fixes #103814. Fixes #98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. 104376: sql: CREATEROLE now includes ability to grant non-admin roles r=rafiss a=rafiss fixes #104371 This matches the PostgreSQL behavior. Release note (security update): Users who have the CREATEROLE role option can now grant and revoke role membership in any non-admin role. This change also removes the sql.auth.createrole_allows_grant_role_membership.enabled cluster setting, which was added in v23.1. Now, the cluster setting is effectively always true. 104802: kvserver: prevent split at invalid tenant prefix keys r=arulajmani a=tbg Closes #104796. Epic: None Release note (bug fix): prevents invalid splits that can crash (and prevent restarts) of nodes that hold a replica for the right-hand side. Co-authored-by: David Hartunian <[email protected]> Co-authored-by: Wenyi <[email protected]> Co-authored-by: Eric Harmeling <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Tobias Grieger <[email protected]>
Fixes #103814. Fixes #98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
Upgrading from 22.1 to 22.2 might result in a spike in the service latencies reported in the Service Latency: SQL Statements Metrics graphs on the DB Console.
This regression might be due to changes to the percentile calculations introduced in #86671.
We've reproduced this regression on a single-node local cluster with a TPCC workload:
Interestingly, changing the downsampler from
MAX
toAVG
results in the latencies for both versions to look the same.To reproduce locally:
t1
t2
t1
t2
Jira issue: CRDB-28215
The text was updated successfully, but these errors were encountered: