Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: histogram calculation results in reporting service latency spikes #103814

Closed
ericharmeling opened this issue May 23, 2023 · 2 comments · Fixed by #104088
Closed

metrics: histogram calculation results in reporting service latency spikes #103814

ericharmeling opened this issue May 23, 2023 · 2 comments · Fixed by #104088
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs

Comments

@ericharmeling
Copy link
Contributor

ericharmeling commented May 23, 2023

Upgrading from 22.1 to 22.2 might result in a spike in the service latencies reported in the Service Latency: SQL Statements Metrics graphs on the DB Console.

This regression might be due to changes to the percentile calculations introduced in #86671.

We've reproduced this regression on a single-node local cluster with a TPCC workload:

image
image

Interestingly, changing the downsampler from MAX to AVG results in the latencies for both versions to look the same.

image

To reproduce locally:

t1

cockroach-22-1 start-single-node --insecure

t2

cockroach-22-1 workload init tpcc
cockroach-22-1 workload run tpcc

t1

ctrl+c
export COCKROACH_ENABLE_HDR_HISTOGRAMS=true
cockroach-22-2 start-single-node --insecure

t2

cockroach-22-1 workload run tpcc

Jira issue: CRDB-28215

@ericharmeling ericharmeling added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels May 23, 2023
@ericharmeling ericharmeling self-assigned this May 23, 2023
@kvoli
Copy link
Collaborator

kvoli commented May 24, 2023

xref #98266.

@irfansharif irfansharif added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 24, 2023
@aadityasondhi
Copy link
Collaborator

aadityasondhi commented May 24, 2023

A summary of some offline discussions I have had today:

The root cause of this issue and #98266 is that we removed the logic to merge views between the histogram windows without realizing it. This ensured the histogram always had enough values for sufficient granularity in its percentile and average calculations. The fix would be to implement a merge function and use it for all derived windowed stats.

https://github.com/cockroachdb/cockroach/pull/85990/files#diff-6f6b5fc2c3dd33a03f71735caae58124ed2714538d887da0a27ced6eb1993e46L54

After spending some time thinking about this issue, I think #98621 is moot since it will make our problem even worse. It will reduce the "window" to the polling duration, which we do not want. Instead, we want to keep windowing to 60s (as it is set currently) and have two 30s windows for prev and current histograms. When we do calculations, we merge the previous with the current one. This logic will make it so that our windowing goes back to how it was before the refactoring in #85990.

TLDR:

ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 1, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 1, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 2, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 5, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 9, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
craig bot pushed a commit that referenced this issue Jun 13, 2023
103729: log: add headers, compression config to http servers r=healthy-pod a=dhartunian

This commit adds support for custom headers and gzip compression on requests made to `http-server` outputs in CRDB's logging configuration. Custom headers enable the inclusion of API keys for 3rd party sinks and gzip compression reduces network resource consumption.

Resolves #103477

Release note (ops change): `http-defaults` and `http-servers` sections of the log config will now accept a `headers` field containing a map of key value string pairs which will comprise custom HTTP headers appended to every request. Additionally a `compression` value is support which can be set to `gzip` or `none` to select a compression method for the HTTP requesst body. By default `gzip` is selected. This is a change from previous functionality that did not compress by default.

103963: kvclient: add x-region, x-zone metrics to DistSender r=healthy-pod a=wenyihu6

Previously, there were no metrics to observe cross-region, cross-zone traffic in
batch requests / responses at DistSender.

To improve this issue, this commit introduces six new distsender metrics -
```
"distsender.batch_requests.replica_addressed.bytes"
"distsender.batch_responses.replica_addressed.bytes"
"distsender.batch_requests.cross_region.bytes"
"distsender.batch_responses.cross_region.bytes"
"distsender.batch_requests.cross_zone.bytes"
"distsender.batch_responses.cross_zone.bytes"
```

The first two metrics track the total byte count of batch requests processed and
batch responses received at DistSender. Additionally, there are four metrics to
track the aggregate counts processed and received across different regions and
zones. Note that these metrics only track the sender node and not the receiver
node as DistSender resides on the gateway node receiving SQL queries.

Part of: #103983

Release note (ops change): Six new metrics -
"distsender.batch_requests.replica_addressed.bytes",
"distsender.batch_responses.replica_addressed.bytes",
"distsender.batch_requests.cross_region.bytes",
"distsender.batch_responses.cross_region.bytes",
"distsender.batch_requests.cross_zone.bytes",
"distsender.batch_responses.cross_zone.bytes"- are now added to DistSender 
metrics.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

104088: metrics: fix windowed histogram merging approach r=ericharmeling a=ericharmeling

Fixes #103814.
Fixes #98266.

This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console.

104376: sql: CREATEROLE now includes ability to grant non-admin roles r=rafiss a=rafiss

fixes #104371

This matches the PostgreSQL behavior.

Release note (security update): Users who have the CREATEROLE role
option can now grant and revoke role membership in any non-admin role.

This change also removes the sql.auth.createrole_allows_grant_role_membership.enabled
cluster setting, which was added in v23.1. Now, the cluster setting is
effectively always true.

104802: kvserver: prevent split at invalid tenant prefix keys r=arulajmani a=tbg

Closes #104796.

Epic: None
Release note (bug fix): prevents invalid splits that can crash (and prevent
restarts) of nodes that hold a replica for the right-hand side.


Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Wenyi <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in bae5045 Jun 13, 2023
blathers-crl bot pushed a commit that referenced this issue Jun 13, 2023
Fixes #103814.
Fixes #98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 14, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 15, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants