feat(slo): Avoid false positive burn rate alerting with partial rolled-up data #203279

Merged
merged 12 commits into from
Jan 7, 2025

Conversation

kdelemme
Contributor

@kdelemme kdelemme commented Dec 6, 2024

Resolves #190143

🏇🏻 Summary

This PR makes the burn rate rule less prone to false-positive alerts when the rolled-up data is not yet fully computed for the burn rate window, or when the rolled-up data is intermittent, e.g. for the SLO of a low-traffic service.

This is achieved by computing the burn rate from the observed bad events (observed total events minus observed good events) and the total number of slices expected for the given window duration and SLO timeslice duration. For example, a 1h window contains 60 1min slices or 30 2min slices.

For example, if we observe only 1 total event and 0 good events during a 1h window (using 1min slices), the new burn rate becomes (1/60)/(1-objective) instead of 1/(1-objective). The new burn rate is 60x smaller than the previous one, which would avoid triggering the alert.
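The arithmetic above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names `expectedSlices`, `previousBurnRate` and `newBurnRate` are hypothetical, and the "previous" formula is assumed to be the observed bad ratio divided by the error budget, which matches the 1/(1-objective) figure quoted in the summary.

```typescript
// Hypothetical helpers for illustration only; names are not taken from the Kibana source.

// Number of timeslices expected in a window, e.g. 60 one-minute slices in a 60-minute window.
function expectedSlices(windowMinutes: number, sliceMinutes: number): number {
  return Math.floor(windowMinutes / sliceMinutes);
}

// Assumed previous behaviour: bad ratio over the *observed* slices only.
function previousBurnRate(good: number, total: number, objective: number): number {
  if (total === 0) return 0;
  return (1 - good / total) / (1 - objective);
}

// New behaviour described in the summary: observed bad slices over the *expected* slices.
function newBurnRate(good: number, total: number, expected: number, objective: number): number {
  if (expected === 0) return 0;
  const bad = total - good;
  return bad / expected / (1 - objective);
}

// 1 observed slice, 0 good, 1h window with 1min slices, 99% objective (assumed value).
const slices = expectedSlices(60, 1);
const before = previousBurnRate(0, 1, 0.99);
const after = newBurnRate(0, 1, slices, 0.99);
console.log({ slices, before, after, ratio: before / after });
```

With a single bad slice observed, the denominator moves from the 1 observed slice to the 60 expected slices, which is where the 60x reduction comes from.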

Note

Did some housecleaning in the burn rate rule as well: added a slo.revision term filter and reused a shared function to generate the aggs keys.

🧬 Testing

The buildQuery test snapshots have been updated with the new expected aggs.

@kdelemme kdelemme added release_note:skip Skip the PR/issue when compiling release notes backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Team:obs-ux-management Observability Management User Experience Team v8.18.0 labels Dec 6, 2024
@kdelemme kdelemme self-assigned this Dec 6, 2024
@kdelemme kdelemme marked this pull request as ready for review December 6, 2024 19:46
@kdelemme kdelemme requested a review from a team as a code owner December 6, 2024 19:46
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

Comment on lines 160 to 168
const source = burnRateWindows
  .map((_windDef, index) => {
    const windowId = `${WINDOW}_${index}`;
    return `(params.${generateAboveThresholdKey(
      windowId,
      SHORT_WINDOW
    )} == 1 && params.${generateAboveThresholdKey(windowId, LONG_WINDOW)} == 1)`;
  })
  .join(' || ');
Contributor Author

@kdelemme kdelemme Dec 6, 2024


💬 I find map().join() easier to follow than a reduce with a ternary operator
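To illustrate the comparison, here is a small sketch that builds the same condition string both ways. The constant values and the `aboveThresholdKey` helper are stand-ins (the real `WINDOW`, `SHORT_WINDOW`, `LONG_WINDOW` and `generateAboveThresholdKey` live in the Kibana source and may differ), and the `reduce` variant is only an assumed shape of the code being replaced.

```typescript
// Stand-in values for illustration; the real constants and key helper may differ.
const WINDOW = 'WINDOW';
const SHORT_WINDOW = 'SHORT';
const LONG_WINDOW = 'LONG';

// Hypothetical replacement for generateAboveThresholdKey.
const aboveThresholdKey = (windowId: string, type: string): string =>
  `${windowId}_${type}_ABOVE_THRESHOLD`;

// One clause per burn rate window: both the short and the long window must be above threshold.
const clause = (index: number): string => {
  const windowId = `${WINDOW}_${index}`;
  const shortKey = aboveThresholdKey(windowId, SHORT_WINDOW);
  const longKey = aboveThresholdKey(windowId, LONG_WINDOW);
  return `(params.${shortKey} == 1 && params.${longKey} == 1)`;
};

// map().join(): build every clause, then glue them together with the separator.
function withMapJoin(windowCount: number): string {
  return Array.from({ length: windowCount }, (_unused, index) => clause(index)).join(' || ');
}

// reduce with a ternary: the separator handling is mixed into the accumulation step.
function withReduce(windowCount: number): string {
  return Array.from({ length: windowCount }, (_unused, index) => index).reduce(
    (acc, index) => (acc === '' ? clause(index) : `${acc} || ${clause(index)}`),
    ''
  );
}

console.log(withMapJoin(2));
console.log(withMapJoin(2) === withReduce(2));
```

Both variants produce the same string; the `map().join()` form keeps the separator logic out of the per-window callback.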

@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Dec 6, 2024
Contributor

github-actions bot commented Dec 6, 2024

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

term: { 'slo.instanceId': instanceId },
},
{ term: { 'slo.id': slo.id } },
{ term: { 'slo.revision': slo.revision } },
Contributor Author


💬 Filtering on slo.revision won't change much in most cases, but it cannot hurt when the SLO has been updated and data from the previous revision is still available when the rule runs.
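The effect of the extra term filter can be sketched with an in-memory stand-in. Everything here is hypothetical and simplified: `buildSloFilters` and `matchesAll` are illustrative names, the documents are flat records rather than real rolled-up documents, and the actual rule sends these term clauses to Elasticsearch as part of a larger query.

```typescript
// Hypothetical, simplified shapes; not the types used in the Kibana source.
interface SloRef {
  id: string;
  revision: number;
}

type TermFilter = { term: Record<string, string | number> };
type FlatDoc = Record<string, string | number>;

// Builds the three term filters shown in the diff context, including the new slo.revision one.
function buildSloFilters(slo: SloRef, instanceId: string): TermFilter[] {
  return [
    { term: { 'slo.instanceId': instanceId } },
    { term: { 'slo.id': slo.id } },
    { term: { 'slo.revision': slo.revision } },
  ];
}

// Tiny in-memory imitation of term matching, used only to show which documents survive.
function matchesAll(doc: FlatDoc, filters: TermFilter[]): boolean {
  return filters.every(({ term }) =>
    Object.entries(term).every(([field, value]) => doc[field] === value)
  );
}

const filters = buildSloFilters({ id: 'my-slo', revision: 2 }, '*');
const staleDoc: FlatDoc = { 'slo.id': 'my-slo', 'slo.revision': 1, 'slo.instanceId': '*' };
const currentDoc: FlatDoc = { 'slo.id': 'my-slo', 'slo.revision': 2, 'slo.instanceId': '*' };

console.log(matchesAll(staleDoc, filters)); // data from the previous revision is excluded
console.log(matchesAll(currentDoc, filters)); // data from the current revision is kept
```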

Comment on lines 68 to 70
// For timeslice budgeting method, we always compute the burn rate based on the observed bad slices, e.g. total observed - good observed = bad slices observed,
// And we compare this to the expected slices in the whole window duration
const burnRateAgg = isTimesliceBudgetingMethod
Contributor Author


💬 The main change is here.

@dominiqueclarke dominiqueclarke self-requested a review January 7, 2025 16:28
@elasticmachine
Contributor

elasticmachine commented Jan 7, 2025

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: a460bc1
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-203279-a460bc1b7374

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #6 / discover/security/context_awareness security root profile cell renderers host.name DataView mode should open host.name flyout
  • [job] [logs] Jest Tests #1 / StepDefinePackagePolicy default API response should display vars coming from package policy

Metrics [docs]

✅ unchanged

History

cc @kdelemme

Contributor

@dominiqueclarke dominiqueclarke left a comment


LGTM! Alert did not fire immediately.

term: { 'slo.instanceId': instanceId },
},
{ term: { 'slo.id': slo.id } },
{ term: { 'slo.revision': slo.revision } },
Contributor


💯

@kdelemme kdelemme merged commit 0e13d86 into elastic:main Jan 7, 2025
8 checks passed
@kdelemme kdelemme deleted the feat/burn-rate-rule-improvement branch January 7, 2025 20:21
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/12658708629

@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kowalczyk-krzysztof pushed a commit to kowalczyk-krzysztof/kibana that referenced this pull request Jan 7, 2025
kibanamachine added a commit that referenced this pull request Jan 7, 2025
… rolled-up data (#203279) (#205806)

# Backport

This will backport the following commits from `main` to `8.x`:
- [feat(slo): Avoid false positive burn rate alerting with partial
rolled-up data (#203279)](#203279)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Kevin
Delemme","email":"[email protected]"},"sourceCommit":{"committedDate":"2025-01-07T20:21:22Z","message":"feat(slo):
Avoid false positive burn rate alerting with partial rolled-up data
(#203279)","sha":"0e13d86fc7b37c48011b9a1e601ae9f4e7d664d9","branchLabelMapping":{"^v9.0.0$":"main","^v8.18.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","v9.0.0","backport:prev-minor","ci:project-deploy-observability","Team:obs-ux-management","v8.18.0"],"title":"feat(slo):
Avoid false positive burn rate alerting with partial rolled-up
data","number":203279,"url":"https://github.com/elastic/kibana/pull/203279","mergeCommit":{"message":"feat(slo):
Avoid false positive burn rate alerting with partial rolled-up data
(#203279)","sha":"0e13d86fc7b37c48011b9a1e601ae9f4e7d664d9"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/203279","number":203279,"mergeCommit":{"message":"feat(slo):
Avoid false positive burn rate alerting with partial rolled-up data
(#203279)","sha":"0e13d86fc7b37c48011b9a1e601ae9f4e7d664d9"}},{"branch":"8.x","label":"v8.18.0","branchLabelMappingKey":"^v8.18.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Kevin Delemme <[email protected]>
crespocarlos pushed a commit to crespocarlos/kibana that referenced this pull request Jan 8, 2025
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Jan 13, 2025
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) ci:project-deploy-observability Create an Observability project release_note:skip Skip the PR/issue when compiling release notes Team:obs-ux-management Observability Management User Experience Team v8.18.0 v9.0.0
Development

Successfully merging this pull request may close these issues.

[SLO] Backfill timeslice buckets in rule executor
4 participants