-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(slo): Avoid false positive burn rate alerting with partial rolled-up data #203279
feat(slo): Avoid false positive burn rate alerting with partial rolled-up data #203279
Conversation
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management) |
const source = burnRateWindows | ||
.map((_windDef, index) => { | ||
const windowId = `${WINDOW}_${index}`; | ||
return `(params.${generateAboveThresholdKey( | ||
windowId, | ||
SHORT_WINDOW | ||
)} == 1 && params.${generateAboveThresholdKey(windowId, LONG_WINDOW)} == 1)`; | ||
}) | ||
.join(' || '); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💬 I find map().join() easier to follow than a reduce with a ternary operator
🤖 GitHub commentsExpand to view the GitHub comments
Just comment with:
|
term: { 'slo.instanceId': instanceId }, | ||
}, | ||
{ term: { 'slo.id': slo.id } }, | ||
{ term: { 'slo.revision': slo.revision } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💬 Filtering on the slo.revision won't change much in most cases, but cannot hurt when the SLO is updated and the previous data is still available when the rule runs
// For timeslice budgeting method, we always compute the burn rate based on the observed bad slices, e.g. total observed - good observed = bad slices observed, | ||
// And we compare this to the expected slices in the whole window duration | ||
const burnRateAgg = isTimesliceBudgetingMethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💬 The main change is here.
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
cc @kdelemme |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Alert did not fire immediately.
term: { 'slo.instanceId': instanceId }, | ||
}, | ||
{ term: { 'slo.id': slo.id } }, | ||
{ term: { 'slo.revision': slo.revision } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Starting backport for target branches: 8.x |
…d-up data (elastic#203279) (cherry picked from commit 0e13d86)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation |
… rolled-up data (#203279) (#205806) # Backport This will backport the following commits from `main` to `8.x`: - [feat(slo): Avoid false positive burn rate alerting with partial rolled-up data (#203279)](#203279) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Kevin Delemme","email":"[email protected]"},"sourceCommit":{"committedDate":"2025-01-07T20:21:22Z","message":"feat(slo): Avoid false positive burn rate alerting with partial rolled-up data (#203279)","sha":"0e13d86fc7b37c48011b9a1e601ae9f4e7d664d9","branchLabelMapping":{"^v9.0.0$":"main","^v8.18.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","v9.0.0","backport:prev-minor","ci:project-deploy-observability","Team:obs-ux-management","v8.18.0"],"title":"feat(slo): Avoid false positive burn rate alerting with partial rolled-up data","number":203279,"url":"https://github.com/elastic/kibana/pull/203279","mergeCommit":{"message":"feat(slo): Avoid false positive burn rate alerting with partial rolled-up data (#203279)","sha":"0e13d86fc7b37c48011b9a1e601ae9f4e7d664d9"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/203279","number":203279,"mergeCommit":{"message":"feat(slo): Avoid false positive burn rate alerting with partial rolled-up data (#203279)","sha":"0e13d86fc7b37c48011b9a1e601ae9f4e7d664d9"}},{"branch":"8.x","label":"v8.18.0","branchLabelMappingKey":"^v8.18.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}] BACKPORT--> Co-authored-by: Kevin Delemme <[email protected]>
Resolves #190143
🏇🏻 Summary
This PR makes the burn rate rule less prone to false positive alerting when the rolled-up data is not fully computed for the burn rate window or when the rolled-up data consist of intermittent data, e.g. low-traffic service's SLO.
This is achieved by using the observed bad events, e.g. observed total events - observed good events, and the total slices expected for the given window duration and SLO timeslice duration, e.g. there are 60 1min-slices in a 1h window or 30 2min-slices in a 1h window.
For example, if we have only 1 total event observed and 0 good event observed during a 1h window (using 1min slices), the new burn rate becomes
1/60/(1-objective)
instead of1/(1-objective)
. The new burn rate is 60x smaller than the previous, which would avoid triggering the alert.Note
Did some housecleaning in the burn rate rule as well: Adding slo.revision term filter, reusing function to generate aggs keys
🧬 Testing
buildQuery tests snapshots have been updated with the new expected aggs