Use new compaction interval metric in compaction failed alert. #293

cstyan · 2021-04-20T23:02:37Z

Signed-off-by: Callum Styan [email protected]

What this PR does:
Changes the compaction failed alert to use the new cortex_compactor_compaction_interval_seconds metric, so that we can alert more accurately when two compactions in a row have failed, regardless of whether the compaction interval is 2h (the value currently used within the increase function of the alert) or not.

Not sure if there should be a changelog entry here.

Checklist

CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


pstibrany

LGTM, thanks!

pracucci

I think this PR conflicts with #294. We should have already covered the case in #294 and, due to how the compactor works and how cortex_compactor_last_successful_run_timestamp_seconds is updated, it may actually take longer than 2x the compaction interval to complete a compaction run (and the compactor may be perfectly fine anyway).
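To make the concern concrete, here is a minimal Python sketch. It assumes, purely for illustration, an alert that fires when the time since the last successful run exceeds twice the configured interval. The function name, the 2x factor, and the timings are hypothetical and are not taken from the rule in this PR.

```python
def staleness_alert_fires(
    now: float,
    last_success_ts: float,
    interval_s: float,
    factor: float = 2.0,
) -> bool:
    """Hypothetical check: True when the last success is older than factor * interval."""
    return (now - last_success_ts) > factor * interval_s


# Assumed scenario: 2h interval, and a healthy run that simply takes a long time.
interval_s = 2 * 3600
last_success_ts = 0.0   # previous run completed at t=0
now = 4.5 * 3600        # the long run is still in progress, nothing has failed

# The check fires even though no compaction has failed.
print(staleness_alert_fires(now, last_success_ts, interval_s))
```

Under these assumptions the check fires at 4.5h although the compactor is healthy, which is the false positive described above.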

pstibrany · 2021-04-21T07:55:48Z

Can we go back to using increase(cortex_compactor_runs_failed_total[2h]) but with interval defined in the metric? I don't think that's possible though :(

cstyan · 2021-04-21T20:21:18Z

> Can we go back to using increase(cortex_compactor_runs_failed_total[2h]) but with interval defined in the metric? I don't think that's possible though :(

Yeah, my understanding is that we can't do this in PromQL right now. @beorn7 would be able to confirm.

beorn7 · 2021-04-21T21:19:07Z

Are you talking about using the value of a metric rather than the duration literal 2h in [2h]?

If that's the case, then the answer is: no, that's currently not possible in PromQL. I do think we should make that possible, but it's one of those "not as easy as it looks" problems. Right now, the PromQL engine can find out what time range the query will access in the TSDB, just from static analysis of the query. With allowing the value coming from another expression, you now have to evaluate that "inner query" first to find out what time range the "outer query" will need to access.

For reference: My brainstorming doc about timestamps and durations: https://docs.google.com/document/d/1jMeDsLvDfO92Qnry_JLAXalvMRzMSB1sBr9V7LolpYM/edit#heading=h.vmb7pe7hp12
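The static-analysis point can be illustrated with a toy Python sketch. It is not how the PromQL engine is implemented: it only shows that a literal duration such as `[2h]` can be read from the query text without evaluating anything. The function name is hypothetical, and the regex deliberately ignores compound durations, offsets, and subqueries.

```python
import re

# Seconds per PromQL duration unit (ms and compound durations are ignored here).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
_RANGE_RE = re.compile(r"\[(\d+)([smhdwy])\]")


def static_lookback_seconds(query: str) -> int:
    """Return the largest literal range selector in the query text, in seconds (0 if none)."""
    return max(
        (int(n) * _UNIT_SECONDS[u] for n, u in _RANGE_RE.findall(query)),
        default=0,
    )


# A literal range is known from the text alone.
print(static_lookback_seconds("increase(cortex_compactor_runs_failed_total[2h])"))
```

If the range were instead given by another expression (a hypothetical syntax, not valid PromQL at the time of this discussion), nothing in the query text would reveal the window, and the inner expression would have to be evaluated first.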

pracucci · 2021-04-22T07:08:40Z

Can you confirm that we can't query the last time (in seconds) at which a metric increased?

beorn7 · 2021-04-22T17:06:44Z

I believe there is no straightforward way to do this within PromQL. But perhaps an expert can prove me wrong here?

There is the timestamp() function to access the actual timestamp, but it only works on an instant vector. So whenever you apply it to an expression (e.g. to filter for the last sample before an increase or the first after an increase), it will only yield the evaluation timestamp, not the timestamp of any samples that fed into the expression.
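For comparison, this is what "the last time a metric increased" means when computed outside PromQL, on raw samples. It is a minimal Python sketch that assumes the samples have already been fetched and are sorted by timestamp. The function name and sample data are hypothetical, and counter resets are not handled.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp, counter value)


def last_increase_timestamp(samples: Sequence[Sample]) -> Optional[float]:
    """Return the timestamp of the most recent sample whose value exceeds its predecessor's."""
    last = None
    # Walk consecutive pairs and remember the latest sample that went up.
    for (_, prev_value), (ts, value) in zip(samples, samples[1:]):
        if value > prev_value:
            last = ts
    return last


# Made-up counter samples: the value goes from 1 to 2 at t=120.
samples = [(0.0, 1.0), (60.0, 1.0), (120.0, 2.0), (180.0, 2.0)]
print(last_increase_timestamp(samples))
```

This needs the timestamps of the individual samples, which is the information that timestamp() does not expose once it is applied to a derived expression.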

CLAassistant · 2022-06-15T17:47:47Z

All committers have signed the CLA.

Use new compaction interval metric in compaction failed alert.

403aa39

Signed-off-by: Callum Styan <[email protected]>

cstyan requested a review from a team as a code owner April 20, 2021 23:02

pstibrany approved these changes Apr 21, 2021


pracucci reviewed Apr 21, 2021
