Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new compaction interval metric in compaction failed alert. #293

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cstyan
Copy link
Contributor

@cstyan cstyan commented Apr 20, 2021

Signed-off-by: Callum Styan [email protected]

What this PR does:
Changes compaction failed alert to use new cortex_compactor_compaction_interval_seconds metric, so that we can more accurately alert when two compactions in a row have failed, regardless of whether the compaction interval is 2h (the value currently used within the increase function of the alert) or not.

Not sure if there should be a changelog entry here.

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@cstyan cstyan requested a review from a team as a code owner April 20, 2021 23:02
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR conflicts with #294. We should have already covered the case in #294 and actually ,due to how compactor works and cortex_compactor_last_successful_run_timestamp_seconds is updated, it may take longer than 2x compaction interval to complete a compaction run (and the compactor may be perfectly fine anyway).

@pstibrany
Copy link
Member

Can we go back to using increase(cortex_compactor_runs_failed_total[2h]) but with interval defined in the metric? I don't think that's possible though :(

@cstyan
Copy link
Contributor Author

cstyan commented Apr 21, 2021

Can we go back to using increase(cortex_compactor_runs_failed_total[2h]) but with interval defined in the metric? I don't think that's possible though :(

Yeah, my understanding is that we can't do this in PromQL right now. @beorn7 would be able to confirm.

@beorn7
Copy link
Contributor

beorn7 commented Apr 21, 2021

Are you talking about using the value of a metric rather than the duration literal 2h in [2h]?

If that's the case, then the answer is: no, that's currently not possible in PromQL. I do think we should make that possible, but it's one of those "not as easy as it looks" problems. Right now, the PromQL engine can find out what time range the query will access in the TSDB, just from static analysis of the query. With allowing the value coming from another expression, you now have to evaluate that "inner query" first to find out what time range the "outer query" will need to access.

For reference: My brainstorming doc about timestamps and durations: https://docs.google.com/document/d/1jMeDsLvDfO92Qnry_JLAXalvMRzMSB1sBr9V7LolpYM/edit#heading=h.vmb7pe7hp12

@pracucci
Copy link
Collaborator

pracucci commented Apr 22, 2021 via email

@beorn7
Copy link
Contributor

beorn7 commented Apr 22, 2021

I believe there is no straightforward way to do this within PromQL. But perhaps an expert can prove me wrong here?

There is the timestamp() function to access the actual timestamp, but it only works on an instant vector. So whenever you apply it to an expression (e.g. to filter for the last sample before an increase or the first after an increase), it will only yield the evaluation timestamp, not the timestamp of any samples that fed into the expression.

@CLAassistant
Copy link

CLAassistant commented Jun 15, 2022

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants