Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerts that should be inhibited fire on Alertmanager reload/restart #4064

Open
FUSAKLA opened this issue Oct 14, 2024 · 5 comments
Open

Alerts that should be inhibited fire on Alertmanager reload/restart #4064

FUSAKLA opened this issue Oct 14, 2024 · 5 comments

Comments

@FUSAKLA
Copy link
Contributor

FUSAKLA commented Oct 14, 2024

What did you do?
Have 2 alerts firing for a long time, and configured the inhibition rule in such a way that one of the alerts inhibits the other one.

What did you expect to see?
The Alertmanager does not send a notification for the inhibited alert if I restart/reload it.

What did you see instead? Under which circumstances?
The Alertmanager sent a notification for the alert, which should have been inhibited right away once it received the alert from Prometheus.

Root cause

We dug into the issue and found this if that is causing the issue.

What it does is that if the alert started to fire longer time before then the group_wait, it will send it right a way,
But for the inhibition to work, it is necessary for it to wait at least the group_wait after start before sending any alerts so it knows that no alerts that should inhibit it are firing.

Reasoning

We tried to find any reasoning for it, but the author is @fabxc 7y ago, and it was part of large changes without any particular obvious reason for this. The only meaning I can come up with is trying to avoid waiting the group_wait for an alert that might have started to fire during the Alertmanager unavailability, so it is sent ASAP.

Possible fixes

  • Change the alert.StartsAt in the if to alert.UpdatedAt: This might help, but not sure if it would fix it in 100% cases
  • Drop the if completely since it makes the inhibitor "flaky" during reloads/restarts
  • Make the inhibitor state persistent and gossiped (or somehow loaded from nflog) so before starting up the Alertmanager would know what alerts are firing for the purpose of inhibiting

(we have currently been running the fix, where the if is dropped in production and everything is working fine and the flakiness of inhibiting is gone)

I'd be happy to send a PR with a fix if we agree that it is a bug that should be fixed and what is the best way.

@FUSAKLA
Copy link
Contributor Author

FUSAKLA commented Oct 14, 2024

Personally, I'd drop the if completely. The benefit of speeding up the notification compared to the possible race conditions is not worth it.

@FUSAKLA
Copy link
Contributor Author

FUSAKLA commented Oct 14, 2024

@simonpasquier hope you don't mind if I ping you about this, you were in the past involved in the issues regarding inhibitor persistence etc

@FUSAKLA FUSAKLA changed the title Alerts that should be inhibited fires on Alertmanager reload/restart Alerts that should be inhibited fire on Alertmanager reload/restart Oct 14, 2024
@grobinson-grafana
Copy link
Contributor

Hi! 👋 Thanks for opening this issue! I agree that this if statement causes a race condition in inhibition rules:

if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(0)
}

I opened a PR in the past to fix this #3419 but it does have some disadvantages:

A potential issue with this change is that following a reload or restart of Alertmanager, alerts that were waiting for group_wait will have to wait from the beginning of group_wait again. If group_wait is large then notifications could take longer to send then expected. Frequent reloads in combination with a large group_wait could even prevent alerts from being flushed at all.

I also wrote up a blog post on how to work around this issue without needing to run a separate build with the fix https://www.grobinson.net/best-practices-for-avoiding-race-conditions-in-inhibition-rules.html.

@FUSAKLA
Copy link
Contributor Author

FUSAKLA commented Oct 15, 2024

Hi @grobinson-grafana, thanks for the info!

The blog post is great (wondering why I did not find it googling before 🤔 ). Unfortunately, those solutions are not sustainable in larger setups and makes inhibiting unusable 😖

I'm wondering why is your PR closed? Did you ever get feedback on this from the maintainers?
I do understand the concerns, might be a configuration option or leave that decision on the user? 🤔
Or, as I suggest, somehow persist state of the inhibitor over restarts.

It feels like a bug worth fixing, because without it the inhibiting is really hard to use

@grobinson-grafana
Copy link
Contributor

Unfortunately, those solutions are not sustainable in larger setups and makes inhibiting unusable 😖

It should be possible. Prometheus will let you duplicate rules if an inhibition rule is supposed to inhibit alerts across two or more rule groups.

Or, as I suggest, somehow persist state of the inhibitor over restarts.

Yes, this would work. The problem is finding someone who will offer their time to write all the code for this and test it. It's a large effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants