Create additional alerts for stuck upgrades #31

Open
simu opened this issue Sep 1, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

simu commented Sep 1, 2023

Context

We noticed that we weren't explicitly informed about upgrades that are stuck because of a missing admin ack. We can already detect that an upgrade won't succeed for this reason with cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0. However, we should write a more sophisticated query that takes the current and desired cluster versions of the running upgrade job into account, and that fires only when the job attempts a minor upgrade which can't succeed because of the missing admin ack.
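
As a starting point, the existing check could be wrapped in a Prometheus alerting rule. The following is only a minimal sketch: the expression is the one quoted above, while the group name, alert name, `for` duration, severity, and annotation text are assumptions chosen for illustration. It does not yet compare the current and desired cluster versions, so it would also fire when no minor upgrade is pending.

```yaml
# Sketch of a Prometheus alerting rule. Group name, alert name, `for`
# duration, severity, and annotation text are assumptions, not part of
# the component.
groups:
  - name: upgrade-admin-ack  # hypothetical group name
    rules:
      - alert: UpgradeBlockedByMissingAdminAck  # hypothetical alert name
        # Expression taken verbatim from the issue description.
        expr: |
          cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0
        for: 15m  # assumed; avoids firing on short-lived conditions
        labels:
          severity: warning  # assumed
        annotations:
          summary: Cluster upgrade is blocked by a missing admin ack.
```

Restricting the rule to upgrade jobs that actually attempt a minor upgrade would need an additional join against whatever metric exposes the job's current and desired versions, which is left open here.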

Additionally, we didn't realize that maintenance had been stuck on one cluster for more than 10 hours because of the missing admin ack. We should add an alert for upgrade jobs that have been running for longer than about 6 hours (exact duration TBD).

Task deliverables

  • Component configures an alert which fires if the maintenance takes too long
  • Component configures an alert which fires if the upgrade job is blocked due to a missing admin ack
simu added the enhancement label on Sep 1, 2023

bastjan commented Sep 1, 2023

> Additionally, we didn't realize that maintenance was stuck on one cluster for >10h due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than 6 hours or so (exact duration TBD).

This should already be covered by the upgrade timeouts: after a timeout, the job should show up with state Failed and reason TimedOut.

Edit: The metric is there, but we still need an alert for it.
