Create additional alerts for stuck upgrades #31

simu · 2023-09-01T07:31:51Z

Context

We noticed that we weren't explicitly alerted about upgrades which are stuck due to a missing admin ack. We can already check whether an upgrade won't succeed due to a missing admin ack with cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0, but we should write a more sophisticated query which takes into account the current and desired cluster version of the running upgrade job, and which fires when the upgrade job attempts a minor upgrade that can't succeed due to the missing admin ack.
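As a starting point, the existing check could be wrapped in a Prometheus alerting rule roughly as sketched below. The expression is the one quoted above; the group name, alert name, `for` duration, and severity are assumptions made for illustration. The sketch does not yet compare the current and desired versions of the upgrade job, since the metric that would expose them isn't named here.

```yaml
# Sketch only: group name, alert name, `for` duration, and severity are assumptions.
groups:
  - name: upgrade-admin-ack                      # hypothetical group name
    rules:
      - alert: UpgradeBlockedByMissingAdminAck   # hypothetical alert name
        # Expression copied from the issue text; fires on the existing
        # AdminAckRequired check only, without any version comparison.
        expr: |
          cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0
        for: 10m                                 # assumed debounce period
        labels:
          severity: warning                      # assumed severity
        annotations:
          summary: "Cluster is not upgradeable because an admin ack is missing."
```

Restricting the alert to minor upgrades would likely need a join against whatever metric carries the upgrade job's target version.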

Additionally, we didn't realize that maintenance was stuck on one cluster for >10h due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than 6 hours or so (exact duration TBD).

Task deliverables

Component configures an alert which fires if the maintenance takes too long
Component configures an alert which fires if the upgrade job is blocked due to a missing admin ack

bastjan · 2023-09-01T07:48:42Z

> Additionally, we didn't realize that maintenance was stuck on one cluster for >10h due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than 6 hours or so (exact duration TBD).

This should already be implemented with the upgrade timeouts. After a timeout, the job should show up with state Failed and reason TimedOut.

Edit: While the metric is there, we still need an alert.
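An alert on the timed-out state might look roughly like the sketch below. The metric name `example_upgrade_job_state` and its `state` and `reason` labels are placeholders, because the actual metric isn't named in this thread; the group name, alert name, and severity are assumptions as well.

```yaml
# Sketch only: the metric name and its labels are placeholders, not taken from the issue.
groups:
  - name: upgrade-job-timeout        # hypothetical group name
    rules:
      - alert: UpgradeJobTimedOut    # hypothetical alert name
        # example_upgrade_job_state is a placeholder; substitute the metric
        # that reports the upgrade job's state and reason.
        expr: |
          example_upgrade_job_state{state="Failed", reason="TimedOut"} == 1
        labels:
          severity: critical         # assumed severity
        annotations:
          summary: "An upgrade job exceeded its timeout and was marked Failed with reason TimedOut."
```

No `for` clause is set in this sketch, on the assumption that the timeout itself already provides the delay.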

simu added the enhancement label Sep 1, 2023
