Context
We noticed that we weren't explicitly informed about upgrades that are stuck due to a missing admin ack. We can already check whether an upgrade won't succeed due to a missing admin ack with `cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0`. However, we should write a more sophisticated query that takes into account the current and desired cluster version of the running upgrade job, and that fires when the upgrade job attempts a minor upgrade which can't succeed due to the missing admin ack.
Additionally, we didn't realize that maintenance was stuck on one cluster for >10h due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than roughly 6 hours (exact duration TBD).
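As a starting point, the existing check could be wrapped in an alerting rule along the following lines. This is only a rough sketch in the plain Prometheus rule file format: the group name, alert name, `for` duration, severity, and annotation text are all placeholders chosen for illustration, and the expression is the simple query from above, so it does not yet compare the current and desired cluster version of the running upgrade job.

```yaml
groups:
  - name: upgrade-admin-ack  # hypothetical group name
    rules:
      # Hypothetical alert name; fires while the cluster version reports
      # Upgradeable=False with reason AdminAckRequired.
      - alert: ClusterUpgradeBlockedByMissingAdminAck
        expr: |
          cluster_operator_conditions{condition="Upgradeable",endpoint="metrics",name="version",reason="AdminAckRequired"} == 0
        for: 10m  # assumed duration, to be tuned
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: Minor cluster upgrades are blocked by a missing admin ack.
```

Because this rule fires whenever the admin ack is missing, regardless of whether an upgrade job is actually attempting a minor upgrade, it would still need to be combined with information about the running upgrade job to meet the deliverable described here.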
Task deliverables
Component configures an alert which fires if the maintenance takes too long
Component configures an alert which fires if the upgrade job is blocked due to a missing admin ack
> Additionally, we didn't realize that maintenance was stuck on one cluster for >10h due to the missing admin ack. We should add an alert for upgrade jobs which have been running for longer than 6 hours or so (exact duration TBD).
This should already be covered by the upgrade timeouts. After a timeout, the job should show up with state `Failed` and reason `TimedOut`.