Current status
We validate whether the K8s upgrade has succeeded by checking that the node is schedulable, in a Ready state, and running the desired K8s version. If the validation passes, we proceed to the workload upgrades.
Problem
While the node is marked as Ready and Schedulable and has the correct version, its underlying core applications are still being recreated. At the same time, the upgrade-controller proceeds with the workload upgrades.
As a result, the workload upgrades can fail for a number of reasons, for example:
(1) A workload upgrade needs network access, which is not yet available because the core networking application is being recreated.
(2) A workload upgrade needs ingress access, which is not yet available because the core ingress controller application is being recreated.
(1) and (2) are just examples; many other clashes are possible. These failures can persist through our retry mechanism, so the controller could mark the upgrade as failed even though it actually succeeded and only took significantly longer to complete (because of the core application recreation).
Conclusion
We need to enhance the K8s upgrade validation to also validate the state of the K8s core components before marking the K8s upgrade as complete.
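A minimal sketch of what the extended validation could look like. It is not the controller's actual code: `NodeState`, `CoreComponent`, `nodeUpgraded`, `pendingCoreComponents` and `k8sUpgradeComplete` are hypothetical names, and how a core component's readiness is determined is left as an assumption.

```go
package main

// NodeState is a hypothetical, simplified view of the node fields
// that the current validation looks at.
type NodeState struct {
	Ready          bool
	Unschedulable  bool
	KubeletVersion string
}

// CoreComponent is a hypothetical summary of one core application
// (for example networking or the ingress controller).
type CoreComponent struct {
	Name  string
	Ready bool
}

// nodeUpgraded mirrors the current check: the node is Ready,
// schedulable, and reports the desired K8s version.
func nodeUpgraded(n NodeState, desiredVersion string) bool {
	return n.Ready && !n.Unschedulable && n.KubeletVersion == desiredVersion
}

// pendingCoreComponents returns the names of the core components
// that are not ready yet, e.g. because they are still being recreated.
func pendingCoreComponents(components []CoreComponent) []string {
	var pending []string
	for _, c := range components {
		if !c.Ready {
			pending = append(pending, c.Name)
		}
	}
	return pending
}

// k8sUpgradeComplete is the proposed stricter check: the node-level
// validation must pass and no core component may still be pending.
func k8sUpgradeComplete(n NodeState, desiredVersion string, components []CoreComponent) bool {
	return nodeUpgraded(n, desiredVersion) && len(pendingCoreComponents(components)) == 0
}
```

With a node that already passes the existing check but whose ingress controller is still being recreated, `nodeUpgraded` returns true while `k8sUpgradeComplete` returns false, which is the gap described in the Problem section.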
The most achievable approach here would be to wait until the upgrade Jobs of all helm-controller managed HelmCharts have completed before proceeding with the chart upgrades initiated by the upgrade-controller. This would ensure that the two sets of upgrades do not clash.