Enhance K8s upgrade validation #109

Closed
ipetrov117 opened this issue Nov 22, 2024 · 1 comment · Fixed by #116

ipetrov117 commented Nov 22, 2024

Current status

We currently validate that the K8s upgrade has succeeded by checking that the node is schedulable, in a Ready state, and running the desired K8s version. If these validations pass, we proceed to the workload upgrades.
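
For illustration, the three checks described above could be sketched roughly as follows. This is not the controller's actual code: the `Node` struct, its fields, and `nodeUpgraded` are hypothetical, simplified stand-ins for the corresponding parts of a Kubernetes Node object, and the version string is made up.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for the parts of a Kubernetes
// Node object that the current validation looks at.
type Node struct {
	Name           string
	Unschedulable  bool   // assumed to mirror spec.unschedulable
	Ready          bool   // assumed to mirror the "Ready" condition being true
	KubeletVersion string // assumed to mirror status.nodeInfo.kubeletVersion
}

// nodeUpgraded reports whether a node is schedulable, Ready, and running the
// desired K8s version. It says nothing about the core applications on the node.
func nodeUpgraded(n Node, desiredVersion string) bool {
	return !n.Unschedulable && n.Ready && n.KubeletVersion == desiredVersion
}

func main() {
	// Illustrative values only.
	n := Node{Name: "node-1", Ready: true, KubeletVersion: "v1.31.2"}
	if nodeUpgraded(n, "v1.31.2") {
		fmt.Println("node passes validation, proceeding to workload upgrades")
	}
}
```

Note that a check of this shape only inspects the node itself, so it can pass while the applications running on the node are still settling.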

Problem

While the node is marked as Ready and Schedulable and has the correct version, its underlying core applications are still being recreated. At the same time, the upgrade-controller proceeds with the workload upgrades.

This results in the workload upgrades possibly failing for a number of reasons, for example:

  1. A workload upgrade needs network access, which cannot currently be given, because the core application for networking is being recreated.
  2. A workload upgrade needs ingress access, which cannot currently be given, because the core ingress controller application is being recreated.

(1) and (2) are just examples; a number of other clashes are possible. These failures can persist through our retry mechanism and could result in the controller marking the upgrade as failed when it actually succeeded but took significantly longer to complete (because the core applications were being recreated).

Conclusion

We need to enhance the K8s upgrade validation to also validate the state of the K8s core components before marking the K8s upgrade as complete.

ipetrov117 commented Nov 26, 2024

The most achievable approach here would be to wait until all helm-controller managed HelmCharts have completed their upgrade Jobs before proceeding to the chart upgrades initiated by the upgrade-controller. This would ensure that the two sets of upgrades do not clash.
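
A minimal sketch of that gating logic, under stated assumptions: `ChartJob` and `pendingCharts` are hypothetical names, the chart and Job names are made up, and the struct is a simplified view of a HelmChart paired with the state of its Job rather than the real helm-controller types. Treating a chart with no recorded Job as still pending is also an assumption; the real behaviour may need to differ.

```go
package main

import "fmt"

// ChartJob is a hypothetical, simplified view of one helm-controller managed
// HelmChart together with the Job that helm-controller created for it.
type ChartJob struct {
	Chart     string // HelmChart name
	JobName   string // name of the associated Job; empty if none is recorded
	Completed bool   // true once the Job has finished successfully
}

// pendingCharts returns the names of the charts whose upgrade Job has not
// completed yet. An empty result means no core chart upgrade is in flight.
func pendingCharts(jobs []ChartJob) []string {
	var pending []string
	for _, j := range jobs {
		// Assumption: a chart without a recorded Job counts as pending.
		if j.JobName == "" || !j.Completed {
			pending = append(pending, j.Chart)
		}
	}
	return pending
}

func main() {
	// Illustrative values only.
	jobs := []ChartJob{
		{Chart: "networking", JobName: "networking-job", Completed: true},
		{Chart: "ingress-controller", JobName: "ingress-controller-job", Completed: false},
	}
	if pending := pendingCharts(jobs); len(pending) > 0 {
		fmt.Println("waiting for core charts:", pending)
		return
	}
	fmt.Println("core charts ready, proceeding to workload upgrades")
}
```

In a controller, a non-empty result would typically translate into requeueing the reconciliation rather than counting against the retry limit of the workload upgrade.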
