irsa-operator fails to remove finalizer #3618

Closed
nprokopic opened this issue Jul 30, 2024 · 2 comments

@nprokopic

In my e2e test run, cluster deletion is failing intermittently in upgrade tests, and I tracked it down to irsa-operator failing to remove its finalizer from the AWSCluster CR.

This is the error from irsa-operator:

2024-07-30T12:29:37Z	ERROR	Reconciler error	{"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"t-tjupwewq9lhluvusov","namespace":"org-t-rgd8t8fdvqpzy4nfad"}, "namespace": "org-t-rgd8t8fdvqpzy4nfad", "name": "t-tjupwewq9lhluvusov", "reconcileID": "4d76889d-140f-4cda-8bc9-1f519610d060", "error": "ConfigMap \"t-tjupwewq9lhluvusov-cluster-values\" not found"}

It looks like the deletion reconciliation fails because irsa-operator cannot find the (I assume already deleted) cluster values ConfigMap.

As a consequence, leftover CRs are piling up in grizzly 🙈

Does irsa-operator need the cluster values ConfigMap here? Can it just ignore the ConfigMap if it is not there?
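To illustrate the question, here is a minimal sketch of a deletion path that treats a missing ConfigMap as already cleaned up. It is a stdlib-only model, not irsa-operator's code: `configMapStore`, `awsCluster`, `reconcileDelete`, `errNotFound` and the finalizer name are all hypothetical. In a real controller the equivalent check would typically be `apierrors.IsNotFound(err)` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes "not found" API error (hypothetical).
var errNotFound = errors.New("not found")

// configMapStore is a hypothetical in-memory stand-in for the API server:
// it maps a ConfigMap name to its finalizers.
type configMapStore map[string][]string

// awsCluster is a hypothetical, stripped-down model of the AWSCluster CR.
type awsCluster struct {
	name       string
	finalizers []string
}

// without returns a copy of items with every occurrence of drop removed.
func without(items []string, drop string) []string {
	kept := make([]string, 0, len(items))
	for _, it := range items {
		if it != drop {
			kept = append(kept, it)
		}
	}
	return kept
}

// removeFinalizer drops one finalizer from the named ConfigMap, or reports
// errNotFound when the ConfigMap no longer exists.
func (s configMapStore) removeFinalizer(name, finalizer string) error {
	finalizers, ok := s[name]
	if !ok {
		return fmt.Errorf("ConfigMap %q: %w", name, errNotFound)
	}
	s[name] = without(finalizers, finalizer)
	return nil
}

// reconcileDelete treats a missing ConfigMap as already cleaned up and still
// removes the finalizer from the cluster object instead of failing.
func reconcileDelete(s configMapStore, c *awsCluster, finalizer string) error {
	err := s.removeFinalizer(c.name+"-cluster-values", finalizer)
	if err != nil && !errors.Is(err, errNotFound) {
		return err // any other error is still fatal for this reconcile
	}
	c.finalizers = without(c.finalizers, finalizer)
	return nil
}

func main() {
	store := configMapStore{} // the ConfigMap is already gone
	cluster := &awsCluster{name: "demo", finalizers: []string{"example.io/irsa"}}
	err := reconcileDelete(store, cluster, "example.io/irsa")
	fmt.Println(err, len(cluster.finalizers))
}
```

Whether skipping the ConfigMap is safe depends on what else the operator reads from it during deletion, which this sketch does not model.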

@iuriaranda

iuriaranda commented Aug 6, 2024

I think I found the problem. On cluster deletion, the IRSA operator first removes the finalizers from the cluster values ConfigMap, and then from the AWSCluster CR (here). When removing the finalizers from AWSCluster, a race condition can occur: if another operator is removing its own finalizers at the same time, the patch operation fails. There's already a retry mechanism in place for such situations (here).

The problem is (I think) that, during the patching, the k8s client library panics instead of returning an error, so the retry mechanism never kicks in and the finalizer is not removed from AWSCluster during the first deletion reconciliation loop. Subsequent reconciliations then fail because the cluster values ConfigMap is already gone.

Actually, on a second review, it's the logging library that panics, not the k8s client... The results are the same though.
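To show why a panic defeats the retry, here is a small stdlib-only sketch in which a panic inside the patch step is converted into an ordinary error so the retry loop can continue. It is an illustration under assumptions, not the fix that was applied (the thread does not say how it was fixed): `safely`, `retry` and `flakyPatch` are hypothetical, and the panic message is made up.

```go
package main

import "fmt"

// safely runs fn and turns a panic into an ordinary error value.
func safely(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	return fn()
}

// retry calls fn until it succeeds or attempts run out.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// flakyPatch is hypothetical: its first call panics, as a faulty log call
// on the conflict path might, and later calls succeed.
type flakyPatch struct{ calls int }

func (p *flakyPatch) run() error {
	p.calls++
	if p.calls == 1 {
		panic("logger: bad arguments")
	}
	return nil
}

func main() {
	p := &flakyPatch{}
	// Without safely, the panic would unwind straight through retry and the
	// second attempt would never run.
	err := retry(3, func() error { return safely(p.run) })
	fmt.Println(err, p.calls)
}
```

Recovering is a defensive measure only; correcting the call that panics is the more direct remedy.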


@iuriaranda

This should be fixed now. Let's see how it goes when the current test clusters get deleted 🤞
