[BUG] Operator forever/continuously reconciles IstioCNI helm release #430
Hey @twhite0, thanks for reporting this. If you'd like to give it a try, I suggest you add the --log-enqueue-events flag to the operator's arguments; it logs which watched object caused each reconcile to be queued.
BTW, we do have a test that checks for continuous reconciliation, but apparently it doesn't cover your case.
Thanks for the info. Looks pretty straightforward; however, building and publishing the controller may be a chore within my corporate environment. More to come on that.

The IstioCNI status is stable:

status:
  conditions:
  - lastTransitionTime: "2024-10-16T18:01:36Z"
    status: "True"
    type: Reconciled
  - lastTransitionTime: "2024-10-16T18:01:46Z"
    status: "True"
    type: Ready
  observedGeneration: 1
  state: Healthy

Looking at the actual helm release(s), they also look stable.

$ helm --kube-context=wlab2 --namespace=kube-system get all istio-cni > first.diff
$ helm --kube-context=wlab2 --namespace=kube-system get all istio-cni > second.diff
$ diff first.diff second.diff
2c2
< LAST DEPLOYED: Wed Oct 16 18:12:25 2024
---
> LAST DEPLOYED: Wed Oct 16 18:12:32 2024
4,5c4,5
< STATUS: pending-upgrade
< REVISION: 1496
---
> STATUS: deployed
> REVISION: 1512
The CNI DaemonSet also looks stable.

$ oc --context=wlab2 --namespace=kube-system get ds istio-cni-node -oyaml > first.diff
$ oc --context=wlab2 --namespace=kube-system get ds istio-cni-node -oyaml > second.diff
$ diff first.diff second.diff
$
Thanks @luksa. Should I expect it to be in tomorrow's nightly?
Yup.
Tested again; however, I don't believe the changes have made their way to the registry. Based on #420, there's either something not working in the back-office sync process or there's a decent delay. Enclosed is a snippet from my deployment (current nightly, added arg, still using /manager):

- args:
  - --health-probe-bind-address=:8081
  - --metrics-bind-address=127.0.0.1:8080
  - --zap-log-level=info
  - --log-enqueue-events
  - --default-profile=openshift
  command:
  - /manager
  image: [internal proxy]/maistra-dev/sail-operator:0.2-nightly-2024-10-18
Incremental update: I added the enqueue logging, and it suggests the ServiceAccount is the culprit?

2024-10-22T11:32:29Z INFO ctrlr.istiocni Installing Helm chart {"IstioCNI": "default", "reconcileID": "de75df19-e6c1-4d3e-b30e-26d299fa6444"}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Reconciliation done. Updating status. {"IstioCNI": "default", "reconcileID": "de75df19-e6c1-4d3e-b30e-26d299fa6444"}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Installing Helm chart {"IstioCNI": "default", "reconcileID": "37595911-faa1-4bdf-ab30-bdd438d88829"}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Reconciliation done. Updating status. {"IstioCNI": "default", "reconcileID": "37595911-faa1-4bdf-ab30-bdd438d88829"}
2024-10-22T11:32:30Z INFO ctrlr.istiocni Installing Helm chart {"IstioCNI": "default", "reconcileID": "2045189f-4573-4fee-bbbb-d59ba16ca76f"}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Reconciliation done. Updating status. {"IstioCNI": "default", "reconcileID": "2045189f-4573-4fee-bbbb-d59ba16ca76f"}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Installing Helm chart {"IstioCNI": "default", "reconcileID": "aa1bcab0-f237-4147-9724-826b8040462b"}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Object queued for reconciliation due to event {"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z INFO ctrlr.istiocni Reconciliation done. Updating status. {"IstioCNI": "default", "reconcileID": "aa1bcab0-f237-4147-9724-826b8040462b"}

The SA resourceVersion is changing:

$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml >> first.yaml
$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml >> second.yaml
$
$ diff first.yaml second.yaml
32c32
< resourceVersion: "284347678"
---
> resourceVersion: "284347820" Added a similar SA to kube-system to ensure there wasn't something that may be removing/changing it. Nothing unusual. |
@twhite0 can you watch the ServiceAccount and capture every version it goes through, rather than two point-in-time snapshots? Basically, this is what's happening:
1. The operator applies the ServiceAccount with only the pull secrets from the chart.
2. Another controller notices its secret is missing and adds it back.
3. The operator sees that update, reconciles, and applies the ServiceAccount again, removing the secret.
4. Back to step 2.
You're likely capturing just the state at 1 and 3. That's why there's no difference apart from the resourceVersion.
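For illustration, the same "capture every version" idea expressed with client-go would look roughly like this; a sketch only, assuming the kube-system/istio-cni names from this thread and a default kubeconfig:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Watch the istio-cni ServiceAccount and print every version the API server
	// sends, so no intermediate state between two manual snapshots is missed.
	w, err := cs.CoreV1().ServiceAccounts("kube-system").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=istio-cni",
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		sa, ok := ev.Object.(*corev1.ServiceAccount)
		if !ok {
			continue
		}
		fmt.Printf("%s resourceVersion=%s imagePullSecrets=%v\n", ev.Type, sa.ResourceVersion, sa.ImagePullSecrets)
	}
}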
@luksa: excellent advice. Looks like an SA token/secret is being added as a pull secret. The secret value is consistent across updates.

$ yq -s '"file_" + $index' --no-doc watch.yaml
$ diff file_0.yml file_1.yml
31c31
< resourceVersion: "284514362"
---
> resourceVersion: "284514373"
$ diff file_1.yml file_2.yml
3a4
> - name: istio-cni-dockercfg-ktvgs
31c32
< resourceVersion: "284514373"
---
> resourceVersion: "284514374"
$ diff file_2.yml file_3.yml
4d3
< - name: istio-cni-dockercfg-ktvgs
32c31
< resourceVersion: "284514374"
---
> resourceVersion: "284514387"
$ diff file_3.yml file_4.yml
3a4
> - name: istio-cni-dockercfg-ktvgs
31c32
< resourceVersion: "284514387"
---
> resourceVersion: "284514388"
$ diff file_4.yml file_5.yml
4d3
< - name: istio-cni-dockercfg-ktvgs
32c31
< resourceVersion: "284514388"
---
> resourceVersion: "284514395" file_1.ymlapiVersion: v1
imagePullSecrets:
- name: my-secret-twhite0
kind: ServiceAccount
metadata:

file_2.yml:
apiVersion: v1
imagePullSecrets:
- name: my-secret-twhite0
- name: istio-cni-dockercfg-ktvgs
kind: ServiceAccount
metadata:

istio-cni.yaml:
apiVersion: sailoperator.io/v1alpha1
kind: IstioCNI
metadata:
  name: default
spec:
  version: latest
  namespace: kube-system
  values:
    cni:
      # custom stuff
    global:
      imagePullSecrets:
      - my-secret-twhite0
      # other custom stuff
Yeah, we definitely need to change the operator's behavior here. I'll try to come up with a solution. Here's a longer explanation of why this is happening: when an external actor modifies the resources deployed by the operator, the operator always re-renders the resource. If the modification was made to a field that isn't specified in the Helm chart that the operator deploys, then the operator's update operation will be a no-op and will not trigger another reconcile. However, as we see here, the Helm chart does define imagePullSecrets on the ServiceAccount (because of the pull secret in your IstioCNI values), so the operator's update removes the entry the other controller added, that controller adds it back, and the cycle repeats.
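For illustration only, one direction for a fix would be to stop fighting the other controller and keep whatever it injects. A minimal Go sketch of such a helper (the helper and its names are hypothetical, not taken from the Sail operator's code):

package istiocni

import (
	corev1 "k8s.io/api/core/v1"
)

// mergeInjectedPullSecrets copies pull-secret entries that exist on the live
// ServiceAccount but not in the freshly rendered one into the rendered object,
// so that applying it no longer strips entries injected by other controllers
// (e.g. the OpenShift dockercfg secret).
func mergeInjectedPullSecrets(rendered, live *corev1.ServiceAccount) {
	existing := make(map[string]bool, len(rendered.ImagePullSecrets))
	for _, s := range rendered.ImagePullSecrets {
		existing[s.Name] = true
	}
	for _, s := range live.ImagePullSecrets {
		if !existing[s.Name] {
			rendered.ImagePullSecrets = append(rendered.ImagePullSecrets, s)
		}
	}
}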
And here's why the ServiceAccount's imagePullSecrets list behaves this way:
(from https://github.com/kubernetes/api/blob/master/core/v1/types.go#L5710)
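For reference, the field in question is declared roughly as follows; this is a paraphrase of k8s.io/api/core/v1, not a verbatim copy of the linked lines. The key point is the absence of any patchStrategy/patchMergeKey tag, so an update replaces the list as a whole:

package excerpt

// Paraphrased from the ServiceAccount type in k8s.io/api/core/v1/types.go.
type LocalObjectReference struct {
	Name string `json:"name,omitempty"`
}

type ServiceAccount struct {
	// ImagePullSecrets is a list of references to secrets in the same namespace
	// to use for pulling any images in pods that reference this ServiceAccount.
	// No merge strategy is declared, so the list is treated atomically.
	ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty"`
}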
Interestingly, the corresponding imagePullSecrets list on the Pod spec is declared differently:
(from https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3855-L3858)
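The Pod-spec counterpart, again paraphrased rather than quoted verbatim, does carry merge tags, so strategic-merge patches add or update individual entries by name instead of replacing the whole list:

package excerpt

// Paraphrased from the PodSpec type in k8s.io/api/core/v1/types.go.
type LocalObjectReference struct {
	Name string `json:"name,omitempty"`
}

type PodSpec struct {
	// Declared with a merge patch strategy keyed by name.
	ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
}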
Apparently this is a known Kubernetes issue that can only be fixed by introducing a new API version (this means we have to fix this in the operator).
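Another conceivable operator-side mitigation, sketched here purely for illustration under the assumption that the operator is built on controller-runtime (it is not necessarily what will be implemented), is to stop requeuing on ServiceAccount updates that only touch externally managed fields:

package predicates

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// IgnoreInjectedPullSecrets drops ServiceAccount update events whose only
// difference lies in fields other controllers legitimately manage
// (imagePullSecrets, secrets) plus metadata that changes on every write.
// Hypothetical sketch, not the Sail operator's actual code.
func IgnoreInjectedPullSecrets() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldSA, okOld := e.ObjectOld.(*corev1.ServiceAccount)
			newSA, okNew := e.ObjectNew.(*corev1.ServiceAccount)
			if !okOld || !okNew {
				return true // not a ServiceAccount; let the event through
			}
			oldCopy, newCopy := oldSA.DeepCopy(), newSA.DeepCopy()
			for _, sa := range []*corev1.ServiceAccount{oldCopy, newCopy} {
				sa.ImagePullSecrets = nil
				sa.Secrets = nil
				sa.ResourceVersion = ""
				sa.ManagedFields = nil
			}
			// Requeue only if something the operator actually renders changed.
			return !equality.Semantic.DeepEqual(oldCopy, newCopy)
		},
	}
}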
@luksa: I'm half following 😊. Do you think this is caused by adding a pull secret to the IstioCNI CR, or by the extra secret being added to the SA's pull secrets? More to the point, is there anything additional I should try?
@twhite0 unfortunately, there's nothing else you can do. This is caused by another controller (not one that's in the Sail operator) adding an additional secret to (every?) ServiceAccount. You can see the controller's name in the ServiceAccount's metadata.managedFields.
Hmm. Did we establish what adds the istio-cni-dockercfg secret?
External system (or OpenShift itself). I guess this is doing something similar to https://github.com/knrc/registry-puller. It's definitely something we need to handle; we can't work around it.
@luksa: I could use some advice/guidance on my path forward. I can see how we could use the Istio CNI Helm charts to deploy directly, but I'm missing your thoughts on the Sail operator's plans here (wait for an Istio upgrade, or work around it in the operator).
…external controllers
Addresses istio-ecosystem#430
Signed-off-by: Marko Lukša <[email protected]>
@twhite0 I've implemented a quick hack to prevent the continuous reconciliation. I'll see if I can implement the proper fix next week.
Thanks @luksa.
…external controllers (#469)
Addresses #430
Signed-off-by: Marko Lukša <[email protected]>
Is this the right place to submit this?
Bug Description
OpenShift Sail operator install with an existing Istio installed via Helm. I uninstalled the Istio CNI I had installed via Helm and applied the IstioCNI CR. The operator detected it and installed a Helm chart in our target namespace (kube-system). The operator then went into what looked like a continuous loop of installing and reconciling with an ever-changing reconcileID. The installed Helm chart revision kept incrementing until I deleted the CR. The DaemonSet looked fine, with Pods running.
Thanks in advance for ideas on what I might have done or could do to troubleshoot.
Operator Version
2.0-latest
Link to Gist with Logs
No response
Additional Information