[BUG] Operator forever/continuously reconciles IstioCNI helm release #430

Open

twhite0 opened this issue Oct 16, 2024 · 22 comments
Labels
bug Something isn't working

Comments


twhite0 commented Oct 16, 2024

Is this the right place to submit this?

  • This is not a question about how to use the sail-operator

Bug Description

OpenShift sail-operator install with an existing Istio installed via Helm. I uninstalled the Istio CNI I had installed via Helm and applied the IstioCNI CR. The operator detected it and installed a Helm chart in our target namespace (kube-system). The operator then went into what looked like a continuous loop of installing and reconciling, with an ever-changing reconciliationID. The installed Helm chart revision kept incrementing until I deleted the CR. The DaemonSet looked fine, with Pods running.

Thanks in advance for ideas on what I might have done or could do to troubleshoot.

Operator Version

2.0-latest

Link to Gist with Logs

No response

Additional Information

2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Added finalizer	{"IstioCNI": "default", "reconcileID": "f7d27882-4faf-4b77-8bde-ecb42fd99bf7", "finalizer": "sailoperator.io/sail-operator"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "f0dc0819-f00f-4be1-86ce-d2a7c2a766f0"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "f0dc0819-f00f-4be1-86ce-d2a7c2a766f0"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "00968efc-03fb-4dd8-aa7e-3c6d354fddea"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "00968efc-03fb-4dd8-aa7e-3c6d354fddea"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "39012a4f-15c1-490b-a197-413346236ac6"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "39012a4f-15c1-490b-a197-413346236ac6"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "2f531033-a5ec-4ef7-9c61-17022148d8b9"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "2f531033-a5ec-4ef7-9c61-17022148d8b9"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "d297f203-9daf-4154-880c-805544c057c7"}
2024-10-16T00:17:08Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "d297f203-9daf-4154-880c-805544c057c7"}
twhite0 added the bug label on Oct 16, 2024

luksa commented Oct 16, 2024

Hey @twhite0 thanks for reporting this. If you'd like to give it a try, I suggest you add enqueuelogger.WrapIfNecessary to the IstioCNI controller (just follow the example in the IstioRevision controller). Then enable this logger by running the operator with --log-enqueue-events and you should be able to see what is causing the reconciliation to be triggered.
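
A rough sketch of what that wiring might look like is below; the import path and the signature of WrapIfNecessary are assumptions, so treat the IstioRevision controller in the sail-operator repo as the canonical example.

package istiocni

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	// assumed import path for the enqueue-logger package
	"github.com/istio-ecosystem/sail-operator/pkg/enqueuelogger"
)

// wrapEventHandler decorates an event handler so that, when the operator runs
// with --log-enqueue-events, every event that enqueues an IstioCNI reconcile
// is logged together with the object that triggered it.
func wrapEventHandler(logger logr.Logger, h handler.EventHandler) handler.EventHandler {
	// assumed signature: WrapIfNecessary(kind string, logger logr.Logger, h handler.EventHandler)
	return enqueuelogger.WrapIfNecessary("IstioCNI", logger, h)
}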


luksa commented Oct 16, 2024

BTW we do have a test that checks for continuous reconciliation, but apparently it doesn't match your case.


twhite0 commented Oct 16, 2024

@luksa:

Thanks for the info. It looks pretty straightforward; however, building and publishing the controller may be a chore within my corporate environment. More to come on that.

IstioCNI status is stable:

status:
  conditions:
  - lastTransitionTime: "2024-10-16T18:01:36Z"
    status: "True"
    type: Reconciled
  - lastTransitionTime: "2024-10-16T18:01:46Z"
    status: "True"
    type: Ready
  observedGeneration: 1
  state: Healthy

Looking at the actual Helm release(s), they also look stable.

$ helm  --kube-context=wlab2 --namespace=kube-system get all istio-cni > first.diff
$ helm  --kube-context=wlab2 --namespace=kube-system get all istio-cni > second.diff
$ diff first.diff second.diff
2c2
< LAST DEPLOYED: Wed Oct 16 18:12:25 2024
---
> LAST DEPLOYED: Wed Oct 16 18:12:32 2024
4,5c4,5
< STATUS: pending-upgrade
< REVISION: 1496
---
> STATUS: deployed
> REVISION: 1512

The CNI DaemonSet also looks stable.

$ oc --context=wlab2 --namespace=kube-system get ds istio-cni-node -oyaml  > first.diff
$ oc --context=wlab2 --namespace=kube-system get ds istio-cni-node -oyaml  > second.diff
$ diff first.diff second.diff
$


luksa commented Oct 17, 2024

@twhite0 I've extended the logging of enqueue events to the IstioCNI controller (see this). When this is merged and the operator is built, you should be able to figure out what's causing the continuous reconciliation.


twhite0 commented Oct 17, 2024

Thanks @luksa. Should I anticipate it being in tomorrow's nightly?


luksa commented Oct 17, 2024

Yup.


twhite0 commented Oct 18, 2024

@luksa:

Tested again; however, I don't believe the changes have made their way to the registry. Based on #420, there's either something not working in the back-office sync process or there's a decent delay. Enclosed is a snippet from my deployment (current nightly, added arg, still using /manager):

        - args:
          - --health-probe-bind-address=:8081
          - --metrics-bind-address=127.0.0.1:8080
          - --zap-log-level=info
          - --log-enqueue-events
          - --default-profile=openshift
          command:
          - /manager
          image: [internal proxy]/maistra-dev/sail-operator:0.2-nightly-2024-10-18


twhite0 commented Oct 22, 2024

Incremental update.

I added the enqueue logging, which suggests the ServiceAccount is the culprit?

2024-10-22T11:32:29Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "de75df19-e6c1-4d3e-b30e-26d299fa6444"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "de75df19-e6c1-4d3e-b30e-26d299fa6444"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "37595911-faa1-4bdf-ab30-bdd438d88829"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "37595911-faa1-4bdf-ab30-bdd438d88829"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "2045189f-4573-4fee-bbbb-d59ba16ca76f"}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "2045189f-4573-4fee-bbbb-d59ba16ca76f"}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "aa1bcab0-f237-4147-9724-826b8040462b"}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "aa1bcab0-f237-4147-9724-826b8040462b"}

The SA resourceVersion is changing.

$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml >> first.yaml
$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml >> second.yaml
$ 
$ diff first.yaml second.yaml 
32c32
<   resourceVersion: "284347678"
---
>   resourceVersion: "284347820"

I added a similar SA to kube-system to check whether something might be removing/changing it. Nothing unusual.


luksa commented Oct 22, 2024

@twhite0 can you do oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml -w and then inspect what the differences are? If you do it just by capturing two points in time, you're most likely missing the state in between (the change that actually caused the reconciliation to be triggered).

Basically, this is what's happening:

  1. operator reconciles
  2. something else modifies the SA
  3. operator reconciles again (it reverts the changes made in 2.)

You're likely capturing just the state at 1 and 3. That's why there's no difference apart from the resourceVersion.
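
One way to capture every intermediate state is to pipe the watch output into a file and split it afterwards (the file name here is arbitrary; a yq-based split is shown in the next comment):

$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml -w | tee watch.yaml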


twhite0 commented Oct 22, 2024

@luksa: excellent advice. It looks like an SA token/secret is being added as a pull secret; the secret name stays consistent across updates.

$ yq -s '"file_" + $index' --no-doc watch.yaml 
$ diff file_0.yml file_1.yml
31c31
<   resourceVersion: "284514362"
---
>   resourceVersion: "284514373"
$ diff file_1.yml file_2.yml
3a4
>   - name: istio-cni-dockercfg-ktvgs
31c32
<   resourceVersion: "284514373"
---
>   resourceVersion: "284514374"
$ diff file_2.yml file_3.yml
4d3
<   - name: istio-cni-dockercfg-ktvgs
32c31
<   resourceVersion: "284514374"
---
>   resourceVersion: "284514387"
$ diff file_3.yml file_4.yml
3a4
>   - name: istio-cni-dockercfg-ktvgs
31c32
<   resourceVersion: "284514387"
---
>   resourceVersion: "284514388"
$ diff file_4.yml file_5.yml
4d3
<   - name: istio-cni-dockercfg-ktvgs
32c31
<   resourceVersion: "284514388"
---
>   resourceVersion: "284514395"

file_1.yml

apiVersion: v1
imagePullSecrets:
  - name: my-secret-twhite0
kind: ServiceAccount
metadata:

file_2.yml

apiVersion: v1
imagePullSecrets:
  - name: my-secret-twhite0
  - name: istio-cni-dockercfg-ktvgs
kind: ServiceAccount
metadata:

istio-cni.yaml

apiVersion: sailoperator.io/v1alpha1
kind: IstioCNI
metadata:
  name: default
spec:
  version: latest
  namespace: kube-system
  values:
    cni:
      # custom stuff
    global:
      imagePullSecrets: 
      - my-secret-twhite0
      # other custom stuff


luksa commented Oct 22, 2024

Yeah, we definitely need to change the operator's behavior here. I'll try to come up with a solution.

Here's a longer explanation of why this is happening:

When an external actor modifies a resource deployed by the operator, the operator always re-renders the resource. If the modification was made to a field that isn't specified in the Helm chart that the operator deploys, the operator's update operation is a no-op and doesn't trigger another reconcile. However, as we see here, the Helm chart does define imagePullSecrets for the istio-cni ServiceAccount, so every time the additional pull secret is added, the operator's update overwrites the field and removes it again. That update triggers the other controller to add the secret once more, which in turn triggers the operator. This is why we get the continuous reconcile loop.
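
One conceivable mitigation, sketched below, would be to preserve pull secrets that external controllers have already added to the live ServiceAccount when the operator re-applies the chart-rendered one, so the update becomes a no-op. This is only an illustration of the idea, not necessarily what the operator (or the eventual fix) actually implements:

package helm

import corev1 "k8s.io/api/core/v1"

// mergeExternalPullSecrets copies pull secrets that exist on the live
// ServiceAccount but not in the rendered manifest (e.g. the OpenShift
// dockercfg secret) into the rendered object, so applying it no longer
// strips them and no new update event is generated.
func mergeExternalPullSecrets(rendered, live *corev1.ServiceAccount) {
	seen := map[string]bool{}
	for _, s := range rendered.ImagePullSecrets {
		seen[s.Name] = true
	}
	for _, s := range live.ImagePullSecrets {
		if !seen[s.Name] {
			rendered.ImagePullSecrets = append(rendered.ImagePullSecrets, s)
		}
	}
}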


luksa commented Oct 22, 2024

And here's why the imagePullSecrets field gets overridden instead of merged:

// +optional
// +listType=atomic    <------------ THIS
ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" protobuf:"bytes,3,rep,name=imagePullSecrets"`

(from https://github.com/kubernetes/api/blob/master/core/v1/types.go#L5710)


luksa commented Oct 22, 2024

Interestingly, the imagePullSecrets field in Pod.spec uses a different merge strategy:

// +patchMergeKey=name
// +patchStrategy=merge
// +listType=map
// +listMapKey=name
ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,15,rep,name=imagePullSecrets"`

(from https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3855-L3858)


luksa commented Oct 22, 2024

Apparently this is a known Kubernetes issue that can only be fixed by introducing a new v2 API version of ServiceAccount, which isn't going to happen any time soon. See kubernetes/kubernetes#72475

(this means we have to fix this in the operator)


twhite0 commented Oct 22, 2024

@luksa: I'm half following 😊. Do you think the episode is caused by adding a pull secret to the IstioCNI CR, or by the addition of the dockercfg secret to the SA's pull secrets? More to the point, is there something additional I should try?


luksa commented Oct 22, 2024

@twhite0 unfortunately, there's nothing else you can do.

This is caused by another controller (not one that's in the Sail operator) adding an additional secret to (every?) ServiceAccount. You can see the controller name by running oc get sa -w -oyaml --show-managed-fields and then looking up who owns the imagePullSecrets field in .metadata.managedFields (you have to look at the version of the SA just after the secret is added).
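
For reference, the managedFields entry to look for has roughly this shape; the manager name below is hypothetical, and whichever manager lists f:imagePullSecrets in its fieldsV1 is the controller adding the secret:

metadata:
  managedFields:
  - manager: openshift-controller-manager   # hypothetical value - check your own output
    operation: Update
    apiVersion: v1
    fieldsV1:
      f:imagePullSecrets: {}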


dgn commented Oct 28, 2024

Hmm. did we establish what adds the imagePullSecret to the SA? If the name of the added secret is stable, maybe simply adding it to the IstioCNI resource is a workaround?


luksa commented Oct 28, 2024

An external system (or OpenShift itself). I guess this is doing a similar thing to https://github.com/knrc/registry-puller.

It's definitely something we need to handle. We can't work around it.


twhite0 commented Oct 29, 2024

Hmm. did we establish what adds the imagePullSecret to the SA?

OpenShift ServiceAccounts

Each service account automatically contains two secrets:

  • An API token
  • Credentials for the OpenShift Container Registry


twhite0 commented Nov 5, 2024

@luksa: I could use some advice/guidance on my path forward. I can see how we could deploy using the Istio CNI Helm charts directly, but I'm missing your thoughts on the Sail operator's direction here (await an Istio change, or a workaround in the operator).


luksa commented Nov 6, 2024

@twhite0 I've implemented a quick hack to prevent the continuous reconciliation. I'll see if I can implement the proper fix next week.


twhite0 commented Nov 6, 2024

Thanks @luksa.

istio-testing pushed a commit that referenced this issue Nov 7, 2024
…external controllers (#469)

Addresses #430

Signed-off-by: Marko Lukša <[email protected]>