[BUG] Operator forever/continuously reconciles IstioCNI helm release #430

Open

twhite0 opened this issue Oct 16, 2024 · 22 comments
Labels
bug Something isn't working

Comments


twhite0 commented Oct 16, 2024

Is this the right place to submit this?

  • This is not a question about how to use the sail-operator

Bug Description

OpenShift sail-operator install with an existing Istio installed via Helm. I uninstalled the Istio CNI I had installed via Helm and applied the IstioCNI CR. The operator detected it and installed a Helm chart in our target namespace (kube-system). The operator then went into what looked like a continuous loop of installing and reconciling, with an ever-changing reconciliationID. The installed Helm chart revision kept incrementing until I deleted the CR. The DaemonSet looked fine, with Pods running.

Thanks in advance for ideas on what I might have done or could do to troubleshoot.

Operator Version

2.0-latest

Link to Gist with Logs

No response

Additional Information

2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Added finalizer	{"IstioCNI": "default", "reconcileID": "f7d27882-4faf-4b77-8bde-ecb42fd99bf7", "finalizer": "sailoperator.io/sail-operator"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "f0dc0819-f00f-4be1-86ce-d2a7c2a766f0"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "f0dc0819-f00f-4be1-86ce-d2a7c2a766f0"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "00968efc-03fb-4dd8-aa7e-3c6d354fddea"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "00968efc-03fb-4dd8-aa7e-3c6d354fddea"}
2024-10-16T00:17:06Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "39012a4f-15c1-490b-a197-413346236ac6"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "39012a4f-15c1-490b-a197-413346236ac6"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "2f531033-a5ec-4ef7-9c61-17022148d8b9"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "2f531033-a5ec-4ef7-9c61-17022148d8b9"}
2024-10-16T00:17:07Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "d297f203-9daf-4154-880c-805544c057c7"}
2024-10-16T00:17:08Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "d297f203-9daf-4154-880c-805544c057c7"}
twhite0 added the bug label on Oct 16, 2024

luksa commented Oct 16, 2024

Hey @twhite0 thanks for reporting this. If you'd like to give it a try, I suggest you add enqueuelogger.WrapIfNecessary to the IstioCNI controller (just follow the example in the IstioRevision controller). Then enable this logger by running the operator with --log-enqueue-events and you should be able to see what is causing the reconciliation to be triggered.
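
A rough sketch of what that wiring might look like is below; the import path and the signature of WrapIfNecessary are assumptions, so treat the IstioRevision controller in the sail-operator repo as the canonical example.

package istiocni

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	// assumed import path for the enqueue-logger package
	"github.com/istio-ecosystem/sail-operator/pkg/enqueuelogger"
)

// wrapEventHandler decorates an event handler so that, when the operator runs
// with --log-enqueue-events, every event that enqueues an IstioCNI reconcile
// is logged together with the object that triggered it.
func wrapEventHandler(logger logr.Logger, h handler.EventHandler) handler.EventHandler {
	// assumed signature: WrapIfNecessary(kind string, logger logr.Logger, h handler.EventHandler)
	return enqueuelogger.WrapIfNecessary("IstioCNI", logger, h)
}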


luksa commented Oct 16, 2024

BTW we do have a test that checks for continuous reconciliation, but apparently it doesn't match your case.


twhite0 commented Oct 16, 2024

@luksa:

Thanks for the info. It looks pretty straightforward; however, building and publishing the controller may be a chore within my corporate environment. More to come on that.

IstioCNI status is stable:

status:
  conditions:
  - lastTransitionTime: "2024-10-16T18:01:36Z"
    status: "True"
    type: Reconciled
  - lastTransitionTime: "2024-10-16T18:01:46Z"
    status: "True"
    type: Ready
  observedGeneration: 1
  state: Healthy

Looking at the actual Helm release(s), they also look stable.

$ helm  --kube-context=wlab2 --namespace=kube-system get all istio-cni > first.diff
$ helm  --kube-context=wlab2 --namespace=kube-system get all istio-cni > second.diff
$ diff first.diff second.diff
2c2
< LAST DEPLOYED: Wed Oct 16 18:12:25 2024
---
> LAST DEPLOYED: Wed Oct 16 18:12:32 2024
4,5c4,5
< STATUS: pending-upgrade
< REVISION: 1496
---
> STATUS: deployed
> REVISION: 1512

The CNI DaemonSet also looks stable.

$ oc --context=wlab2 --namespace=kube-system get ds istio-cni-node -oyaml  > first.diff
$ oc --context=wlab2 --namespace=kube-system get ds istio-cni-node -oyaml  > second.diff
$ diff first.diff second.diff
$


luksa commented Oct 17, 2024

@twhite0 I've extended the logging of enqueue events to the IstioCNI controller (see this). When this is merged and the operator is built, you should be able to figure out what's causing the continuous reconciliation.


twhite0 commented Oct 17, 2024

Thanks @luksa. Should I anticipate it being in tomorrow's nightly?


luksa commented Oct 17, 2024

Yup.


twhite0 commented Oct 18, 2024

@luksa:

Tested again; however, I don't believe the changes have made their way to the registry. Based on #420, there's either something not working in the back-office sync process or there's a decent delay. Enclosed is a snippet from my deployment (current nightly, added arg, still using /manager):

        - args:
          - --health-probe-bind-address=:8081
          - --metrics-bind-address=127.0.0.1:8080
          - --zap-log-level=info
          - --log-enqueue-events
          - --default-profile=openshift
          command:
          - /manager
          image: [internal proxy]/maistra-dev/sail-operator:0.2-nightly-2024-10-18


twhite0 commented Oct 22, 2024

Incremental update.

I added the enqueue logging, which suggests the ServiceAccount is the culprit?

2024-10-22T11:32:29Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "de75df19-e6c1-4d3e-b30e-26d299fa6444"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "de75df19-e6c1-4d3e-b30e-26d299fa6444"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "37595911-faa1-4bdf-ab30-bdd438d88829"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "37595911-faa1-4bdf-ab30-bdd438d88829"}
2024-10-22T11:32:30Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "2045189f-4573-4fee-bbbb-d59ba16ca76f"}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "2045189f-4573-4fee-bbbb-d59ba16ca76f"}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Installing Helm chart	{"IstioCNI": "default", "reconcileID": "aa1bcab0-f237-4147-9724-826b8040462b"}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Object queued for reconciliation due to event	{"object": {"kind":"IstioCNI","name":"default"}, "event": {"type":"Update","object":{"kind":"ServiceAccount","namespace":"kube-system","name":"istio-cni"}}}
2024-10-22T11:32:31Z	INFO	ctrlr.istiocni	Reconciliation done. Updating status.	{"IstioCNI": "default", "reconcileID": "aa1bcab0-f237-4147-9724-826b8040462b"}

The SA resourceVersion is changing.

$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml >> first.yaml
$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml >> second.yaml
$ 
$ diff first.yaml second.yaml 
32c32
<   resourceVersion: "284347678"
---
>   resourceVersion: "284347820"

I added a similar SA to kube-system to check whether something might be removing/changing it. Nothing unusual.


luksa commented Oct 22, 2024

@twhite0 can you do oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml -w and then inspect what the differences are? If you do it just by capturing two points in time, you're most likely missing the state in between (the change that actually caused the reconciliation to be triggered).

Basically, this is what's happening:

  1. operator reconciles
  2. something else modifies the SA
  3. operator reconciles again (it reverts the changes made in 2.)

You're likely capturing just the state at 1 and 3. That's why there's no difference apart from the resourceVersion.
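
One way to capture every intermediate state is to pipe the watch output into a file and split it afterwards (the file name here is arbitrary; a yq-based split is shown in the next comment):

$ oc --context=wlab2 --namespace=kube-system get sa istio-cni -oyaml -w | tee watch.yaml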


twhite0 commented Oct 22, 2024

@luksa: excellent advice. It looks like an SA token/secret is being added as a pull secret; the secret name stays consistent across updates.

$ yq -s '"file_" + $index' --no-doc watch.yaml 
$ diff file_0.yml file_1.yml
31c31
<   resourceVersion: "284514362"
---
>   resourceVersion: "284514373"
$ diff file_1.yml file_2.yml
3a4
>   - name: istio-cni-dockercfg-ktvgs
31c32
<   resourceVersion: "284514373"
---
>   resourceVersion: "284514374"
$ diff file_2.yml file_3.yml
4d3
<   - name: istio-cni-dockercfg-ktvgs
32c31
<   resourceVersion: "284514374"
---
>   resourceVersion: "284514387"
$ diff file_3.yml file_4.yml
3a4
>   - name: istio-cni-dockercfg-ktvgs
31c32
<   resourceVersion: "284514387"
---
>   resourceVersion: "284514388"
$ diff file_4.yml file_5.yml
4d3
<   - name: istio-cni-dockercfg-ktvgs
32c31
<   resourceVersion: "284514388"
---
>   resourceVersion: "284514395"

file_1.yml

apiVersion: v1
imagePullSecrets:
  - name: my-secret-twhite0
kind: ServiceAccount
metadata:

file_2.yml

apiVersion: v1
imagePullSecrets:
  - name: my-secret-twhite0
  - name: istio-cni-dockercfg-ktvgs
kind: ServiceAccount
metadata:

istio-cni.yaml

apiVersion: sailoperator.io/v1alpha1
kind: IstioCNI
metadata:
  name: default
spec:
  version: latest
  namespace: kube-system
  values:
    cni:
      # custom stuff
    global:
      imagePullSecrets: 
      - my-secret-twhite0
      # other custom stuff


luksa commented Oct 22, 2024

Yeah, we definitely need to change the operator's behavior here. I'll try to come up with a solution.

Here's a longer explanation of why this is happening:

When an external actor modifies a resource deployed by the operator, the operator always re-renders the resource. If the modification was made to a field that isn't specified in the Helm chart that the operator deploys, the operator's update operation is a no-op and doesn't trigger another reconcile. However, as we see here, the Helm chart does define imagePullSecrets for the istio-cni ServiceAccount, so every time the additional pull secret is added, the operator's update overwrites the field and removes it again. That update triggers the other controller to add the secret once more, which in turn triggers the operator. This is why we get the continuous reconcile loop.
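
One conceivable mitigation, sketched below, would be to preserve pull secrets that external controllers have already added to the live ServiceAccount when the operator re-applies the chart-rendered one, so the update becomes a no-op. This is only an illustration of the idea, not necessarily what the operator (or the eventual fix) actually implements:

package helm

import corev1 "k8s.io/api/core/v1"

// mergeExternalPullSecrets copies pull secrets that exist on the live
// ServiceAccount but not in the rendered manifest (e.g. the OpenShift
// dockercfg secret) into the rendered object, so applying it no longer
// strips them and no new update event is generated.
func mergeExternalPullSecrets(rendered, live *corev1.ServiceAccount) {
	seen := map[string]bool{}
	for _, s := range rendered.ImagePullSecrets {
		seen[s.Name] = true
	}
	for _, s := range live.ImagePullSecrets {
		if !seen[s.Name] {
			rendered.ImagePullSecrets = append(rendered.ImagePullSecrets, s)
		}
	}
}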


luksa commented Oct 22, 2024

And here's why the imagePullSecrets field gets overridden instead of merged:

// +optional
// +listType=atomic    <------------ THIS
ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" protobuf:"bytes,3,rep,name=imagePullSecrets"`

(from https://github.com/kubernetes/api/blob/master/core/v1/types.go#L5710)


luksa commented Oct 22, 2024

Interestingly, the imagePullSecrets field in Pod.spec uses a different merge strategy:

// +patchMergeKey=name
// +patchStrategy=merge
// +listType=map
// +listMapKey=name
ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,15,rep,name=imagePullSecrets"`

(from https://github.com/kubernetes/api/blob/master/core/v1/types.go#L3855-L3858)


luksa commented Oct 22, 2024

Apparently this is a known Kubernetes issue that can only be fixed by introducing a new v2 API version of ServiceAccount, which isn't going to happen any time soon. See kubernetes/kubernetes#72475

(this means we have to fix this in the operator)


twhite0 commented Oct 22, 2024

@luksa: I'm half following 😊. Do you think the episode is caused by adding a pull secret to the IstioCNI CR, or by the addition of the dockercfg secret to the SA's pull secrets? More to the point, is there something additional I should try?


luksa commented Oct 22, 2024

@twhite0 unfortunately, there's nothing else you can do.

This is caused by another controller (not one that's in the Sail operator) adding an additional secret to (every?) ServiceAccount. You can see the controller name by running oc get sa -w -oyaml --show-managed-fields and then looking up who owns the imagePullSecrets field in .metadata.managedFields (you have to look at the version of the SA just after the secret is added).
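
For reference, the managedFields entry to look for has roughly this shape; the manager name below is hypothetical, and whichever manager lists f:imagePullSecrets in its fieldsV1 is the controller adding the secret:

metadata:
  managedFields:
  - manager: openshift-controller-manager   # hypothetical value - check your own output
    operation: Update
    apiVersion: v1
    fieldsV1:
      f:imagePullSecrets: {}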


dgn commented Oct 28, 2024

Hmm. did we establish what adds the imagePullSecret to the SA? If the name of the added secret is stable, maybe simply adding it to the IstioCNI resource is a workaround?


luksa commented Oct 28, 2024

An external system (or OpenShift itself). I guess this is doing a similar thing to https://github.com/knrc/registry-puller.

It's definitely something we need to handle. We can't work around it.


twhite0 commented Oct 29, 2024

Hmm. did we establish what adds the imagePullSecret to the SA?

OpenShift ServiceAccounts

Each service account automatically contains two secrets:

  • An API token
  • Credentials for the OpenShift Container Registry


twhite0 commented Nov 5, 2024

@luksa: I could use some advice/guidance on my path forward. I can see how we could deploy using the Istio CNI Helm charts directly, but I'm missing your thoughts on the Sail operator's direction here (await an Istio change, or a workaround in the operator).


luksa commented Nov 6, 2024

@twhite0 I've implemented a quick hack to prevent the continuous reconciliation. I'll see if I can implement the proper fix next week.


twhite0 commented Nov 6, 2024

Thanks @luksa.

istio-testing pushed a commit that referenced this issue Nov 7, 2024
…external controllers (#469)

Addresses #430

Signed-off-by: Marko Lukša <[email protected]>