
kafka-controller unable to bring up dispatcher pods after hitting quota issues during a cluster BOM update #4168

Open
mdwhitley opened this issue Nov 20, 2024 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mdwhitley

Describe the bug
I have observed this issue in our production clusters under both 1.14 and 1.15 releases.

The initial issue was discovered during our upgrade from 1.14 => 1.15. Rollouts hung, and we discovered that the knative-eventing namespace resource quota had been incorrectly modified during a region-wide update; both the ConfigMap and Secret resources were beyond the quota limits as a result. In 1 of the 2 clusters that hit this issue, a cluster-wide BOM update was also in progress. Once the quota issue was corrected, the cluster which was not undergoing a BOM update came back up successfully while the other did not.

The cluster that did not come back up was the one which also had the BOM update running. The update was immediately paused, but services did not come back online afterward. The primary impact was 1 of the 2 dispatcher pods stuck in Terminating state:

kafka-broker-dispatcher-0                1/1     Running       0          15d
kafka-broker-dispatcher-1                1/1     Terminating   0          15d

with failed triggers reporting

failed to bind resource to pod: Internal error occurred: failed calling webhook "pods.defaulting.webhook.kafka.eventing.knative.dev": failed to call webhook: Post "https://kafka-webhook-eventing.knative-eventing.svc:443/pods-defaulting?timeout=2s": context deadline exceeded

We have experienced webhook timeouts before related to cluster BOM updates, including some that caused unrelated operators in non-knative namespaces to fail. A number of our configurations are modified to run with failurePolicy: Ignore as a result. pods.defaulting.webhook.kafka.eventing.knative.dev was still set to Fail, and changing it to Ignore resulted in the dispatcher StatefulSets not coming up because of

Warning  FailedCreate  66s (x15 over 2m47s)  statefulset-controller  create Pod kafka-broker-dispatcher-1 in StatefulSet kafka-broker-dispatcher failed error: Pod "kafka-broker-dispatcher-1" is invalid: spec.containers[0].volumeMounts[1].name: Not found: "contract-resources"
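For reference, the failurePolicy change described above can be sketched roughly as follows. This is illustrative only: the webhook index (`/webhooks/0`) is an assumption and should be verified against the live object before patching.

```shell
# Illustrative sketch only: flip the Kafka pod-defaulting webhook to Ignore.
# Verify the webhook index (/webhooks/0) against the live configuration first:
#   kubectl get mutatingwebhookconfiguration \
#     pods.defaulting.webhook.kafka.eventing.knative.dev -o yaml
kubectl patch mutatingwebhookconfiguration \
  pods.defaulting.webhook.kafka.eventing.knative.dev \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```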

The kafka-broker-dispatcher-1 pod would not start due to the invalid StatefulSet template and kafka-controller not reconciling it. Within kafka-controller there were error logs reporting failures to communicate with the kafka-broker-dispatcher-1 pod, which no longer existed.

We tried various deployment/statefulset restarts, which only caused the other, working dispatcher pod to go down and not come back up, putting all triggers into failed states.

We tried deleting all ConfigMap/Deployment/StatefulSet resources and doing another 1.15 deployment which resulted in the same stuck behavior. We also tried a downgrade to 1.14 with the same results.

Mitigations were put in place to manually define the contract-resources volume and route all triggers into the single dispatcher pod to keep eventing limping along. This allowed us to finish running BOM updates on the cluster and maintain normal operations during this time.

Not long after we mitigated the previous issue, BOM updates were started in additional clusters with stable/working 1.14 knative installations. Unfortunately, those clusters suffered the same quota issue, so the knative-eventing namespace was over quota for CM/Secrets, and both experienced partial degradation, with dispatcher pods stuck in Terminating and kafka-controller not properly starting new ones. The same mitigations were put in place, though they are not ideal: processing through 300+ triggers took around 2 hours to fully come back online.

To try to work around the webhook timeout, I increased the pods.defaulting.webhook.kafka.eventing.knative.dev timeout to 10s. When attempting an upgrade from 1.14 => 1.15 (with config changes) on the original cluster, everything rolled out as expected and came back up. I had the same result when upgrading one of the other impacted 1.14 clusters. Both of these clusters were post-BOM update. Our dev/stg clusters also received a BOM update during this time (both had the latest mentioned 1.15 changes) and maintained expected availability the entire time.
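The timeout bump can be applied along these lines (a sketch; as with failurePolicy, the webhook index is an assumption and should be checked against the live object):

```shell
# Sketch: raise timeoutSeconds on the pod-defaulting webhook from 2s to 10s.
# Verify /webhooks/0 is the right entry before applying.
kubectl patch mutatingwebhookconfiguration \
  pods.defaulting.webhook.kafka.eventing.knative.dev \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/timeoutSeconds","value":10}]'
```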

My working hypothesis is the quota issue combined with any pod movements/restarts results in this type of "stuck" behavior, which makes sense. This combined with a cluster BOM update, which on its own can cause API timeouts, was able to get us into a state which could not be automatically recovered once the quota issue was resolved.

Expected behavior

To Reproduce
Potentially:

  • Knative 1.14 or 1.15 installed properly
  • Reduce the ConfigMap/Secret quota in the knative-eventing namespace such that # existing > quota
  • Initiate a full cluster upgrade/cycle process
  • Watch for when dispatcher pods are moved off nodes that go under maintenance
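The quota step above could be forced with something like the following sketch (the hard limits are placeholders and must be set below the namespace's current counts for the repro to work):

```shell
# Sketch: force knative-eventing over quota for ConfigMaps/Secrets.
# First check how many already exist:
kubectl get configmaps -n knative-eventing --no-headers | wc -l
kubectl get secrets -n knative-eventing --no-headers | wc -l
# Then apply a quota below those counts (1 is a placeholder value):
kubectl create quota repro-quota -n knative-eventing \
  --hard=configmaps=1,secrets=1
```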

Knative release version
1.14 + 1.15

Additional context

@mdwhitley mdwhitley added the kind/bug Categorizes issue or PR as related to a bug. label Nov 20, 2024
@mdwhitley (Author)

We had another cluster start a BOM update tonight. I had already put in place the 1.15 upgrade + timeout changes and as soon as the update began on the cluster, all triggers began flipping from Ready to ConsumerBinding. With the 10s timeout the error returned from kafka-controller:

{"level":"error","ts":"2024-11-21T00:12:02.752Z","logger":"kafka-broker-controller","caller":"controller/controller.go:564","msg":"Reconcile error","commit":"7092bb9-dirty","knative.dev/pod":"kafka-controller-7b9d5f8f95-qhdxn","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumer.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.Consumer","knative.dev/traceid":"a32f1de7-3e78-478b-8ef3-d8b5800b1980","knative.dev/key":"conversation/839dd499-424e-40f3-9dff-bff69e9c6a2b-n4cq4","duration":5.197144629,"error":"failed to bind resource to pod: Internal error occurred: failed calling webhook \"pods.defaulting.webhook.kafka.eventing.knative.dev\": failed to call webhook: Post \"https://kafka-webhook-eventing.knative-eventing.svc:443/pods-defaulting?timeout=10s\": http: server gave HTTP response to HTTPS client","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/[email protected]/controller/controller.go:564\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/[email protected]/controller/controller.go:541\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/[email protected]/controller/controller.go:489"}

After around 20 minutes, all our triggers are now in the unavailable state. No quota issues in this case. Just a working 1.15 installation one minute and then failing with the above error on BOM update.

@pierDipi (Member)

@mdwhitley maybe you get http: server gave HTTP response to HTTPS client because the webhook server secret kafka-webhook-eventing-certs is not present.

That secret is populated by Knative to serve HTTPS requests for the defaulting webhook [1] and is part of the released artifacts [2], so it's expected to exist. When it is not present, the webhook defaults to serving no certs [3].

[1] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/certificates/certificates.go#L65-L110
[2] https://github.com/knative-extensions/eventing-kafka-broker/blob/main/control-plane/config/eventing-kafka-broker/200-webhook/400-webhook-secret.yaml
[3] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/webhook.go#L194-L198

Can you confirm the state of kafka-webhook-eventing-certs when you get http: server gave HTTP response to HTTPS client?

Another part that we could improve is to define a PodDisruptionBudget to keep at least 1 (or more) webhook instance available at any given point, so that webhook unavailability cannot cause a pod to come up without the volume definition.
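That suggestion might look roughly like the following. This is a sketch only: the selector labels are assumptions, not the actual release labels, and should be matched against the webhook Deployment.

```shell
# Sketch of a PodDisruptionBudget keeping >=1 webhook replica available.
# The matchLabels below are assumptions; check the webhook Deployment's labels.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-webhook-eventing
  namespace: knative-eventing
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kafka-webhook-eventing
EOF
```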

@mdwhitley (Author)

@pierDipi The secret is present and does not appear to have been recreated since our original deployment

$ k get secrets -n knative-eventing
NAME                                          TYPE                                  DATA   AGE
eventing-controller-token-fstts               kubernetes.io/service-account-token   3      596d
eventing-webhook-certs                        Opaque                                3      596d
eventing-webhook-token-bdn52                  kubernetes.io/service-account-token   3      596d
kafka-broker-secret                           Opaque                                4      596d
kafka-controller-token-cch7j                  kubernetes.io/service-account-token   3      596d
kafka-webhook-eventing-certs                  Opaque                                3      596d
kafka-webhook-eventing-token-rnfgg            kubernetes.io/service-account-token   3      596d
knative-eventing-alert-secret                 Opaque                                2      596d
knative-kafka-broker-data-plane-token-dl588   kubernetes.io/service-account-token   3      596d
pingsource-mt-adapter-token-92f4d             kubernetes.io/service-account-token   3      596d

@pierDipi (Member)

If you try to do a TLS handshake with the kafka-webhook-eventing server, does it succeed? Do you see any relevant logs?

@mdwhitley (Author)

mdwhitley commented Nov 21, 2024

At present, yes. Checking from one of our dead letter pods that has openssl:

$ k exec {pod} -- bash -c "openssl s_client -connect kafka-webhook-eventing.knative-eventing.svc:443 -showcerts"
Connecting to 172.21.71.107
depth=0 O=knative.dev, CN=kafka-webhook-eventing.knative-eventing.svc
verify error:num=18:self-signed certificate
verify return:1
depth=0 O=knative.dev, CN=kafka-webhook-eventing.k
CONNECTED(00000003)

I did not have debug logging enabled on the cluster from last night though, and no errors presented in the webhook pods.

@mdwhitley (Author)

mdwhitley commented Nov 21, 2024

The BOM updates in question are upgrades from v1.28.14 to v1.28.15. In cases where we had a fully working 1.15 install, as soon as master nodes began maintenance, that is when triggers went down and webhook HTTP errors began.

In our clusters that have had a BOM update without disruption to Knative (dev/stg), those were upgraded v1.28.15 to v1.29.10.

@mdwhitley (Author)

I've pulled the kafka-webhook logs and can observe that most traffic to the pods stops around the time the BOM update begins. The incident begins around 18:56; no more remote admission controller requests occur once everything goes down.

@pierDipi (Member)

pierDipi commented Nov 22, 2024

What is a BOM update? In terms of operations, what is it actually doing?

@pierDipi (Member)

pierDipi commented Nov 22, 2024

As soon as master nodes began maintenance, that is when triggers went down

When you say triggers went down, what does that actually mean? Did they stop sending events, or did the status go to not ready? Or both?

@mdwhitley (Author)

mdwhitley commented Nov 22, 2024

What is a BOM update? In terms of operations, what is it actually doing?

BOM updates for our clusters are upgrades of the Kubernetes version (to v1.28.15) as well as OS updates and other vulnerability patches across the nodes. It starts with master nodes, then edge nodes, then worker nodes.

when you say, triggers went down, what does that actually mean? Stopped sending events or the status went to not ready? or both?

The Trigger is reporting Ready=False and Reason=BindInProgress. While in this state, events continue to flow through the existing dispatcher pods. Once one of the pods goes down (moved due to node maintenance), it is stuck in Terminating because its finalizers hang in kafka-controller due to the webhook errors. Inside the terminating dispatcher pod all threads have exited, so no more work is being done, and all triggers handled by that pod are completely offline and will not recover.
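For anyone hitting the same stuck-Terminating state, the finalizer situation can be inspected (and, as an absolute last resort, cleared) with something like the following sketch. Note that clearing finalizers bypasses kafka-controller's cleanup, so it should only be used when the controller cannot recover on its own.

```shell
# Sketch: inspect the finalizers blocking deletion of the stuck pod.
kubectl get pod kafka-broker-dispatcher-1 -n knative-eventing \
  -o jsonpath='{.metadata.finalizers}'
# Last resort: removing finalizers skips kafka-controller's cleanup logic.
kubectl patch pod kafka-broker-dispatcher-1 -n knative-eventing \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```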

@pierDipi (Member)

do you see any logs in the webhook that look like the ones here?

  • "failed to fetch secret", zap.Error(err)
  • "server key missing"
  • "server cert missing"

https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/webhook.go#L195-L210

@mdwhitley (Author)

Checking the knative-eventing namespace for the entire day of these incidents on our most recent cluster to have issues, I am unable to find any of those log messages.

@pierDipi (Member)

pierDipi commented Nov 26, 2024

is there any way we could trace the apiserver -> kafka webhook requests?

It's unclear why the api server is getting an HTTP response from an HTTPS server without seeing the logs above. Also, the webhook secret is cached (we use informers/listers), so the webhook serves the TLS certs from memory and therefore doesn't rely on the api server being up/available.
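Short of full apiserver-side tracing, one low-tech way to observe what the webhook is actually serving would be a direct probe (a sketch; the service name and path come from the error messages above):

```shell
# Sketch: probe the webhook service directly to see which protocol it speaks.
kubectl port-forward -n knative-eventing svc/kafka-webhook-eventing 8443:443 &
sleep 2
# If certs are loaded, this completes a TLS handshake (an HTTP error body is fine):
curl -kv https://localhost:8443/pods-defaulting
# If this plain-HTTP request succeeds instead, it matches the
# "server gave HTTP response to HTTPS client" symptom:
curl -v http://localhost:8443/pods-defaulting
kill %1
```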

@mdwhitley (Author)

I am not sure if we would be able to do any tracing; what exactly would you want enabled? The remaining 2 clusters that need this Kubernetes update are our largest, and I will likely be putting mitigations in place prior to the update, as we cannot afford to have them go completely down.
