kafka-controller unable to bring up dispatcher pods after hitting quota issues during a cluster BOM update #4168
Comments
We had another cluster start a BOM update tonight. I had already put in place the 1.15 upgrade + timeout changes and as soon as the update began on the cluster, all triggers began flipping from Ready to
After around 20 minutes, all our triggers are now in the unavailable state. No quota issues in this case: a working 1.15 installation one minute, then failing with the above error on BOM update.
@mdwhitley maybe you get
That secret is populated by Knative to serve HTTPS requests for the defaulting webhook [1] and is part of the released artifacts [2], so it is expected to exist. When it is not present, the webhook defaults to serving no certs [3].
[1] https://github.com/knative/pkg/blob/a7fd9b10bb9febf537db69284723a5337adc0a50/webhook/certificates/certificates.go#L65-L110
Can you confirm the state of the
Another part that we could improve is to define a PodDisruptionBudget to force at least 1 (or more) webhook instance to be available at any given point, so that there is no webhook unavailability that causes the pod to lack the volume definition.
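A minimal sketch of the PodDisruptionBudget idea mentioned above, not something that ships with the release. The object name and the `app: kafka-webhook-eventing` selector label are assumptions and should be checked against the labels on the actual webhook pods; the `knative-eventing` namespace is the one discussed in this issue.

```yaml
# Sketch only: keep at least one webhook pod available during voluntary
# disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-webhook-eventing-pdb    # hypothetical name
  namespace: knative-eventing
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kafka-webhook-eventing     # assumed label, verify on the pods
```

A budget like this only helps if the webhook Deployment runs at least 2 replicas. With a single replica, `minAvailable: 1` blocks eviction of that pod and can stall node drains.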
@pierDipi The secret is present and does not appear to have been recreated since our original deployment
If you try to do a TLS handshake with the kafka-webhook-eventing server, does it succeed? Do you see any relevant logs?
At present, yes. Checking from one of our dead letter pods that has
I did not have debug logging enabled on the cluster last night, though, and no errors appeared in the webhook pods.
The BOM updates in question are upgrades from
In our clusters that have had a BOM update without disruption to Knative (dev/stg), those were upgraded
What is a BOM update? In terms of operations, what does it actually do?
When you say the triggers went down, what does that actually mean? Did they stop sending events, did the status go to not ready, or both?
BOM updates for our cluster upgrade the k8s version (v1.28.15) and apply OS updates and other vulnerability updates across the nodes. The update starts with master nodes, then edge nodes, then worker nodes.
Trigger is reporting
Do you see any logs in the webhook that look like the ones here?
Checking the
Is there any way we could trace the apiserver -> kafka webhook requests? It is unclear why the API server is getting an HTTP response from an HTTPS server without us seeing the logs above. Also, the webhook secret is cached (we use informers/listers), so the webhook serves the TLS certs from memory and therefore does not rely on the API server being up or available.
I am not sure we would be able to do any tracing. What exactly would you want enabled? The remaining 2 clusters that need this k8s update are our largest, and I will likely put mitigations in place prior to the update, as we cannot afford to have them go completely down.
Describe the bug
I have observed this issue in our production clusters under both 1.14 and 1.15 releases.
The initial issue was discovered during our upgrade from 1.14 => 1.15. Rollouts hung, and we then discovered that the `knative-eventing` namespace resource quota had been incorrectly modified during a region-wide update. Both the ConfigMap and Secret resources were beyond the quota limits as a result. In 1 of the 2 clusters that hit this issue, a cluster-wide BOM update was also in progress. Once the quota issue was corrected, the cluster that was not undergoing a BOM update came back up successfully, while the other did not.
The cluster that did not come back up was the one that also had the BOM update running. The update was immediately paused, but services did not come back online afterward. The primary impact was 1 of the 2 dispatcher pods stuck in Terminating state:
with failed triggers reporting
We have experienced webhook timeouts related to cluster BOM updates before, including some that caused unrelated operators in non-Knative namespaces to fail. A number of our configurations are modified to run with `failurePolicy: Ignore` as a result.
`pods.defaulting.webhook.kafka.eventing.knative.dev` was still left with `Fail`, and changing it to `Ignore` resulted in the dispatcher statefulsets not coming up because of
The `kafka-broker-dispatcher-1` pod would not start due to the invalid statefulset template and kafka-controller not reconciling it. Within kafka-controller there were error logs reporting failures due to not being able to communicate with the `kafka-broker-dispatcher-1` pod, which didn't exist anymore.
We tried various deployment/statefulset restarts, which only resulted in the other, working dispatcher pod going down and not coming back up, putting all triggers in failed states.
We tried deleting all ConfigMap/Deployment/StatefulSet resources and doing another 1.15 deployment which resulted in the same stuck behavior. We also tried a downgrade to 1.14 with the same results.
Mitigations were put in place to manually define the `contract-resources` volume and route all triggers into the single dispatcher pod to get eventing limping along. This has allowed us to finish running BOM updates on the cluster and keep normal operations during this time.
Not long after we mitigated the previous issue, BOM updates were started in additional clusters with stable, working 1.14 Knative installations. Unfortunately, those clusters suffered the same quota issue, so the `knative-eventing` namespace was over quota for ConfigMaps/Secrets, and both experienced partial degradation, with dispatcher pods stuck in Terminating and kafka-controller not properly starting up new ones. The same mitigations were put in place, though they are not ideal, as processing through 300+ triggers took around 2 hours to fully come back online.
To try to work around the webhook timeout, I increased the `pods.defaulting.webhook.kafka.eventing.knative.dev` timeout to 10s. When attempting an upgrade from 1.14 => 1.15 (with config changes) on the original cluster, everything rolled out as expected and came back up. I had the same result when upgrading one of the other impacted 1.14 clusters as well. Both of these clusters were post-BOM update. Our dev/pstg clusters also received a BOM update during this time (both had the latest mentioned 1.15 changes), and both maintained expected availability the entire time.
My working hypothesis is that the quota issue combined with any pod movements/restarts results in this type of "stuck" behavior, which makes sense. This, combined with a cluster BOM update, which on its own can cause API timeouts, was able to get us into a state that could not be automatically recovered from once the quota issue was resolved.
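For reference, the webhook timeout change described above can be sketched as a patch fragment against the MutatingWebhookConfiguration. This is an illustration, not the exact change applied in our clusters: the fragment assumes it is applied as a strategic merge patch to the configuration object that contains this webhook entry, and all other required webhook fields (`clientConfig`, `sideEffects`, `admissionReviewVersions`, and so on) are left as deployed.

```yaml
# Patch fragment (sketch): only the fields discussed in this issue are shown.
webhooks:
  - name: pods.defaulting.webhook.kafka.eventing.knative.dev
    timeoutSeconds: 10     # increased timeout, as described above
    failurePolicy: Fail    # kept as Fail; switching to Ignore left dispatcher pods without the volume definition
```

`timeoutSeconds` accepts values from 1 to 30 in `admissionregistration.k8s.io/v1`. Keeping `failurePolicy: Fail` trades pod creation failures during webhook outages for the guarantee that no dispatcher pod is admitted without the injected volume.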
Expected behavior
To Reproduce
Potentially:
Then watch for when dispatcher pods are moved off nodes that go under maintenance.
Knative release version
1.14 + 1.15
Additional context