problem wait for podInfo repo sync, times out waiting for condition #3902

Closed
drewgonzales360 opened this issue Oct 18, 2024 · 2 comments

drewgonzales360 commented Oct 18, 2024

Describe the bug
On Kubernetes clusters with lots of pods (>10k), the PodInfo reflector times out. I'm measuring cluster size in number of pods only because I believe pods are the only resource that feeds the PodInfo cache.

(Screenshot: 2024-10-17 at 17:43:39)

We also have a cluster with 30,000 pods and see the same behavior there. On our smaller clusters (~400 pods), we don't.

Steps to reproduce

  1. Create a cluster with 10,000 pods
  2. Start the aws-load-balancer-controller
  3. Controller goes into CrashLoopBackOff

Expected outcome
I'd expect there to be some way to increase the timeout so the controller has more time to build the PodInfo cache (see the sketch after the Environment section below).

Environment
We've been running v2.4.4, which works on the same cluster (with more than 10k pods). When we tried to upgrade, we couldn't get the controller to stay online. Newer versions of the controller start, then log:

{"level":"info","msg":"starting podInfo repo"}
⌛ this takes a good amount of time ~60s
{"level":"error","msg":"problem wait for podInfo repo sync", "error": "timed out waiting for the condition"}
  • ❌ v2.7.2, v2.8.3, v2.9.0: the versions where we see this behavior.
  • ✅ v2.4.4: this version of the controller works as expected on clusters with more than 30,000 pods.
  • Kubernetes version: 1.28
  • Not using EKS
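
For context, here is a minimal sketch of the generic client-go pattern that produces the "timed out waiting for the condition" string (wait.ErrWaitTimeout). This is not the controller's actual code, and the 60s budget is an assumption based on roughly how long it ran before crashing; it only illustrates how a pod informer polled for sync under a fixed budget can time out when the initial pod LIST on a 10k-30k pod cluster takes too long.

```go
// Sketch only: a pod informer started and polled for sync under a fixed
// timeout, the generic client-go pattern behind "timed out waiting for
// the condition". NOT the aws-load-balancer-controller's actual code.
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory; the pod informer feeds a pod cache similar
	// in spirit to the controller's podInfo repo.
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Hypothetical fixed 60s budget. If the initial pod LIST takes longer,
	// PollImmediate returns wait.ErrWaitTimeout:
	// "timed out waiting for the condition".
	if err := wait.PollImmediate(time.Second, 60*time.Second, func() (bool, error) {
		return podInformer.HasSynced(), nil
	}); err != nil {
		log.Fatalf("problem wait for podInfo repo sync: %v", err)
	}
	log.Println("podInfo cache synced")
}
```

If that budget were exposed as a flag, large clusters could raise it directly instead of working around it elsewhere.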
zac-nixon (Collaborator) commented
Hello! Thanks for bringing this to our attention. My first thought is that your Kubernetes API server may be underscaled for such a load. I've done a quick scan of our change history between 2.4.4 and 2.9.0 and don't see any changes to how the pod info cache is hydrated. I just worked on this performance feature and tested with another user of the controller; we were able to run the controller with 20k+ pods with no issue, both before and after my feature.

That said, I think there is room for improvement in the controller to accommodate this case. I'll spin up a new cluster with the load you mention and try to reproduce.

drewgonzales360 (Author) commented

Ah, you're right about there not being many changes to that code path between 2.4.4 and 2.9.0. I just realized I get the same issue on 2.4.4.

I also realize now that I hadn't configured the liveness probe well; I was just using the default one. After relaxing those numbers, I can get the controller to stay online (rough example below).
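
For anyone else who hits this, the shape of the change is roughly the following, shown with the k8s.io/api Go types rather than Helm values. The numbers, path, and port here are illustrative assumptions, not the chart's defaults; match them to your own deployment.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical, relaxed liveness probe that gives the controller more time
// to finish the initial pod LIST before the kubelet restarts it.
var relaxedLivenessProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/healthz",            // assumed health endpoint
			Port: intstr.FromInt(61779), // assumed health-probe port
		},
	},
	InitialDelaySeconds: 120, // let the pod cache hydrate on large clusters
	PeriodSeconds:       10,
	TimeoutSeconds:      10,
	FailureThreshold:    6, // roughly an extra minute of grace before a restart
}

func main() {
	_ = relaxedLivenessProbe // placeholder so the sketch compiles standalone
}
```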

With that change in, I can close this issue. Thanks 🙏
