problem wait for podInfo repo sync, times out waiting for condition #3902

Closed
drewgonzales360 opened this issue Oct 18, 2024 · 2 comments

drewgonzales360 commented Oct 18, 2024

Describe the bug
On Kubernetes clusters with lots of pods (>10k), the PodInfo reflector times out. I'm measuring cluster size in number of pods only because I believe pods are the only resource that feeds the PodInfo cache.

(Screenshot: 2024-10-17 at 17:43:39)

We also have a cluster with 30,000 pods and see the same behavior there. On our smaller clusters (~400 pods), we don't.

Steps to reproduce

  1. Create a cluster with 10,000 pods
  2. Start the aws-load-balancer-controller
  3. Controller goes into CrashLoopBackOff

Expected outcome
I'd expect there to be some way to increase the timeout so the controller has more time to build the PodInfo cache (see the sketch after the Environment section below).

Environment
We've been running v2.4.4, which works on the same cluster (with more than 10k pods). When we tried to upgrade, we couldn't get the controller to stay online. Newer versions of the controller start, then log:

{"level":"info","msg":"starting podInfo repo"}
⌛ this takes a good amount of time ~60s
{"level":"error","msg":"problem wait for podInfo repo sync", "error": "timed out waiting for the condition"}
  • ❌ v2.7.2, v2.8.3, v2.9.0: the versions where we see this behavior.
  • ✅ v2.4.4: this version of the controller works as expected on clusters with more than 30,000 pods.
  • Kubernetes version: 1.28
  • Not using EKS
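
For context, here is a minimal sketch of the generic client-go pattern that produces the "timed out waiting for the condition" string (wait.ErrWaitTimeout). This is not the controller's actual code, and the 60s budget is an assumption based on roughly how long it ran before crashing; it only illustrates how a pod informer polled for sync under a fixed budget can time out when the initial pod LIST on a 10k-30k pod cluster takes too long.

```go
// Sketch only: a pod informer started and polled for sync under a fixed
// timeout, the generic client-go pattern behind "timed out waiting for
// the condition". NOT the aws-load-balancer-controller's actual code.
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory; the pod informer feeds a pod cache similar
	// in spirit to the controller's podInfo repo.
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Hypothetical fixed 60s budget. If the initial pod LIST takes longer,
	// PollImmediate returns wait.ErrWaitTimeout:
	// "timed out waiting for the condition".
	if err := wait.PollImmediate(time.Second, 60*time.Second, func() (bool, error) {
		return podInformer.HasSynced(), nil
	}); err != nil {
		log.Fatalf("problem wait for podInfo repo sync: %v", err)
	}
	log.Println("podInfo cache synced")
}
```

If that budget were exposed as a flag, large clusters could raise it directly instead of working around it elsewhere.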
zac-nixon (Collaborator) commented
Hello! Thanks for bringing this to our attention. My first thought is that your Kubernetes API server may be underscaled for such a load. I've done a quick scan of our change history between 2.4.4 and 2.9.0 and don't see any changes to how the pod info cache is hydrated. I just worked on this performance feature and tested with another user of the controller; we were able to run the controller with 20k+ pods with no issue, both before and after my feature.

That said, I think there is room for improvement in the controller to accommodate this case. I'll spin up a new cluster with the load you mention and try to reproduce.

drewgonzales360 (Author) commented

Ah, you're right about there not being many changes to that code path between 2.4.4 and 2.9.0. I just realized I get the same issue on 2.4.4.

I also realize now that I hadn't configured the liveness probe well; I was just using the default one. After relaxing those numbers, I can get the controller to stay online (rough example below).
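
For anyone else who hits this, the shape of the change is roughly the following, shown with the k8s.io/api Go types rather than Helm values. The numbers, path, and port here are illustrative assumptions, not the chart's defaults; match them to your own deployment.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical, relaxed liveness probe that gives the controller more time
// to finish the initial pod LIST before the kubelet restarts it.
var relaxedLivenessProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/healthz",            // assumed health endpoint
			Port: intstr.FromInt(61779), // assumed health-probe port
		},
	},
	InitialDelaySeconds: 120, // let the pod cache hydrate on large clusters
	PeriodSeconds:       10,
	TimeoutSeconds:      10,
	FailureThreshold:    6, // roughly an extra minute of grace before a restart
}

func main() {
	_ = relaxedLivenessProbe // placeholder so the sketch compiles standalone
}
```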

With that change in, I can close this issue. Thanks 🙏
