Describe the bug
On Kubernetes clusters with lots of pods (>10k), the PodInfo reflector times out. I'm only measuring cluster size in number of pods because I think that's the only resource that informs the PodInfo cache.
We also have a cluster with 30,000 pods and see the same behavior. On our smaller clusters (400 pods), we don't see this problem.
Steps to reproduce
1. Create a cluster with 10,000 pods.
2. Start the aws-load-balancer-controller.
3. The controller goes into CrashLoopBackOff.
Expected outcome
I'd expect there to be some way to increase the timeout so we can give the controller more time to sync the PodInfo.
Environment
We've been running v2.4.4, which works on the same cluster (with more than 10k pods). When we tried to upgrade, we couldn't get the controller to stay online. Newer versions of the controller start, then log:
{"level":"info","msg":"starting podInfo repo"}
⌛ this takes a good amount of time ~60s
{"level":"error","msg":"problem wait for podInfo repo sync", "error": "timed out waiting for the condition"}
❌ v2.7.2, v2.8.3, v2.9.0: the versions on which we see this behavior.
✅ v2.4.4: this version works as expected on clusters with more than 30,000 pods.
Kubernetes version: 1.28
Not using EKS
Hello! Thanks for bringing this to our attention. My first thought is that your Kubernetes API server may be underscaled for such a load. I've done a quick scan of our change history between 2.4.4 and 2.9.0 and don't see any changes to how the pod info cache is hydrated. I recently worked on this performance feature and tested it with another user of the controller; we were able to run the controller with 20k+ pods without issue, both before and after the feature.
That being said, I think there is room for improvement in the controller to accommodate this case. I'll spin up a new cluster with the load you mention and try to reproduce.
Ah, you're right about there not being many changes to that code path between 2.4.4 and 2.9.0. I just realized I get the same issue on 2.4.4.
I also realize now that I hadn't configured the liveness probe well and was just using the default one. I adjusted those numbers and now the controller stays online.
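For anyone who hits the same thing: the fix was just giving the controller more startup grace before the kubelet restarts it. Below is a rough sketch of the kind of Deployment tuning involved; the health endpoint path and port are assumptions based on our setup (check your own manifest or Helm values), and the numbers are illustrative rather than chart defaults.

```yaml
# Illustrative livenessProbe tuning for the aws-load-balancer-controller Deployment.
# path/port are assumed here; use whatever health endpoint your deployment exposes.
livenessProbe:
  httpGet:
    path: /healthz
    port: 61779
    scheme: HTTP
  initialDelaySeconds: 60   # don't start probing until the podInfo sync has had time to run
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 6       # roughly another 60s of failed probes before a restart
```

The rule of thumb we used: initialDelaySeconds + failureThreshold × periodSeconds should comfortably exceed the observed sync time (~60s for us on the large clusters).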
With that change in, I can close this issue. Thanks 🙏