Description
Hello everyone! I'm experiencing a strange problem in my Kubernetes cluster that is hurting the availability of my NGINX Gateway Fabric dataplane pods.
Here's the issue: even though the node running the pod has plenty of spare CPU and memory, the /health endpoint of the dataplane pods still fails occasionally. When even that simple request fails, my clients' connections fail as well. I can see in my application monitoring graphs that whenever the NGINX /health check fails, usage of my apps drops significantly.
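To clarify what I mean by the /health check: it's the HTTP health check on the dataplane pods, roughly this shape if expressed as a Kubernetes readiness probe. The port and thresholds below are placeholders I'm using for illustration, not values I've confirmed against what NGINX Gateway Fabric ships with:

```yaml
# Sketch of the health check on the dataplane container, written as a readiness probe.
# The path is the /health endpoint mentioned above; the port and thresholds are
# illustrative placeholders, not confirmed NGINX Gateway Fabric defaults.
readinessProbe:
  httpGet:
    path: /health
    port: 8081
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```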
I found this Stack Overflow question that seems related: https://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
Should I apply some configuration like this on my machines? Can I set this directly in the NGINX deployment?
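To be concrete about what I mean: my understanding is that pod-level sysctls can be set through the pod's securityContext. net.ipv4.ip_local_port_range is on Kubernetes' default "safe" sysctl list, while something like net.core.somaxconn counts as unsafe and has to be explicitly allowed on the kubelet, which I don't think Autopilot lets me do. A minimal sketch of the kind of change I'm asking about (names and values are placeholders):

```yaml
# Sketch: widening the ephemeral port range for a dataplane pod via pod-level sysctls.
# Names and values are placeholders. Sysctls like net.core.somaxconn are "unsafe"
# by default and would need to be explicitly allowed on the kubelet.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-dataplane-example
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range
        value: "1024 65535"
  containers:
    - name: nginx
      image: nginx:stable
```

Is something like this even possible for the pods that NGINX Gateway Fabric provisions, or does it have to happen at the node level?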
My current setup is a GKE 1.33 Autopilot cluster. I've already contacted GCP support, and they recommended using C4 machine types, but that didn't solve the problem. During peak hours we reach up to 600,000 requests per minute, and I'm sure there are people handling even more traffic than that. What do they do to manage this load?
Our workaround has been to scale horizontally: to handle 600,000 requests per minute, I need 18 nodes JUST for NGINX so that I don't hit the "machine connection limit" mentioned in the Stack Overflow link above.
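In Kubernetes terms, that workaround basically means keeping each dataplane replica on its own node. A rough sketch of the kind of spreading involved (the labels are placeholders, not necessarily what my pods actually use):

```yaml
# Sketch: one dataplane replica per node via pod anti-affinity, so each replica
# gets a whole machine's connection headroom to itself. Labels are placeholders.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nginx-gateway-dataplane
        topologyKey: kubernetes.io/hostname
```

It works, but it's expensive and only treats the symptom.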
Can anyone help or at least point me in the right direction? This issue is severely impacting our application's availability, and it feels like it shouldn't be this hard to solve. We've been struggling with it for over a month and haven't found a solution, not even with GCP's own support. Could there be some NGINX configuration I'm missing?
Sometimes, even other types of connections fail, such as when I try to fetch logs from the pods. I'll attach some images below to illustrate the issue.
