
NGINX Gateway Fabric Pods Failing Health Checks Under High Connection Load on GKE Autopilot #4066

@NicolasPires777

Hello everyone! I'm experiencing a very strange problem in my Kubernetes cluster that's especially affecting the availability of my NGINX Gateway Fabric dataplane pods.

Here's the issue: even though the node running the pod has plenty of spare CPU and memory, the /health endpoint of the dataplane pods still fails occasionally. When this simple health check fails, my clients' connections fail as well. My application monitoring graphs show that whenever the NGINX /health check fails, the usage of my apps drops significantly.

I found this Stack Overflow question that seems related: https://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
Should I apply some configuration like this on my machines? Can I set this directly in the NGINX deployment?
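Before changing anything, it may help to see what the kernel currently reports for the kind of connection-related settings that question is about. The following is a minimal sketch, not a confirmed diagnosis: the list of settings is an assumption about which limits could be relevant, the helper names `read_setting` and `ephemeral_port_count` are hypothetical, and some files may be missing or unreadable from inside a container.

```python
from pathlib import Path
from typing import Optional

# Assumed-relevant kernel settings; this list is illustrative, not exhaustive.
SETTINGS = [
    "net/ipv4/ip_local_port_range",
    "net/core/somaxconn",
    "net/ipv4/tcp_max_syn_backlog",
    "net/netfilter/nf_conntrack_count",
    "net/netfilter/nf_conntrack_max",
]


def read_setting(name: str, root: Path = Path("/proc/sys")) -> Optional[str]:
    """Return the setting's current value, or None if it is not readable here."""
    try:
        return (root / name).read_text().strip()
    except OSError:
        return None


def ephemeral_port_count(port_range: str) -> int:
    """Count the ports in an ip_local_port_range value such as '32768 60999'."""
    low, high = (int(part) for part in port_range.split())
    return high - low + 1


# Print whatever is visible from the current network namespace.
for name in SETTINGS:
    value = read_setting(name)
    print(f"{name}: {value if value is not None else 'not available'}")

port_range = read_setting("net/ipv4/ip_local_port_range")
if port_range is not None:
    print(f"ephemeral ports available: {ephemeral_port_count(port_range)}")
```

Running this inside a dataplane pod and on a node (where that is possible) would show whether the two see different values. Whether any of these limits is actually the cause of the failing health checks is not established here.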

My current setup is a GKE 1.33 cluster with Autopilot. I've already contacted GCP support, and they recommended using C4 machine types, but that didn't solve my problem. During peak hours, we reach up to 600,000 requests per minute. I believe there are people handling even more traffic, so what do they do to manage this load?

Our workaround has been to scale horizontally: for 600,000 requests per minute, I need 18 nodes JUST for NGINX so that I don't hit the "machine connection limit" mentioned in the Stack Overflow link above.
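To put those numbers in per-second terms, here is a small worked calculation. It assumes traffic is spread evenly across the 18 nodes, which may not hold in practice, and the function name `per_node_rate` is just for illustration.

```python
def per_node_rate(requests_per_minute: int, nodes: int):
    """Return (cluster requests/second, requests/second per node)."""
    cluster_rps = requests_per_minute / 60
    return cluster_rps, cluster_rps / nodes


cluster_rps, node_rps = per_node_rate(600_000, 18)
print(f"cluster: {cluster_rps:.0f} req/s, per node: {node_rps:.0f} req/s")
# prints: cluster: 10000 req/s, per node: 556 req/s
```

So the peak is about 10,000 requests per second overall, or roughly 556 requests per second per NGINX node under the even-distribution assumption.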

Can anyone help or at least point me in the right direction? This issue is severely impacting our application's availability and it feels like it shouldn't be so hard to solve. We've been struggling with this for over a month and haven't found a solution, not even with GCP's own support. Could this be some NGINX configuration I'm missing?

Sometimes, even other types of connections fail, such as when I try to fetch logs from the pods. I'll attach some images below to illustrate the issue.

Image
