Description
Hello everyone! I'm experiencing a strange problem in my Kubernetes cluster that is hurting the availability of my NGINX Gateway Fabric dataplane pods.
Here's the issue: even though the node running the pod has plenty of spare CPU and memory, the /health endpoint of the dataplane pods still fails occasionally. When even that simple request fails, my clients' connections fail as well. I can see in my application monitoring graphs that whenever the NGINX /health check fails, usage of my apps drops significantly.
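To clarify what I mean by the /health check: it's the HTTP health check on the dataplane pods, roughly this shape if expressed as a Kubernetes readiness probe. The port and thresholds below are placeholders I'm using for illustration, not values I've confirmed against what NGINX Gateway Fabric ships with:

```yaml
# Sketch of the health check on the dataplane container, written as a readiness probe.
# The path is the /health endpoint mentioned above; the port and thresholds are
# illustrative placeholders, not confirmed NGINX Gateway Fabric defaults.
readinessProbe:
  httpGet:
    path: /health
    port: 8081
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```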
I found this Stack Overflow question that seems related: https://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
Should I apply some configuration like this on my machines? Can I set this directly in the NGINX deployment?
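To be concrete about what I mean: my understanding is that pod-level sysctls can be set through the pod's securityContext. net.ipv4.ip_local_port_range is on Kubernetes' default "safe" sysctl list, while something like net.core.somaxconn counts as unsafe and has to be explicitly allowed on the kubelet, which I don't think Autopilot lets me do. A minimal sketch of the kind of change I'm asking about (names and values are placeholders):

```yaml
# Sketch: widening the ephemeral port range for a dataplane pod via pod-level sysctls.
# Names and values are placeholders. Sysctls like net.core.somaxconn are "unsafe"
# by default and would need to be explicitly allowed on the kubelet.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-dataplane-example
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range
        value: "1024 65535"
  containers:
    - name: nginx
      image: nginx:stable
```

Is something like this even possible for the pods that NGINX Gateway Fabric provisions, or does it have to happen at the node level?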
My current setup is a GKE 1.33 Autopilot cluster. I've already contacted GCP support, and they recommended using C4 machine types, but that didn't solve the problem. During peak hours we reach up to 600,000 requests per minute, and I'm sure there are people handling even more traffic than that. What do they do to manage this load?
Our workaround has been to scale horizontally: to handle 600,000 requests per minute, I need 18 nodes JUST for NGINX so that I don't hit the "machine connection limit" mentioned in the Stack Overflow link above.
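In Kubernetes terms, that workaround basically means keeping each dataplane replica on its own node. A rough sketch of the kind of spreading involved (the labels are placeholders, not necessarily what my pods actually use):

```yaml
# Sketch: one dataplane replica per node via pod anti-affinity, so each replica
# gets a whole machine's connection headroom to itself. Labels are placeholders.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nginx-gateway-dataplane
        topologyKey: kubernetes.io/hostname
```

It works, but it's expensive and only treats the symptom.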
Can anyone help or at least point me in the right direction? This issue is severely impacting our application's availability, and it feels like it shouldn't be this hard to solve. We've been struggling with it for over a month and haven't found a solution, not even with GCP's own support. Could there be some NGINX configuration I'm missing?
Sometimes, even other types of connections fail, such as when I try to fetch logs from the pods. I'll attach some images below to illustrate the issue.
