
GKE Subsetting recalculation only appears to occur when nodes are added, not removed #2650

Open
glasser opened this issue Aug 28, 2024 · 6 comments


glasser commented Aug 28, 2024

I use GCE ingresses via hosted GKE (not running my own ingress-gce).

We have a large-ish cluster with GKE subsetting and a lot of internal network pass-through load balancers.

We wanted to move all our pods to a new node pool, so we created the node pool, drained the old nodes gradually until they had no pods left (other than DaemonSets), and then scaled the original node pool down to 0 nodes.

This broke many of our load balancers because it removed all the GCE_VM_IP endpoints from their zonal NEGs.

After some experimentation, it appears that when you remove nodes from GKE, GCE_VM_IP endpoints can be removed from the zonal NEGs associated with internal network pass-through load balancers (with externalTrafficPolicy: Cluster), but the controller won't actively add endpoints for newer nodes to replace them.

Adding just one more node to the cluster after this seems to be sufficient to trigger recalculation and bring those NEGs back up to 25 endpoints. But if you don't do that, you can silently lose endpoints from your NEGs and eventually break them!
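For reference, the per-zone endpoint counts can be checked with something like the following (the NEG name and zone are placeholders for whatever the controller created for your Services):

```shell
# List the GCE_VM_IP NEGs that back the internal pass-through load balancers
gcloud compute network-endpoint-groups list \
    --filter="networkEndpointType=GCE_VM_IP"

# Count the endpoints in one zonal NEG; after scaling the old pool to 0,
# this can end up well below the expected 25 endpoints
gcloud compute network-endpoint-groups list-network-endpoints NEG_NAME \
    --zone=ZONE \
    --format="value(networkEndpoint.instance)" | wc -l
```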

@swetharepakula (Member)

Thanks @glasser for reporting!

This should be fixed with #2622.

We are currently qualifying the new release and hope to have the fix rolled out to GKE in the next couple of weeks.

@swetharepakula (Member)

We have also published release notes with possible mitigations for the time being: https://cloud.google.com/kubernetes-engine/docs/release-notes#August_14_2024

The easiest way to avoid downtime is to switch to externalTrafficPolicy: Local (xTP=Local).
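For a Service that can tolerate the change, that is roughly (the Service name is a placeholder):

```shell
# Switch the Service to externalTrafficPolicy: Local so traffic only goes
# to nodes that actually run serving pods (SERVICE_NAME is a placeholder)
kubectl patch service SERVICE_NAME \
    -p '{"spec":{"externalTrafficPolicy":"Local"}}'
```

Note that Local changes how traffic is distributed, so it's worth confirming the serving pods are spread across enough nodes first.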

@glasser (Author) commented Aug 28, 2024

Thanks, this sounds exactly like our issue. I guess I'm not quite up to date on reading our release notes!

Switching our production cluster over to Local today seems a bit less scary than scaling down bit by bit and triggering some sort of scale-up on another node pool after each step. (Though I don't know a better way to trigger a scale-up than to schedule something that doesn't fit... I don't want to just scale the pool normally because there's currently a fair amount of imbalance across zones.)

@glasser (Author) commented Aug 28, 2024

(Ah OK, we can just scale up a random other node pool.)
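Something like this, for anyone else in the same spot (cluster, pool, and count are placeholders):

```shell
# Resize a different node pool by one node to nudge the NEG controller
# into recalculating the subsets (all names and the count are placeholders)
gcloud container clusters resize CLUSTER_NAME \
    --node-pool OTHER_POOL_NAME \
    --num-nodes NEW_NODE_COUNT \
    --region REGION
```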

Will this be called out in the release notes when it is fixed?

@gauravkghildiyal (Member)

Hi @glasser. Not sure if you already noticed, but we recently sent out the release updates for this fix: https://cloud.google.com/kubernetes-engine/docs/release-notes#September_10_2024

@glasser (Author) commented Sep 26, 2024

@gauravkghildiyal Thanks! And the release notes are clear about what we need to do (control plane upgrade). Trying it in our dev cluster now!
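For reference, if anyone else lands here, the control-plane-only upgrade can be done with something along these lines (cluster name, version, and location are placeholders):

```shell
# Upgrade only the control plane (not the node pools) to a GKE version
# that includes the fix; name, version, and region are placeholders
gcloud container clusters upgrade CLUSTER_NAME \
    --master \
    --cluster-version VERSION \
    --region REGION
```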
