
Switching service selector causes small amount of downtime with NEGs #638

inversion opened this issue Feb 13, 2019 · 12 comments


inversion commented Feb 13, 2019

I have implemented a blue/green update strategy for my deployment by updating a version label in the deployment template. I then update the service selector to use the new version label when enough pods of the new version are ready.
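
For reference, the Service looks roughly like this (names, ports, and labels below are simplified placeholders); the blue/green switch is just flipping `version: v1` to `version: v2` in the selector once enough v2 pods are Ready:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # container-native load balancing: endpoints are tracked in a NEG
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: my-app
    version: v1   # changed to v2 to cut traffic over to the new pods
  ports:
    - port: 80
      targetPort: 8080
```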

I have found that this strategy causes about 5-10 seconds of downtime (HTTP 502s) when using NEGs. The service events show:

Detach 2 network endpoint(s) (NEG "k8s1-my-service-neg" in zone "us-central1-b")
... 2 second delay...
Attach 1 network endpoint(s) (NEG "k8s1-my-service-neg" in zone "us-central1-b")

There were two pods of the old version (matched by the old service selector) and one pod of the new version at this point.

I'm wondering if there is a way to change the service selector without causing downtime?

I suspect this issue is related to
https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption (#583)

@rramkumar1

@freehan Any thoughts here?


freehan commented Feb 20, 2019

Can you allow some period of overlap during the transition?

For instance, suppose you have blue pods and green pods, and the service initially points to the blue pods.

1. Change the service selector to include both blue and green pods.
2. Wait for some time and make sure both are taking traffic.
3. Then change the service selector again to target only the green pods.

This should help. Please let us know if this still causes downtime.
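
Concretely, a Service selector is a plain label map, so it can't match two different values of the same key; the "include both" step in practice means dropping the version key from the selector. A sketch of the successive `spec.selector` values (labels are just examples):

```yaml
# 1. blue only
selector:
  app: myapp
  version: blue
---
# 2. overlap: drop the version key so both blue and green pods are matched
selector:
  app: myapp
---
# 3. once both sets are confirmed to be taking traffic, narrow to green only
selector:
  app: myapp
  version: green
```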

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 21, 2019

bowei commented Jun 21, 2019

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label on Jun 21, 2019

rubensayshi commented Jun 19, 2020

@freehan @bowei I'm also facing this issue; not sure how best to revive this discussion?

re: a short overlap, the point of a blue/green deploy is to not have any overlap at all,
but if that isn't possible, maybe you can give a suggestion on how to achieve minimal overlap?
I'm experimenting with two selector changes, `app: myapp; version: v1` -> `app: myapp` -> sleep X -> `app: myapp; version: v2`, but choosing the sleep duration is so arbitrary :/

it seems that with NEGs there are always a few seconds of 502s when changing the selector; it varies between 1~5s and sometimes more towards 20~30s.

without NEGs (ofc in that case using a NodePort service) there's a pretty large window of time in which requests continue to flow to the old pods.

the only thing that actually works properly is a LoadBalancer service without an Ingress, in which case it switches perfectly, but unfortunately that's not enough because we need an Ingress (as described here: https://cloud.google.com/solutions/implementing-deployment-and-testing-strategies-on-gke#perform_a_bluegreen_deployment).


rubensayshi commented Jun 22, 2020

it seems like the issue is really that when the new endpoints are added to the NEG, their health checks are UNKNOWN for a few seconds (regardless of whether the pods already exist / are ready or not), and the load balancer returns 502s for those ~5 seconds?

changing the health check interval doesn't matter; whether it's 1s or 30s, it's always ~5s of 502s...
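
(For context, the kind of interval tuning I mean, assuming it's configured through a BackendConfig attached to the Service, looks something like this; names and values are just examples:)

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig   # example name, referenced from the Service via the
                           # cloud.google.com/backend-config annotation
spec:
  healthCheck:
    type: HTTP
    requestPath: /healthz  # example path
    checkIntervalSec: 1    # tried anything from 1s up to 30s here
    timeoutSec: 1
    healthyThreshold: 1
    unhealthyThreshold: 2
```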


freehan commented Jun 22, 2020

@rubensayshi Yes. You pointed out the key gap. The 502s are caused by the race between two parallel operations: (1) old endpoints getting removed, and (2) new endpoints getting added. If (2) is slower than (1), then you end up with 502s during the transition.

I would suggest replacing the sleep X here with something slightly more sophisticated, for instance validating that the v2 app is taking traffic and performing as expected. Then the rollout system can decide whether to roll back or proceed with the rollout. Hence the workflow looks like:

app: myapp; version: v1 -> app: myapp -> monitor/validation -> app: myapp; version: v2 -> monitor/validation

FYI, we are building out something that would allow a more seamless transition for blue/green deployments. That is going to be exposed in a higher-level API, so one would not need to tweak the service selector.


rubensayshi commented Jun 23, 2020

hey @freehan, if there's something better on the horizon then I think we'll settle for the almost-perfect solution with the sleep. and yeah, the sleep was just a rough example; I already put the version in a response header and just ping the endpoint until I get 100% of the new version :D

is that "something better" to be expected in the near future? end of Q2 or beginning of Q3? and how can I make sure I won't miss its release, will it be listed in the GKE release notes? https://cloud.google.com/kubernetes-engine/docs/release-notes

and thanks for replying, especially to such an old issue <3


haizaar commented Feb 18, 2021

Google acknowledges they have an issue, and there is a public bug we can track: https://issuetracker.google.com/issues/180490128

(i.e. people still stumble upon this one)

@rubensayshi

nice, thanks for sharing @haizaar, I'll subscribe to that issue.


haizaar commented Mar 3, 2021

I had a chance to talk to a GKE PM regarding this problem. While they won't fix the current issue (which is caused by NEG switching), they do have plans to alleviate it by supporting the K8s Gateway API, which is really great since we would be able to do at least basic traffic management and routing without bringing in the heavyweight likes of Istio.

If you are on GKE, reach out to your GCP account manager to join a private preview of this feature once it's available. Or use other K8s Gateway API implementations that are already available.

/CC @keidrych
