
Switching service selector causes small amount of downtime with NEGs #638

inversion opened this issue Feb 13, 2019 · 12 comments


inversion commented Feb 13, 2019

I have implemented a blue/green update strategy for my deployment by updating a version label in the deployment template. I then update the service selector to use the new version label when enough pods of the new version are ready.
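
For reference, the Service looks roughly like this (names, ports, and labels below are simplified placeholders); the blue/green switch is just flipping `version: v1` to `version: v2` in the selector once enough v2 pods are Ready:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # container-native load balancing: endpoints are tracked in a NEG
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: my-app
    version: v1   # changed to v2 to cut traffic over to the new pods
  ports:
    - port: 80
      targetPort: 8080
```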

I have found that this strategy causes about 5-10 seconds of downtime (HTTP 502s) when using NEGs. The service events show:

Detach 2 network endpoint(s) (NEG "k8s1-my-service-neg" in zone "us-central1-b")
... 2 second delay...
Attach 1 network endpoint(s) (NEG "k8s1-my-service-neg" in zone "us-central1-b")

There were two pods of the old version (matched by the old service selector) and one pod of the new version at this point.

I'm wondering if there is a way to change the service selector without causing downtime?

I suspect this issue is related to
https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption (#583)

@rramkumar1

@freehan Any thoughts here?


freehan commented Feb 20, 2019

Can you allow some period of overlap during the transition?

For instance, suppose you have blue pods and green pods, and the service initially points to the blue pods.

1. Change the service selector to include both blue and green pods.
2. Wait for some time and make sure both are taking traffic.
3. Then change the service selector again to target only the green pods.

This should help. Please let us know if this still causes downtime.
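
Concretely, a Service selector is a plain label map, so it can't match two different values of the same key; the "include both" step in practice means dropping the version key from the selector. A sketch of the successive `spec.selector` values (labels are just examples):

```yaml
# 1. blue only
selector:
  app: myapp
  version: blue
---
# 2. overlap: drop the version key so both blue and green pods are matched
selector:
  app: myapp
---
# 3. once both sets are confirmed to be taking traffic, narrow to green only
selector:
  app: myapp
  version: green
```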

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 21, 2019

bowei commented Jun 21, 2019

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label on Jun 21, 2019

rubensayshi commented Jun 19, 2020

@freehan @bowei I'm also facing this issue; not sure how best to revive this discussion?

re: a short overlap, the point of a blue/green deploy is to not have any overlap at all,
but if that isn't possible, maybe you can give a suggestion on how to achieve minimal overlap?
I'm experimenting with two selector changes, `app: myapp; version: v1` -> `app: myapp` -> sleep X -> `app: myapp; version: v2`, but choosing the sleep duration is so arbitrary :/

it seems that with NEGs there are always a few seconds of 502s when changing the selector; it varies between 1~5s and sometimes more towards 20~30s.

without NEGs (ofc in that case using a NodePort service) there's a pretty large window of time in which requests continue to flow to the old pods.

the only thing that actually works properly is a LoadBalancer service without an Ingress, in which case it switches perfectly, but unfortunately that's not enough because we need an Ingress (as described here: https://cloud.google.com/solutions/implementing-deployment-and-testing-strategies-on-gke#perform_a_bluegreen_deployment).


rubensayshi commented Jun 22, 2020

it seems like the issue is really that when the new endpoints are added to the NEG, their health checks are UNKNOWN for a few seconds (regardless of whether the pods already exist / are ready or not), and the load balancer returns 502s for those ~5 seconds?

changing the health check interval doesn't matter; whether it's 1s or 30s, it's always ~5s of 502s...
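
(For context, the kind of interval tuning I mean, assuming it's configured through a BackendConfig attached to the Service, looks something like this; names and values are just examples:)

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig   # example name, referenced from the Service via the
                           # cloud.google.com/backend-config annotation
spec:
  healthCheck:
    type: HTTP
    requestPath: /healthz  # example path
    checkIntervalSec: 1    # tried anything from 1s up to 30s here
    timeoutSec: 1
    healthyThreshold: 1
    unhealthyThreshold: 2
```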


freehan commented Jun 22, 2020

@rubensayshi Yes. You pointed out the key gap. The 502s are caused by the race between two parallel operations: (1) old endpoints getting removed, and (2) new endpoints getting added. If (2) is slower than (1), then you end up with 502s during the transition.

I would suggest replacing the sleep X here with something slightly more sophisticated, for instance validating that the v2 app is taking traffic and performing as expected. Then the rollout system can decide whether to roll back or proceed with the rollout. Hence the workflow looks like:

app: myapp; version: v1 -> app: myapp -> monitor/validation -> app: myapp; version: v2 -> monitor/validation

FYI, we are building out something that would allow a more seamless transition for blue/green deployments. That is going to be exposed in a higher-level API, so one would not need to tweak the service selector.


rubensayshi commented Jun 23, 2020

hey @freehan, if there's something better on the horizon then I think we'll settle for the almost-perfect solution with the sleep. and yeah, the sleep was just a rough example; I already put the version in a response header and just ping the endpoint until I get 100% of the new version :D

is that "something better" to be expected in the near future? end of Q2 or beginning of Q3? and how can I make sure I won't miss its release, will it be listed in the GKE release notes? https://cloud.google.com/kubernetes-engine/docs/release-notes

and thanks for replying, especially to such an old issue <3


haizaar commented Feb 18, 2021

Google acknowledges they have an issue, and there is a public bug we can track: https://issuetracker.google.com/issues/180490128

(i.e. people still stumble upon this one)

@rubensayshi

nice, thanks for sharing @haizaar, I'll subscribe to that issue.


haizaar commented Mar 3, 2021

I had a chance to talk to a GKE PM regarding this problem. While they won't fix the current issue (which is caused by NEG switching), they do have plans to alleviate it by supporting the K8s Gateway API, which is really great since we would be able to do at least basic traffic management and routing without bringing in the heavyweight likes of Istio.

If you are on GKE, reach out to your GCP account manager to join a private preview of this feature once it's available. Or use other K8s Gateway API implementations that are already available.

/CC @keidrych
