
Document how to avoid 502s #34

Open
bowei opened this issue Oct 11, 2017 · 32 comments
Assignees
Labels
kind/documentation Categorizes issue or PR as related to documentation. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@bowei
Member

bowei commented Oct 11, 2017

From @esseti on September 20, 2017 9:22

Hello,
I have a problem with the Ingress: the 502 page pops up when there are "several" requests. I have JMeter spinning up 10 threads, 20 times each, and I get more than 50 502s over 2000 calls in total (roughly 2.5%).

Reading the README, it says that this error is probably due to:

The loadbalancer is probably bootstrapping itself.

but the load balancer is already there, so does it mean that all the pods serving that URL are busy? Is there a way to avoid the 502 while waiting for a pod to be free?

If not, is there a way to customize the 502 page? I expose APIs in JSON format, and I would like to return a JSON error rather than an HTML page.

Copied from original issue: kubernetes/ingress-nginx#1396

@bowei
Member Author

bowei commented Oct 11, 2017

From @nicksardo on September 26, 2017 16:12

https://serverfault.com/questions/849230/is-there-a-way-to-use-customized-502-page-for-load-balancer
This is a question for GCP, not the ingress controller. Though, I suggest you investigate why you're getting 502s.

@bowei
Copy link
Member Author

bowei commented Oct 11, 2017

From @esseti on September 29, 2017 9:53

Regarding the repeated 502s, I found out that it's due to how long the LB keeps the connection alive versus the keepalive timeout the container provides. It's explained here (point 3): https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340

In my case I also added a 5s timeout to the probe; I'm not sure that's what did it, but it solved the 502s (I'm using uWSGI).
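For reference, a probe timeout like the one mentioned above is set on the container spec; a minimal sketch (the container name, image, and port here are hypothetical):

```yaml
containers:
- name: app                  # hypothetical container name
  image: my-app:latest       # hypothetical image
  readinessProbe:
    httpGet:
      path: /
      port: 8080
    timeoutSeconds: 5        # give slow responses 5s before the probe fails
    periodSeconds: 10
```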

@montanaflynn

I also get Google's 502 HTML error page and would like to understand why, and how to avoid it or customize the response. The backend pods have been running without restarting, but maybe 1 in 1000 requests still returns a 502. I'm using GKE with an Ingress that routes to an API pod running nginx.

@esseti

esseti commented Oct 27, 2017

@montanaflynn have you tried this (point 3)? https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340 (I also had to increase the probe timeout; my comments are on that page.)
I've solved the 502 problem; I'm still not able to modify the error page, but that's not part of Kubernetes.

@mikesparr

mikesparr commented Oct 28, 2017

Using GKE 1.7.8 (Google Cloud).

I'm getting these too, and the backend services show 2/2 for cluster health and are green. I'm using Ingress with kube-lego and gce for TLS provisioning. One app served by the Ingress (name-based virtual hosts from a single Ingress) works fine, but the other app returns a 502 on every other request, which is blocking QA. There were zero issues with either app during initial QA when they were behind a LoadBalancer service. I changed to NodePort services, added an Ingress with TLS in front of them, and am now plagued with 502 errors.

I updated the liveness and readiness probes, and confirmed 200 responses both at the probe URI and at /, but I'm still getting these errors.

Redacted Ingress Config

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "kubernetes-ingress-stg"
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "gce"
spec:
  tls:
  - hosts:
    - eval.redacted-site1.com
    - eval.redacted-site2.com
    secretName: legacy-tls
  rules:
  - host: eval.redacted-site1.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site1-app
          servicePort: 80
  - host: eval.redacted-site2.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site2-app
          servicePort: 80

@montanaflynn

@esseti I tried increasing the timeout as suggested, but I still get 502s.

Also, like @mikesparr, we're using TLS with the Ingress (not acme) and NodePort services.

@mikesparr

I increased the keepalive too, per the recommendation, and it didn't fix the issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2018
@lunemec

lunemec commented Feb 23, 2018

We're also having these issues. We use NodePort on our services and a TLS Ingress with kube-lego.

We noticed 502 right after this message showed up in kubectl get events:

2m          17d          2637     ingressID                                  Ingress                                                     Normal    Service                 loadbalancer-controller                            default backend set to serviceID

Any ideas on how to figure out what is causing these 502s?

@lunemec

lunemec commented Feb 23, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2018
@gylu

gylu commented Mar 30, 2018

I'm seeing these too. I'm using 1.8.6 with Kubefed, trying to set up a federated Ingress (which I think I got working), but I now keep getting 502s, and there's nothing to log or debug except Stackdriver, which shows the 502s.

My backends occasionally report "unhealthy", though, for some reason...

Even though I've satisfied this requirement:
Services exposed through an Ingress must serve a response with HTTP 200 status to the GET requests on / path. This is used for health checking. If your application does not serve HTTP 200 on /, the backend will be marked unhealthy and will not get traffic.
https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer
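(For context: when the serving container defines a readinessProbe, the GCE ingress controller uses its path for the load balancer health check, so the probe and the health check should agree. A minimal fragment, with a hypothetical port:)

```yaml
readinessProbe:
  httpGet:
    path: /        # must return HTTP 200; also used for the LB health check
    port: 8080
```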

@mikesparr

mikesparr commented Mar 31, 2018 via email

@nicksardo nicksardo changed the title [GCE] 502 page - how to avoid it and how to personalize it [GCE] Document how to avoid 502s May 4, 2018
@nicksardo nicksardo changed the title [GCE] Document how to avoid 502s Document how to avoid 502s May 4, 2018
@nicksardo nicksardo added the kind/documentation Categorizes issue or PR as related to documentation. label May 4, 2018
@petercgrant

The simple way to avoid 502s: set up a cluster that hosts only your app and does not use preemptible nodes or cluster node-pool autoscaling, and schedule app downtime for node upgrades.

If you want to avoid 502s and also want cluster autoscaling, preemptible nodes, or zero downtime, you probably need to switch to the nginx ingress controller. Its L7 load balancer lives in the cluster and is able to respond faster and more proactively to events occurring in the cluster. The built-in retry logic also helps.

The GCE ingress controller creates an L7 load balancer that communicates with a Kubernetes NodePort service. If you use the default settings for your service, externalTrafficPolicy will be set to Cluster, meaning every node will forward requests to the nodes that host the pods backing the service (which I'll just call your app). If you leave externalTrafficPolicy=Cluster, any node can cause 502s and timeouts, even if it is not running your app. Examples: you take an unrelated node down for an upgrade, even if you cordon/drain it properly; a random node in your cluster crashes (as @mikesparr noted, this could be a node out of memory); you use preemptible nodes; you have some nodes that are over-provisioned (traffic forwarding can be CPU starved?). One final note about leaving externalTrafficPolicy=Cluster: the backend for your app will show as available with N instances, where N is the number of nodes in your cluster, and it will show as available in every zone where you have nodes. This is misleading, because requests have to be served by the nodes hosting the pods, which can be an arbitrarily small subset of the cluster nodes. Maybe this situation will improve with network endpoint group support in the GCE ingress controller?

Setting externalTrafficPolicy=Local on the NodePort service will prevent nodes from forwarding traffic to other nodes. Health checks will fail for any node not hosting your app, and your load balancer backend will show the proper number of nodes serving your app and the zones they're in. This removes the 502s caused by unrelated nodes, but you will still get them if a node hosting your app is unhealthy. Note that we've just introduced a new failure mode: if a pod dies on a node, your service is down on that node. You can mitigate this by ensuring at least two pods are running on each node where your service lives. You may also want to set strategy.rollingUpdate.maxUnavailable=0 on your deployment so it will create new pods before deleting the old ones. The GCE ingress controller on GKE adds a minute to whatever health check interval I set, which is too slow to detect a dead node.
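The two settings described above can be sketched roughly as follows (service/deployment names, labels, image, and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                     # hypothetical name
spec:
  type: NodePort
  externalTrafficPolicy: Local     # only nodes hosting the pods pass LB health checks
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # bring new pods up before taking old ones down
      maxSurge: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:latest       # hypothetical image
        ports:
        - containerPort: 8080
```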

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 18, 2018
@metral

metral commented Sep 18, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 18, 2018
@wminshew

This thread was helpful for (fingers crossed) eliminating my 502s. Just in case they come back, though: is it possible to customize the response body? I've dug quite a bit without luck, so I'm guessing no, but I'm asking here to be extra sure. (I noticed the original title of this issue mentioned personalizing 502s as well.)

@bowei
Member Author

bowei commented Nov 6, 2018

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 6, 2018
@acasademont

I believe some of these issues will be solved by the new Network Endpoint Group load balancing (https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing), since it removes the extra network hop through kube-proxy.

@Arconapalus

Would you be able to deploy container-native load balancing alongside the nginx-ingress-controller?

@stefanotto

Hi, where exactly can I set the two NGINX settings described above?

keepalive_timeout 650;
keepalive_requests 10000;

I have an Ingress based on the nginx-ingress-controller. How exactly can I pass these to the NGINX instance used in the image?
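For anyone finding this later: assuming the community ingress-nginx controller, values like these are normally set through the controller's ConfigMap rather than per Ingress. A sketch (the ConfigMap name and namespace must match your controller's `--configmap` argument; the option names are the ones documented by ingress-nginx):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller   # must match the controller's --configmap argument
  namespace: ingress-nginx         # depends on your install
data:
  keep-alive: "650"                # maps to nginx keepalive_timeout (seconds)
  keep-alive-requests: "10000"     # maps to nginx keepalive_requests
```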

@mikesparr

mikesparr commented Feb 12, 2019 via email

@mikesparr

mikesparr commented Feb 12, 2019 via email

@stefanotto

Perfect. Thank you so much @mikesparr

@keperry

keperry commented Mar 13, 2019

Is the solution here to move to the nginx ingress controller? That seems like a good workaround. Are there any downsides to doing this? Will it still work with a NodePort service?
It seems a little crazy to me that this isn't fixed in the GCE controller.

@mikesparr

mikesparr commented Mar 13, 2019 via email

@stefanotto

Hi, I'm still experiencing this issue, although I added the upstream-keepalive-requests and upstream-keepalive-timeout settings to the ConfigMap. I did not actually set up the k8s cluster myself. A ConfigMap named nginx-ingress-controller was already present, so I assumed it was used by the nginx ingress, but now I'm not so sure it is. How can I verify that the map is in use and that the settings are actually applied by NGINX? Sadly, I find the documentation far from clear on how the ConfigMap is connected to the nginx ingress load balancer.

@axot

axot commented Apr 12, 2019

There are some things you could try:

  1. use NEGs (Network Endpoint Groups)
  2. implement graceful shutdown using a preStop hook
  3. turn HTTP keep-alive off
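Items 1 and 2 above can be sketched roughly as follows (the NEG annotation follows GKE's container-native load balancing docs; names, ports, and the sleep duration are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                                  # hypothetical name
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # 1. route via NEGs, skipping the node hop
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app                    # hypothetical pod; normally part of a Deployment template
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: my-app:latest          # hypothetical image
    lifecycle:
      preStop:                    # 2. keep serving briefly while the LB drains the endpoint
        exec:
          command: ["sleep", "15"]
```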

@freehan
Contributor

freehan commented Jul 26, 2019

Ref: #769

@VGerris

VGerris commented May 14, 2020

I sent feedback, via the page below, suggesting it document that containers should return 200 at /:
https://cloud.google.com/kubernetes-engine/docs/how-to/load-balance-ingress
That would improve understanding on that point, I think. Perhaps more people can do the same so it gets added there.

@mikesparr

mikesparr commented May 14, 2020 via email

@imsrgadich

imsrgadich commented Jul 11, 2020

Changing externalTrafficPolicy from Local to Cluster fixed seemingly random 502 errors with low- or single-replica deployments for one project.

@mikesparr's comment above helped me solve my issue: it was actually throwing a CORS error. My app was a 3-replica deployment. This thread was particularly useful in solving the issue. Thank you.

@therzv

therzv commented Nov 16, 2020

But if we change externalTrafficPolicy: Local to externalTrafficPolicy: Cluster, we can no longer preserve the client IP address (the real client IP).
