
Document how to avoid 502s #34

Open
bowei opened this issue Oct 11, 2017 · 32 comments
Assignees
Labels
kind/documentation Categorizes issue or PR as related to documentation. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@bowei
Member

bowei commented Oct 11, 2017

From @esseti on September 20, 2017 9:22

Hello,
I have a problem with the Ingress: the 502 page pops up when there are "several" requests. I have JMeter spinning up 10 threads, 20 times each, and I get more than 50 502s over 2000 calls in total (roughly 2.5%).

Reading the README, it says that this error is probably due to:

The loadbalancer is probably bootstrapping itself.

but the load balancer is already there, so does it mean that all the pods serving that URL are busy? Is there a way to avoid the 502 while waiting for a pod to be free?

If not, is there a way to customize the 502 page? I expose APIs in JSON format, and I would like to return a JSON error rather than an HTML page.

Copied from original issue: kubernetes/ingress-nginx#1396

@bowei
Member Author

bowei commented Oct 11, 2017

From @nicksardo on September 26, 2017 16:12

https://serverfault.com/questions/849230/is-there-a-way-to-use-customized-502-page-for-load-balancer
This is a question for GCP, not the ingress controller. Though, I suggest you investigate why you're getting 502s.

@bowei
Copy link
Member Author

bowei commented Oct 11, 2017

From @esseti on September 29, 2017 9:53

Regarding the repeated 502s, I found out that it's due to how long the LB keeps the connection alive versus the keepalive timeout the container provides. It's explained here (point 3): https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340

In my case I also added a 5s timeout to the probe; I'm not sure that's what did it, but it solved the 502s (I'm using uWSGI).
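For reference, a probe timeout like the one mentioned above is set on the container spec; a minimal sketch (the container name, image, and port here are hypothetical):

```yaml
containers:
- name: app                  # hypothetical container name
  image: my-app:latest       # hypothetical image
  readinessProbe:
    httpGet:
      path: /
      port: 8080
    timeoutSeconds: 5        # give slow responses 5s before the probe fails
    periodSeconds: 10
```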

@montanaflynn

I also get Google's 502 HTML error page and would like to understand why, and how to avoid it or customize the response. The backend pods have been running without restarting, but maybe 1 in 1000 requests still returns a 502. I'm using GKE with an Ingress that routes to an API pod running nginx.

@esseti

esseti commented Oct 27, 2017

@montanaflynn have you tried this (point 3)? https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340 (I also had to increase the probe timeout; my comments are on that page.)
I've solved the 502 problem; I'm still not able to modify the error page, but that's not part of Kubernetes.

@mikesparr

mikesparr commented Oct 28, 2017

Using GKE 1.7.8 (Google Cloud).

I'm getting these too, and the backend services show 2/2 for cluster health and are green. I'm using Ingress with kube-lego and gce for TLS provisioning. One app served by the Ingress (name-based virtual hosts from a single Ingress) works fine, but the other app returns a 502 on every other request, which is blocking QA. There were zero issues with either app during initial QA when they were behind a LoadBalancer service. I changed to NodePort services, added an Ingress with TLS in front of them, and am now plagued with 502 errors.

I updated the liveness and readiness probes, and confirmed 200 responses both at the probe URI and at /, but I'm still getting these errors.

Redacted Ingress Config

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "kubernetes-ingress-stg"
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "gce"
spec:
  tls:
  - hosts:
    - eval.redacted-site1.com
    - eval.redacted-site2.com
    secretName: legacy-tls
  rules:
  - host: eval.redacted-site1.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site1-app
          servicePort: 80
  - host: eval.redacted-site2.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site2-app
          servicePort: 80

@montanaflynn

@esseti I tried increasing the timeout as suggested, but I still get 502s.

Also, like @mikesparr, we're using TLS with the Ingress (not acme) and NodePort services.

@mikesparr

I increased the keepalive too, per the recommendation, and it didn't fix the issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2018
@lunemec

lunemec commented Feb 23, 2018

We're also having these issues. We use NodePort on our services and a TLS Ingress with kube-lego.

We noticed 502 right after this message showed up in kubectl get events:

2m          17d          2637     ingressID                                  Ingress                                                     Normal    Service                 loadbalancer-controller                            default backend set to serviceID

Any ideas on how to figure out what is causing these 502s?

@lunemec

lunemec commented Feb 23, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2018
@gylu

gylu commented Mar 30, 2018

I'm seeing these too. I'm using 1.8.6 with Kubefed, trying to set up a federated Ingress (which I think I got working), but I now keep getting 502s, and there's nothing to log or debug except Stackdriver, which shows the 502s.

My backends occasionally report "unhealthy", though, for some reason...

Even though I've satisfied this requirement:
Services exposed through an Ingress must serve a response with HTTP 200 status to the GET requests on / path. This is used for health checking. If your application does not serve HTTP 200 on /, the backend will be marked unhealthy and will not get traffic.
https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer
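(For context: when the serving container defines a readinessProbe, the GCE ingress controller uses its path for the load balancer health check, so the probe and the health check should agree. A minimal fragment, with a hypothetical port:)

```yaml
readinessProbe:
  httpGet:
    path: /        # must return HTTP 200; also used for the LB health check
    port: 8080
```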

@mikesparr

mikesparr commented Mar 31, 2018 via email

@nicksardo nicksardo changed the title [GCE] 502 page - how to avoid it and how to personalize it [GCE] Document how to avoid 502s May 4, 2018
@nicksardo nicksardo changed the title [GCE] Document how to avoid 502s Document how to avoid 502s May 4, 2018
@nicksardo nicksardo added the kind/documentation Categorizes issue or PR as related to documentation. label May 4, 2018
@petercgrant

The simple way to avoid 502s: set up a cluster that hosts only your app and does not use preemptible nodes or cluster node-pool autoscaling, and schedule app downtime for node upgrades.

If you want to avoid 502s and also want cluster autoscaling, preemptible nodes, or zero downtime, you probably need to switch to the nginx ingress controller. Its L7 load balancer lives in the cluster and is able to respond faster and more proactively to events occurring in the cluster. The built-in retry logic also helps.

The GCE ingress controller creates an L7 load balancer that communicates with a Kubernetes NodePort service. If you use the default settings for your service, externalTrafficPolicy will be set to Cluster, meaning every node will forward requests to the nodes that host the pods backing the service (which I'll just call your app). If you leave externalTrafficPolicy=Cluster, any node can cause 502s and timeouts, even if it is not running your app. Examples: you take an unrelated node down for an upgrade, even if you cordon/drain it properly; a random node in your cluster crashes (as @mikesparr noted, this could be a node out of memory); you use preemptible nodes; you have some nodes that are over-provisioned (traffic forwarding can be CPU starved?). One final note about leaving externalTrafficPolicy=Cluster: the backend for your app will show as available with N instances, where N is the number of nodes in your cluster, and it will show as available in every zone where you have nodes. This is misleading, because requests have to be served by the nodes hosting the pods, which can be an arbitrarily small subset of the cluster nodes. Maybe this situation will improve with network endpoint group support in the GCE ingress controller?

Setting externalTrafficPolicy=Local on the NodePort service will prevent nodes from forwarding traffic to other nodes. Health checks will fail for any node not hosting your app, and your load balancer backend will show the proper number of nodes serving your app and the zones they're in. This removes the 502s caused by unrelated nodes, but you will still get them if a node hosting your app is unhealthy. Note that we've just introduced a new failure mode: if a pod dies on a node, your service is down on that node. You can mitigate this by ensuring at least two pods are running on each node where your service lives. You may also want to set strategy.rollingUpdate.maxUnavailable=0 on your deployment so it will create new pods before deleting the old ones. The GCE ingress controller on GKE adds a minute to whatever health check interval I set, which is too slow to detect a dead node.
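The two settings described above can be sketched roughly as follows (service/deployment names, labels, image, and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                     # hypothetical name
spec:
  type: NodePort
  externalTrafficPolicy: Local     # only nodes hosting the pods pass LB health checks
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # bring new pods up before taking old ones down
      maxSurge: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:latest       # hypothetical image
        ports:
        - containerPort: 8080
```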

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 18, 2018
@metral

metral commented Sep 18, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 18, 2018
@wminshew

This thread was helpful for (fingers crossed) eliminating my 502s. Just in case they come back, though: is it possible to customize the response body? I've dug quite a bit without luck, so I'm guessing no, but I'm asking here to be extra sure. (I noticed the original title of this issue mentioned personalizing 502s as well.)

@bowei
Member Author

bowei commented Nov 6, 2018

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 6, 2018
@acasademont

I believe some of these issues will be solved by the new Network Endpoint Group load balancing (https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing), since it removes the extra network hop through kube-proxy.

@Arconapalus

Would you be able to deploy container-native load balancing alongside the nginx-ingress-controller?

@stefanotto

Hi, where exactly can I set the two NGINX settings described above?

keepalive_timeout 650;
keepalive_requests 10000;

I have an Ingress based on the nginx-ingress-controller. How exactly can I pass these to the NGINX instance used in the image?
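For anyone finding this later: assuming the community ingress-nginx controller, values like these are normally set through the controller's ConfigMap rather than per Ingress. A sketch (the ConfigMap name and namespace must match your controller's `--configmap` argument; the option names are the ones documented by ingress-nginx):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller   # must match the controller's --configmap argument
  namespace: ingress-nginx         # depends on your install
data:
  keep-alive: "650"                # maps to nginx keepalive_timeout (seconds)
  keep-alive-requests: "10000"     # maps to nginx keepalive_requests
```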

@mikesparr

mikesparr commented Feb 12, 2019 via email

@mikesparr

mikesparr commented Feb 12, 2019 via email

@stefanotto

Perfect. Thank you so much @mikesparr

@keperry

keperry commented Mar 13, 2019

Is the solution here to move to the nginx ingress controller? That seems like a good workaround. Are there any downsides to doing this? Will it still work with a NodePort service?
It seems a little crazy to me that this isn't fixed in the GCE controller.

@mikesparr

mikesparr commented Mar 13, 2019 via email

@stefanotto

Hi, I'm still experiencing this issue, although I added the upstream-keepalive-requests and upstream-keepalive-timeout settings to the ConfigMap. I did not actually set up the k8s cluster myself. A ConfigMap named nginx-ingress-controller was already present, so I assumed it was used by the nginx ingress, but now I'm not so sure it is. How can I verify that the map is in use and that the settings are actually applied by NGINX? Sadly, I find the documentation far from clear on how the ConfigMap is connected to the nginx ingress load balancer.

@axot

axot commented Apr 12, 2019

There are some things you could try:

  1. use NEGs (Network Endpoint Groups)
  2. implement graceful shutdown using a preStop hook
  3. turn HTTP keep-alive off
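Items 1 and 2 above can be sketched roughly as follows (the NEG annotation follows GKE's container-native load balancing docs; names, ports, and the sleep duration are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                                  # hypothetical name
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # 1. route via NEGs, skipping the node hop
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app                    # hypothetical pod; normally part of a Deployment template
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: my-app:latest          # hypothetical image
    lifecycle:
      preStop:                    # 2. keep serving briefly while the LB drains the endpoint
        exec:
          command: ["sleep", "15"]
```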

@freehan
Contributor

freehan commented Jul 26, 2019

Ref: #769

@VGerris

VGerris commented May 14, 2020

I sent feedback, via the page below, suggesting it document that containers should return 200 at /:
https://cloud.google.com/kubernetes-engine/docs/how-to/load-balance-ingress
That would improve understanding on that point, I think. Perhaps more people can do the same so it gets added there.

@mikesparr

mikesparr commented May 14, 2020 via email

@imsrgadich

imsrgadich commented Jul 11, 2020

Changing externalTrafficPolicy from Local to Cluster fixed seemingly random 502 errors with low- or single-replica deployments for one project.

@mikesparr's comment above helped me solve my issue: it was actually throwing a CORS error. My app was a 3-replica deployment. This thread was particularly useful in solving the issue. Thank you.

@therzv

therzv commented Nov 16, 2020

But if we change externalTrafficPolicy: Local to externalTrafficPolicy: Cluster, we can no longer preserve the client IP address (the real client IP).
