
Bad gateway failures of gRPC #156

Open
leokondrashov opened this issue Feb 24, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@leokondrashov
Contributor

Version & Branch

main branch, default configuration in scripts/setup/setup.cfg

Expected Behaviour

No failures in calls

Actual Behaviour

Sometimes, ~10% of invocations fail with the error gRPC timeout exceeded for function trace-func-0-18129329275763004641 - rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "text/plain; charset=utf-8"
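
For reference, the quoted error is how grpc-go reports a plain-text 502 page coming back from the ingress. Below is a minimal sketch of how such a failure can be classified on the client side; it is not the loader's actual code, and it uses the standard gRPC health-check stub plus a placeholder address as stand-ins for the function's own method and endpoint.

// Illustrative sketch only: a plain-text 502/504 reply from the gateway
// surfaces in grpc-go as codes.Unavailable, which is what the loader logs.
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/credentials/insecure"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/status"
)

func main() {
    // Placeholder address; in the cluster this would be the ingress endpoint
    // of the deployed function.
    conn, err := grpc.Dial("trace-func-0.default.example.com:80",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()

    // The health-check stub stands in for the function's own gRPC method.
    _, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
    if st, ok := status.FromError(err); ok {
        switch st.Code() {
        case codes.Unavailable:
            // The 502 (Bad Gateway) case reported above lands here.
            log.Printf("transient gateway failure: %s", st.Message())
        case codes.DeadlineExceeded:
            log.Printf("call exceeded the client-side timeout")
        }
    }
}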

Steps to Reproduce the Problem

  1. xl170, 2 nodes (master + worker)
  2. Run the attached trace in the loader (a synthetic trace with one function, 4 s runtime, invoked every ~4 seconds):
    durations.csv
    invocations.csv
    memory.csv

Additional Info

The error was observed on a number of clusters and traces.
This issue seems to be connected to #129: before failing with "Invalid argument provided...", there are messages indicating many failures, which turn out to be caused by this error.
Sometimes the error changes to 504 (Gateway Timeout) or to an unexpected EOF.
Most of the failures happen in the first minutes, but they continue throughout the experiment. Logs from the loader are attached as well.
loader_trace.txt

@leokondrashov leokondrashov added the bug Something isn't working label Feb 27, 2023
@leokondrashov
Contributor Author

leokondrashov commented Mar 1, 2023

@cvetkovic, can you please take a look and tell me whether you have any input on these error messages?

@leokondrashov
Contributor Author

These errors were caused by Knative terminating pods that were handling requests. This was in the queue-proxy container logs of one of the pods:

{"severity":"INFO","timestamp":"2023-03-02T16:33:32.872021599Z","message":"Attached drain handler from user-container"}
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.878217861Z","message":"Received TERM signal, attempting to gracefully shutdown servers."}
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.878243962Z","message":"Sleeping 30s to allow K8s propagation of non-ready state"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.872890824Z","message":"Shutting down server: admin"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.874405272Z","message":"Shutting down server: metrics"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.875615551Z","message":"Shutdown complete, exiting..."}

Four other pods were terminated at the same time (scaling down from the initial burst of pod deployments), and five gRPC failures were registered at 16:34:02.910 UTC.

Such errors might surface when we have a bursty invocation pattern: scaling down after the burst has ended might terminate a pod that is still handling a request.
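
One partial mitigation on the function side, assuming the user container is a plain Go gRPC server (this is a sketch, not vHive's actual trace-func implementation), is to handle SIGTERM and drain with GracefulStop so that in-flight invocations can finish before the process exits:

// Sketch of a graceful-shutdown pattern for the user container, assuming it
// is a plain Go gRPC server (not vHive's actual trace-func implementation):
// on SIGTERM, stop accepting new RPCs but let in-flight ones complete.
package main

import (
    "log"
    "net"
    "os"
    "os/signal"
    "syscall"

    "google.golang.org/grpc"
)

func main() {
    lis, err := net.Listen("tcp", ":50051") // placeholder port
    if err != nil {
        log.Fatal(err)
    }

    srv := grpc.NewServer()
    // The real service would be registered here, e.g. pb.RegisterXxxServer(srv, ...).

    go func() {
        sigCh := make(chan os.Signal, 1)
        signal.Notify(sigCh, syscall.SIGTERM)
        <-sigCh
        // GracefulStop waits for outstanding RPCs (the ~4 s invocations)
        // to finish before the server exits.
        srv.GracefulStop()
    }()

    if err := srv.Serve(lis); err != nil {
        log.Fatal(err)
    }
}

This does not prevent the autoscaler from picking a busy pod, but combined with queue-proxy's 30 s drain and a sufficiently large termination grace period it should let a 4 s invocation complete before the pod goes away.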

The question is how the autoscaler should detect working pods and avoid terminating them before they finish their requests. @anshalshukla, do you have anything on that matter from a theoretical standpoint?

@cvetkovic
Contributor

What do you use for sandboxing? containerd or Firecracker?

@leokondrashov
Contributor Author

containerd. The configuration comes from running ./scripts/setup/create_multinode.sh <nodes> with the default config in scripts/setup/setup.cfg.

@cvetkovic
Contributor

I have personally never had any issues like what you posted with containerd, and I set up a cluster last week that still works without any issues. With Firecracker, I did have a lot of gRPC timeouts, as the uVMs got killed for some reason I never managed to explain, probably a buggy implementation of the kubelet <-> Firecracker-containerd interface.

Another thought: maybe your cluster is too small to make everything fit. I see that the PR that updates Knative to v1.9 plays with tainting, and that may have implications.

As for your question about how running pods are detected, I think the activator takes care of that, i.e., it has the list of all sandbox endpoints provided by the Kubernetes endpoint controller. The autoscaler just issues a decision to scale up or down to some number of instances.
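
As an illustration of that bookkeeping (a client-go sketch, not the activator's actual code; the namespace and service name below are placeholders), the ready and not-ready addresses that the endpoint controller maintains for a revision's service can be read like this:

// Sketch only (client-go, not the activator's code): read the Endpoints
// object that the endpoint controller maintains for a revision's service.
// The namespace and service name below are placeholders.
package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    ep, err := clientset.CoreV1().Endpoints("default").Get(
        context.TODO(), "trace-func-0-00001-private", metav1.GetOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Ready addresses are the pods the activator can route to; not-ready
    // ones are starting up or draining (e.g. after the TERM signal above).
    for _, subset := range ep.Subsets {
        for _, addr := range subset.Addresses {
            fmt.Println("ready:", addr.IP)
        }
        for _, addr := range subset.NotReadyAddresses {
            fmt.Println("not ready:", addr.IP)
        }
    }
}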

@leokondrashov
Contributor Author

I don't think this has anything to do with the size of the cluster. For the attached trace, it used at most 25 pods with an 83m CPU limit each (25 × 83m ≈ 2075m), so roughly 2 cores in total.
