
Bad gateway failures of gRPC #156

Open
leokondrashov opened this issue Feb 24, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@leokondrashov
Contributor

Version & Branch

main branch, default configuration in scripts/setup/setup.cfg

Expected Behaviour

No failures in calls

Actual Behaviour

Sometimes, ~10% of invocations fail with the error gRPC timeout exceeded for function trace-func-0-18129329275763004641 - rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "text/plain; charset=utf-8"
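
For reference, the quoted error is how grpc-go reports a plain-text 502 page coming back from the ingress. Below is a minimal sketch of how such a failure can be classified on the client side; it is not the loader's actual code, and it uses the standard gRPC health-check stub plus a placeholder address as stand-ins for the function's own method and endpoint.

// Illustrative sketch only: a plain-text 502/504 reply from the gateway
// surfaces in grpc-go as codes.Unavailable, which is what the loader logs.
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/credentials/insecure"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/status"
)

func main() {
    // Placeholder address; in the cluster this would be the ingress endpoint
    // of the deployed function.
    conn, err := grpc.Dial("trace-func-0.default.example.com:80",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()

    // The health-check stub stands in for the function's own gRPC method.
    _, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
    if st, ok := status.FromError(err); ok {
        switch st.Code() {
        case codes.Unavailable:
            // The 502 (Bad Gateway) case reported above lands here.
            log.Printf("transient gateway failure: %s", st.Message())
        case codes.DeadlineExceeded:
            log.Printf("call exceeded the client-side timeout")
        }
    }
}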

Steps to Reproduce the Problem

  1. xl170, 2 nodes (master + worker)
  2. Run the attached trace in the loader (a synthetic trace with one function, 4 s runtime, invoked every ~4 seconds):
    durations.csv
    invocations.csv
    memory.csv

Additional Info

The error was observed on a number of clusters and traces.
This issue seems to be connected to #129: before failing with "Invalid argument provided...", there are messages indicating many failures, which turn out to be caused by this error.
Sometimes the error changes to 504 (Gateway Timeout) or to an unexpected EOF.
Most of the failures happen in the first minutes, but they continue throughout the experiment. Logs from the loader are attached as well.
loader_trace.txt

@leokondrashov leokondrashov added the bug Something isn't working label Feb 27, 2023
@leokondrashov
Contributor Author

leokondrashov commented Mar 1, 2023

@cvetkovic, can you please take a look and tell me whether you have any input on these error messages?

@leokondrashov
Contributor Author

These errors were caused by Knative terminating pods that were handling requests. This was in the queue-proxy container logs of one of the pods:

{"severity":"INFO","timestamp":"2023-03-02T16:33:32.872021599Z","message":"Attached drain handler from user-container"}
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.878217861Z","message":"Received TERM signal, attempting to gracefully shutdown servers."}
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.878243962Z","message":"Sleeping 30s to allow K8s propagation of non-ready state"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.872890824Z","message":"Shutting down server: admin"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.874405272Z","message":"Shutting down server: metrics"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.875615551Z","message":"Shutdown complete, exiting..."}

Four other pods were terminated at the same time (scaling down from the initial burst of pod deployments), and five gRPC failures were registered at 16:34:02.910 UTC.

Such errors might surface when we have a bursty invocation pattern: scaling down after the burst has ended might terminate a pod that is still handling a request.
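
One partial mitigation on the function side, assuming the user container is a plain Go gRPC server (this is a sketch, not vHive's actual trace-func implementation), is to handle SIGTERM and drain with GracefulStop so that in-flight invocations can finish before the process exits:

// Sketch of a graceful-shutdown pattern for the user container, assuming it
// is a plain Go gRPC server (not vHive's actual trace-func implementation):
// on SIGTERM, stop accepting new RPCs but let in-flight ones complete.
package main

import (
    "log"
    "net"
    "os"
    "os/signal"
    "syscall"

    "google.golang.org/grpc"
)

func main() {
    lis, err := net.Listen("tcp", ":50051") // placeholder port
    if err != nil {
        log.Fatal(err)
    }

    srv := grpc.NewServer()
    // The real service would be registered here, e.g. pb.RegisterXxxServer(srv, ...).

    go func() {
        sigCh := make(chan os.Signal, 1)
        signal.Notify(sigCh, syscall.SIGTERM)
        <-sigCh
        // GracefulStop waits for outstanding RPCs (the ~4 s invocations)
        // to finish before the server exits.
        srv.GracefulStop()
    }()

    if err := srv.Serve(lis); err != nil {
        log.Fatal(err)
    }
}

This does not prevent the autoscaler from picking a busy pod, but combined with queue-proxy's 30 s drain and a sufficiently large termination grace period it should let a 4 s invocation complete before the pod goes away.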

The question is how the autoscaler should detect working pods and avoid terminating them before they finish their requests. @anshalshukla, do you have anything on that matter from a theoretical standpoint?

@cvetkovic
Contributor

What do you use for sandboxing? containerd or Firecracker?

@leokondrashov
Contributor Author

containerd. The configuration comes from running ./scripts/setup/create_multinode.sh <nodes> with the default config in scripts/setup/setup.cfg.

@cvetkovic
Contributor

I have personally never had any issues like what you posted with containerd, and I set up a cluster last week that still works without any issues. With Firecracker, I did have a lot of gRPC timeouts, as the uVMs got killed for some reason I never managed to explain, probably a buggy implementation of the kubelet <-> Firecracker-containerd interface.

Another thought: maybe your cluster is too small to make everything fit. I see that the PR that updates Knative to v1.9 plays with tainting, and that may have implications.

As for your question about how running pods are detected, I think the activator takes care of that, i.e., it has the list of all sandbox endpoints provided by the Kubernetes endpoint controller. The autoscaler just issues a decision to scale up or down to some number of instances.
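
As an illustration of that bookkeeping (a client-go sketch, not the activator's actual code; the namespace and service name below are placeholders), the ready and not-ready addresses that the endpoint controller maintains for a revision's service can be read like this:

// Sketch only (client-go, not the activator's code): read the Endpoints
// object that the endpoint controller maintains for a revision's service.
// The namespace and service name below are placeholders.
package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    ep, err := clientset.CoreV1().Endpoints("default").Get(
        context.TODO(), "trace-func-0-00001-private", metav1.GetOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Ready addresses are the pods the activator can route to; not-ready
    // ones are starting up or draining (e.g. after the TERM signal above).
    for _, subset := range ep.Subsets {
        for _, addr := range subset.Addresses {
            fmt.Println("ready:", addr.IP)
        }
        for _, addr := range subset.NotReadyAddresses {
            fmt.Println("not ready:", addr.IP)
        }
    }
}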

@leokondrashov
Contributor Author

I don't think this has anything to do with the size of the cluster. For the attached trace, it used at most 25 pods with an 83m CPU limit each (25 × 83m ≈ 2075m), so roughly 2 cores in total.
