
Overload Manager - Max Global Downstream Connections #6309

Closed
seth-epps opened this issue Mar 28, 2024 · 8 comments · May be fixed by #6794
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/needs-triage: Indicates that an issue needs to be triaged by a project contributor.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@seth-epps
Contributor

seth-epps commented Mar 28, 2024

As part of the edge best practices for running Envoy, in addition to the max-heap resource monitors, a global connection limit is set:
https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge#best-practices-edge

layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      envoy:
        resource_limits:
          listener:
            example_listener_name:
              connection_limit: 10000
      overload:
        global_downstream_max_connections: 50000

This is particularly helpful for one of our cases where connections came in so rapidly that the heap resource monitor based actions couldn't kick in and Envoy ended up in a death loop.

Adding the limit itself is not particularly invasive; however, an extension of our requirements was to make it possible to fail readiness checks while continuing to pass liveness checks, which leads to some extra listener configuration requirements. The way I've addressed this is by using ignore_global_conn_limit on the configured stats/admin listeners. By default all listeners ignore the connection limit, but we add an optional configuration to set up a listener that has the flag set to false.
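
For illustration, here is a rough sketch of what an extra stats/health listener that does respect the global limit could look like on the Envoy side. This is only a sketch: the listener name and port are made up, and the key field is Envoy's ignore_global_conn_limit.

# Hypothetical extra stats/health listener; the default stats listeners keep
# ignore_global_conn_limit: true so metrics and liveness stay reachable.
- name: stats-health-om-enforced        # illustrative name
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8003                  # illustrative port
  # Connections here count against the global downstream connection limit and
  # are rejected once it is reached.
  ignore_global_conn_limit: false
  filter_chains:
  # ... same HTTP connection manager / health routes as the default stats listener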

The draft PR I've opened (#6308) does a minimal configuration update, but it makes for very cumbersome interactions between the serve-context config, the Contour config CRD, and how it's passed to the underlying internal envoy package. Before I make any other changes (and add more tests, etc.) I'd like to get some feedback on a better path forward re: configuration.

My proposal would be to shift the configuration of the listeners up to the serve command, which would be responsible for constructing the listener configurations from its parameters, effectively moving the core branching logic out of v3.StatsListeners into main.doServe. I see an added benefit of reducing how deep the Contour APIs get passed into the internal components, and it keeps the configuration isolated to the serve command instead of distributed to the inner parts. I suspect this may simplify some of the test configuration as well, so we don't have to rely on assumptions about the listeners returned (e.g. taking the first listener, https://github.com/projectcontour/contour/blob/main/internal/featuretests/v3/envoy.go#L597, for granted in the discovery responses).

I wrote a proof-of-concept change to illustrate what that might look like, which, if reasonable, I will implement in my draft (#6308):
https://github.com/projectcontour/contour/compare/main...seth-epps:options-listener-config?expand=1

@seth-epps seth-epps added kind/feature Categorizes issue or PR as related to a new feature. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Mar 28, 2024
@davinci26
Contributor

cc @projectcontour/maintainers

@sunjayBhatia
Member

My proposal would be to shift the configuration of the listeners up to the serve command

An issue with this is Gateway API Gateway reconciliation: Listeners can be dynamically changed without a Contour restart/recreate, so they do need to be generated similarly to other resources.

@sunjayBhatia
Member

My proposal would be to shift the configuration of the listeners up to the serve command

An issue with this is Gateway API Gateway reconciliation: Listeners can be dynamically changed without a Contour restart/recreate, so they do need to be generated similarly to other resources.

Ah, I guess this is just concerning the admin/operator-facing Listeners, so maybe disregard this point.

@sunjayBhatia
Member

sunjayBhatia commented May 8, 2024

To summarize some of the context from here and what we have available today:

  • we do have per-Listener downstream connection limits available, but these apply only to the HTTP/HTTPS listeners for proxying app traffic, not stats/health/admin Listeners:
    MaxConnectionsPerListener *uint32 `json:"maxConnectionsPerListener,omitempty"`
  • a global connection limit would be enforced in tandem with per-Listener limits, i.e. you can set a per-Listener limit lower than the global limit
  • having a Listener ignore the global limit means its downstream connections still count against the limit, but connections to that Listener are not limited
  • in the original issue content we have an example configuring the global connection limit via runtime key, but that method is deprecated, so we need to configure it via the bootstrap overload manager resource monitor (see the sketch after this list)
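
For reference, a rough sketch of the non-deprecated bootstrap configuration, using the downstream connections resource monitor (the 50000 value just mirrors the runtime example above):

overload_manager:
  resource_monitors:
  - name: "envoy.resource_monitors.global_downstream_max_connections"
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.resource_monitors.downstream_connections.v3.DownstreamConnectionsConfig
      # replaces the deprecated overload.global_downstream_max_connections runtime key
      max_active_downstream_connections: 50000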

It makes sense to me to leave the stats endpoints ignoring the global limit; we still want to be able to serve readiness checks and stats even when Envoy is overloaded. The admin Listener (which listens on a Unix domain socket) should also be made to ignore the global limit.

I think in general the linked PR looks like it should work just fine; the only thing is the readiness/liveness check note from above:

an extension of that our requirements was to make it possible to fail readiness checks, but keep passing liveness checks

at the moment the base install YAMLs don't include a liveness check, only a readiness check:

readinessProbe:
  httpGet:
    path: /ready
    port: 8002

Are you all adding your own liveness check? If so, it would be interesting to see how that is configured and how that has worked out.

Given we only have a readiness check at the moment, if I'm understanding correctly we could just set that existing Listener to respect the global connection limit, so that instances that are overloaded are taken out of the ready Service endpoints but not shut down.

The admin/stats Listeners should always ignore the global limit, and it seems valid to have the readiness probe optionally respect the global connection limit (or always respect it, given this will only matter if you actually set a limit).


github-actions bot commented Jul 8, 2024

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2024

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 12, 2024
@seth-epps
Contributor Author

seth-epps commented Dec 6, 2024

Reviving this because I got distracted and wasn't able to keep contributing to it.

Given we only have a readiness check at the moment, if I'm understanding correctly we could just set that existing Listener to respect the global connection limit, so that instances that are overloaded are taken out of the ready Service endpoints but not shut down.

The admin/stats Listeners should always ignore the global limit, and it seems valid to have the readiness probe optionally respect the global connection limit (or always respect it, given this will only matter if you actually set a limit).

I think the way we have it running right now is the reverse: bypass the global connection limit on the default stats listeners and then configure a separate listener (:8003) that enforces the connection limit. This is done partly to retain the existing behavior of the current stats listeners not enforcing the limits, and partly to keep metrics accessible by default. It's paired with overriding the deployment readiness checks, so we end up with:

readinessProbe:
  # ...fields
  httpGet:
    path: /ready
    port: 8003 # enforces connection limit so pods are removed from endpoints
    scheme: HTTP
livenessProbe:
  # ...fields
  httpGet:
    path: /ready
    port: 8002 # does not enforce connection limit so pods stay alive until they can start serving traffic again
    scheme: HTTP

are you all adding your own liveness check? if so it would be interesting to see how that is configured and how that has worked out

Right now we inject the listener configuration, but the only real difference is the readiness/liveness probes above (configured using the helm customReadinessProbe value to point at the listener that enforces the connection limit, listening on 8003).
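
For context, the probe override is wired up through chart values along these lines. This is a sketch only: the exact value names (customReadinessProbe/customLivenessProbe under an envoy section) are an assumption about the chart in use.

# values.yaml sketch; assumes the chart exposes custom probe overrides for the Envoy pods
envoy:
  customReadinessProbe:
    httpGet:
      path: /ready
      port: 8003   # listener that enforces the global connection limit
  customLivenessProbe:
    httpGet:
      path: /ready
      port: 8002   # default stats listener, ignores the global limit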

It's been working pretty well so far. We've observed the connection limit get hit in a few instances, and things self-recovered where they otherwise would probably have overwhelmed the resources.

I'm working on cleaning up main...seth-epps:contour:globalconnections-overloadmanager and combining it with the changes needed for the listener configuration (main...seth-epps:contour:options-listener-config).

@seth-epps
Contributor Author

I've put together a proper change request now
#6794

and also a proposed refactor of how we set up the stats listeners to make the configs a bit more ergonomic (IMO). I left it out of the core PR because it's a bit out of left field and I wasn't sure if it made sense in the larger context. There are no functional changes:
seth-epps#1
