Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Fail Safe" control knob for Extension Server #4155

Closed
logan-hcg opened this issue Sep 4, 2024 · 6 comments · Fixed by #4936
Closed

"Fail Safe" control knob for Extension Server #4155

logan-hcg opened this issue Sep 4, 2024 · 6 comments · Fixed by #4936
Assignees
Labels
kind/decision A record of a decision made by the community.
Milestone

Comments

@logan-hcg
Copy link

logan-hcg commented Sep 4, 2024

Currently, the processing of Extension Server logic (ie Translate() step) is "best effort". If the Extension Server fails in some way (ie is not available, crashes during request handling), than the envoy configuration is not impacted.

In some situations, this could be a "bad thing". For example, if the Extension Server is being used to add a default Authz filter to all Listeners, if the grpc call to the Extension Server fails, than the Listener will still be activated but will not have the Authz filter (thus incorrectly exposing the resource without the desired protection).

Ideally, similar to #3873, there would be an option added to ExtensionManager which would allow either "fail open" (current behavior of best effort) or "fail closed" (alternate behavior of disabling the resource associated with the failed hook).

@arkodg arkodg added help wanted Extra attention is needed kind/decision A record of a decision made by the community. and removed triage labels Sep 4, 2024
@arkodg arkodg added this to the v1.2.0-rc1 milestone Sep 4, 2024
@arkodg
Copy link
Contributor

arkodg commented Sep 4, 2024

we've fail closed by default for all Policy APIs, should we do the same for extension manager (if unable to connect or unable to get a response in time ) ptal @envoyproxy/gateway-maintainers . If unable to connect during startup, it would mean no config would ever be programmed in envoy proxy, and if the gRPC request times out, we could skip that xDS update, and the data plane would have the last good config

@liorokman
Copy link
Contributor

Maybe add a configuration flag in the extension manager configuration section to specify if the extension manager should fail-open or fail-close?

@arkodg
Copy link
Contributor

arkodg commented Sep 4, 2024

Maybe add a configuration flag in the extension manager configuration section to specify if the extension manager should fail-open or fail-close?

yeah thats a good home to add the mode config if user expectations vary, we still need to figure out what the default mode should be - fail open or fail close 😄

I vote for fail close by default, to ensure we fail fast and early like we do for Filters and Policies

@ardikabs
Copy link
Contributor

ardikabs commented Sep 19, 2024

if i understand correctly, since the extension has several hooks (HTTPListener, VirtualHost, Route, and Translation), it would be make sense if 500 behavior is used when Route hook failing. But what about on other hooks?

@logan-hcg
Copy link
Author

logan-hcg commented Sep 19, 2024

I'm not sure about the others, but in the case of HTTPListener, in "failsafe mode" the listener should not be added to list of active listener resources if the hook fails.

@arkodg arkodg modified the milestones: v1.2.0-rc1, Backlog Oct 18, 2024
@liorokman
Copy link
Contributor

Here's a proposal around how Extension Server failures should be handled by Envoy Gateway:

Failure to connect to the Extension Server

If an Extension Server was configured but was not reachable, then the safest action is to fail-close. An Extension Server might be used to add security critical features into the XDS. If the Extension Server is not reachable then Envoy Gateway should generate a default XDS configuration for each configured Listener that returns a 500 error code for all requests.

The generated configuration should override the XDS generated by Envoy Gateway - basically creating XDS that listens on all the ports that are configured as Listeners and override all of the configuration in them.

Envoy Gateway contacts the Extension Server multiple times during the translation process from IR to XDS. The Extension Server might become unreachable at any time during the translation process. If an Extension Server is defined and becomes unreachable at any point during the translation process, all the already translated XDS should be discarded and the translation process should be stopped and restarted.

Error returned from one of the hooks in the extension server

The current Extension Server interface has four different hooks. The following table describes the semantics for each of the hooks:

  • PostRouteModifyHook - This hook allows the Extension Server to modify XDS generated for a specific route definition.
  • PostVirtualHostModifyHook - allows the Extension Server to modify XDS generated for a specific virtual host definition.
  • PostHTTPListenerModifyHook - allows the Extension Server to modify XDS generated for a specific listener definition.
  • PostTranslateModifyHook - allows the Extension Server to make any modifications it requires for the entire XDS generated by Envoy Gateway.

The following sequence diagram shows the sequence of calls to the Extension Server during the translation process:

sequenceDiagram
  participant EG as Envoy Gateway
  participant ES as Extension Server


  EG->>EG: Generate IR from CRD resources
  loop For each Listener
     EG->>EG: Generate XDS for Listener
     loop For each Route
        EG->>EG: Generate XDS for Route<br>attached to Listener
        EG->>EG: Generate XDS for VirtualHost<br>attached to Route
        activate ES
          EG->>ES: Call PostRouteModifyHook
        deactivate ES
     end
     loop For each VirtualHost attached to Routes of the Listener
        activate ES
          EG->>ES: Call PostVirtualHostModifyHook
        deactivate ES
     end

        activate ES
          EG->>ES: Call PostHTTPListenerModifyHook
        deactivate ES
  end
  activate ES
    EG->>ES: Call PostTranslateModifyHook
  deactivate ES
Loading

Each hook in the Extension Server implementation could potentially handle its internal errors itself. If the Extension Server handles its errors internally then Envoy Gateway will not be aware that an error even occurred. However, if the Extension Server returns an error then it basically delegates responsibility for the error to Envoy Gateway. In this case the safest action is to fail-close. The Extension Server could return a default XDS configuration that returns a 500 error code for all requests. The following are some examples of how the Extension Server could handle its internal errors:

  • Failing a call to PostRouteModifyHook could replace the entire route definition with a direct response that returns a 500 error code for all requests.
  • Failing a call to PostVirtualHostModifyHook could replace the entire virtual host definition with a default virtual host that returns a 500 error code for all requests.
  • Failing a call to PostHTTPListenerModifyHook could replace the entire listener definition with a listener that returns a 500 error code for all requests.
  • Failing a call to PostTranslateModifyHook could replace the entire XDS configuration with a default XDS configuration that returns a 500 error code for all requests.

Requiring the Extension Server to handle its internal errors represents overhead for the Extension Server developer. In most cases, an Extension Server developer might prefer that if the Extension Server returns an error, then Envoy Gateway should fail-close automatically using the logic described above. Defaulting to "fail-close" in this way would not require any changes to the Extension Server configuration in Envoy Gateway.

Some middle ground might also be reasonable. For example, when an Extension Server is registered in Envoy Gateway, the fail-open or fail-close behavior could be configured per enabled hook. The default should be fail-close, but the current (fail-open) behavior could be adopted instead. The configuration could look like this:

extensionManager:
  hooks:
    # The type of hooks that should be invoked
    xdsTranslator:
      post:
      - type: HTTPListener
        failOpen: false # This is the default, if the failOpen attribute is not specified
  service:
    # The service that is hosting the extension server
    fqdn: 
      hostname: extension-server.envoy-gateway-system.svc.cluster.local
      port: 5005

Note that the proposed configuration breaks backwards compatibility with the current Extension Manager configuration since the set of enabled hooks changes from a []string to a structure.

@arkodg arkodg modified the milestones: Backlog, v1.3.0-rc.1 Nov 22, 2024
@liorokman liorokman self-assigned this Dec 16, 2024
@liorokman liorokman removed the help wanted Extra attention is needed label Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/decision A record of a decision made by the community.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
@liorokman @arkodg @ardikabs @logan-hcg and others