Failover between backendRefs #1764

kflynn · 2023-02-28T16:00:39Z

First, let's talk about invalid backendRefs. The HTTPRoute specification states:

For example, if two backends are specified with equal weights, and one is invalid, 50 percent of traffic must receive a 500. Implementations may choose how that 50 percent is determined.

and the HTTPBackendRef documentation adds:

A BackendRef is invalid if:
...

It refers to a resource that does not exist. In this case, the Reason must be set to BackendNotFound and the Message of the Condition must explain which resource does not exist.

...

It appears to me (and that word "appears" is important!) that the intent here is to try to make a misconfigured backendRef very obvious, very quickly. That makes all kinds of sense to me: we clearly should surface errors quickly and meaningfully. What I'm worried about is that the specified behavior won't be meaningful to end users.

Consider the end user's experience when you define an HTTPRoute with a single valid backendRef: they see a working application. Great.

Now suppose you, as the app developer or as cluster ops, add a second backendRef -- maybe you're doing a canary release of a new version of the workload, maybe you just want two for redundancy, who knows. In any case, you botch your typing and your new backendRef points to a nonexistent resource... and boom, half your end users are facing a dead application.
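
To put a number on that scenario, here is a minimal Python sketch of the behavior quoted above, where an invalid backendRef keeps its share of the weight and that share is answered with a 500. The `(name, weight, is_valid)` tuple shape, the function name `error_share`, and the backend names are assumptions made for illustration; they are not part of Gateway API or of any implementation.

```python
from fractions import Fraction


def error_share(backend_refs):
    """Fraction of requests answered with a 500 when invalid refs keep their weight.

    backend_refs: list of (name, weight, is_valid) tuples (hypothetical shape).
    """
    total = sum(weight for _, weight, _ in backend_refs)
    if total <= 0:
        raise ValueError("total weight must be positive")
    # Weight assigned to refs that do not resolve is served as errors.
    invalid = sum(weight for _, weight, valid in backend_refs if not valid)
    return Fraction(invalid, total)


# One working backend plus one mistyped backend, equal weights.
refs = [("app", 1, True), ("app-typo", 1, False)]
print(error_share(refs))  # 1/2
```

With equal weights, the mistyped ref takes half of the traffic, which matches the "50 percent" example in the spec text.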

This feels like the opposite of how an HTTPRoute should treat the end user. 🙂 I'd like to see this changed: as long as you have valid backendRefs, I think that any invalid backendRefs should be ignored for routing while continuing to be surfaced in the Route, as currently defined.
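
As a rough illustration of that proposal, and not of any existing implementation, the following Python sketch separates the refs used for routing from the refs reported in status. The tuple shape and the name `plan_routing` are hypothetical.

```python
def plan_routing(backend_refs):
    """Split refs into those used for routing and those reported as not found.

    backend_refs: list of (name, weight, is_valid) tuples (hypothetical shape).
    Returns (routing, not_found). An empty routing list means no valid ref
    remains, so every request would still fail.
    """
    routing = [(name, weight) for name, weight, valid in backend_refs if valid]
    # Invalid refs are dropped from routing but still surfaced to the user.
    not_found = [name for name, _, valid in backend_refs if not valid]
    return routing, not_found


routing, not_found = plan_routing([("app", 1, True), ("app-typo", 1, False)])
print(routing)    # [('app', 1)]
print(not_found)  # ['app-typo']
```

Here all traffic goes to `app`, while `app-typo` would still be available for a BackendNotFound condition on the Route.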

From there, we can go on to consider backends that become unresponsive, fail health checks, etc.: in these situations, the backendRef is "valid", but the user experience will be much improved if the implementation chooses not to consider these backends in routing decisions. It's not clear to me that the spec as written allows for that; thoughts? I definitely believe that not forcing the implementation to route to a backend it knows to be dead will give a better user experience.

howardjohn · 2023-02-28T16:17:59Z

Devil's advocate:

If I have 2 backends:

- name: handle-heavy-load
  weight: 999999
- name: super-fragile
  weight: 1

and suddenly super-fragile gets all my traffic, that may be very bad.

I could maybe accept that this is fine if handle-heavy-load is invalid, but failing over to super-fragile when handle-heavy-load is just unhealthy seems inappropriate without explicit config, IMO.
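
The size of that shift can be worked out directly. The Python sketch below is only arithmetic on the two weights from the example; the helper name `shares` is an assumption, and "after" models the case where handle-heavy-load is dropped from routing entirely.

```python
from fractions import Fraction


def shares(weights):
    """Map each backend name to its fraction of the total weight."""
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


before = shares({"handle-heavy-load": 999999, "super-fragile": 1})
after = shares({"super-fragile": 1})  # handle-heavy-load no longer considered

print(before["super-fragile"])                           # 1/1000000
print(after["super-fragile"])                            # 1
print(after["super-fragile"] / before["super-fragile"])  # 1000000
```

In other words, super-fragile would go from one request in a million to every request.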

youngnick (Maintainer) · 2023-03-07T23:43:57Z

@kflynn, the way that you've read this is correct - this is designed to make implementations fail fast if there's a mistake. We made a conscious decision, knowing that this will not protect end users in this case, which sucks, but we considered the alternative worse.

At the end of the day, it came down to "respecting user intent". The only way we have to understand what the user is intending to do is what's in the API objects. I agree that it's likely that at some point, end users are going to make errors with adding backends. But we have no way of telling if it's an error, or if it's what the user wants.

For example, take doing blue/green rollouts. If you're adding two backends, like this:

  - name: app-v1
    weight: 99
  - name: app-v2
    weight: 1

If the app-v2 backend has a typo somewhere and implementations ignore errors, there's no way for the clients to know without checking the status. You could conceivably miss the status, update the app-v2 weight to 99 and the app-v1 weight to 1, and never notice, because the weight-99 app-v2 will be ignored. You won't know that it's not working until you change the app-v2 weight to 100, and all of a sudden nothing works. A similar argument applies to zero-endpoints or other "valid but not routable" backend use cases. By trying to protect end users from their own mistakes, we'd be raising the likelihood of a much worse problem down the road.
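
The rollout described here can be traced step by step. The Python sketch below compares the failure rate under the current rule with a hypothetical "ignore invalid refs" rule. The function name, the dict and set shapes, and the choice to treat a valid ref with weight 0 as unable to absorb traffic are all assumptions made for this illustration.

```python
def error_rate(weights, invalid, ignore_invalid):
    """Percent of requests that fail for one rollout step.

    weights: {backend name: weight}; invalid: set of names that do not resolve.
    ignore_invalid=False models the current rule (invalid refs keep their share).
    ignore_invalid=True models the hypothetical rule that skips invalid refs.
    """
    if ignore_invalid:
        live = sum(w for name, w in weights.items() if name not in invalid)
        # Assumption: invalid refs are skipped only while a valid ref still
        # carries weight; otherwise nothing is left to serve traffic.
        if live > 0:
            return 0.0
    total = sum(weights.values())
    bad = sum(w for name, w in weights.items() if name in invalid)
    return 100.0 * bad / total


steps = [
    {"app-v1": 99, "app-v2": 1},
    {"app-v1": 1, "app-v2": 99},
    {"app-v1": 0, "app-v2": 100},
]
invalid = {"app-v2"}  # the ref with the typo

for step in steps:
    print(error_rate(step, invalid, False), error_rate(step, invalid, True))
# 1.0 0.0
# 99.0 0.0
# 100.0 100.0
```

Under the current rule the mistake shows up as a 1% error rate at the first step; under the ignoring rule nothing fails until the last step, when everything does.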

This is what I mean by "respecting user intent" - we have no way of knowing if the user intended to set things up the way they did. Maybe they want the nonexistent service there for some reason, but at the end of the day, we decided that we had to trust that the end user knows what they're doing, and give them tools to check whether there's a mistake (i.e. the status). I think that when @kflynn and I discussed this before, he referred to this as an "SRE-centric" view, which I will cop to - I think that a lot of the design for Gateway API is centered around Gateway, which is effectively an SRE-owned resource (usually).

To me, it comes down to the fact that the use cases we are handling are complex enough that it's difficult to impossible to guess if a particular config is a mistake or not. So we have to err on the side of letting people shoot themselves in the foot. Sadly.
