5xx Series Status Code Usage (i.e. 500 vs 504) #9

travisgosselin · 2022-05-05T01:08:24Z

travisgosselin
May 5, 2022
Maintainer

In the API Standards today, the status code guidelines indicate that REST APIs should only ever surface the status codes in the provided list. This helps us align on specific status codes and avoid deviation into funky usages like 207 (Multi-Status) that our implementations should probably leave untouched, and it helps with consistency in which 4xx series codes to use. In the 5xx series, ONLY the 500 status code is indicated as something a REST API should return. That's not to say you won't encounter other 5xx status codes as a consumer, of course. Various other 5xx codes can be returned at any layer, interception point, or gateway/proxy in between. Similarly, it may be reasonable to think that non-RESTful aspects of your API implementation or middleware may have to return more specific 5xx status codes for purposes beyond the scope of the REST API Standards. Looking at the 5xx series, though, there are limited codes of interest at the REST application level:

501 - Not Implemented - Not useful as we don't expect to ever indicate paths/methods that exist but are not yet finished.
502 - Bad Gateway - Returned when a server acting as a gateway or proxy receives an invalid response from the upstream server (timeouts are covered by 504). Useful for understanding when infrastructure has not allowed a request to reach the actual API implementation. Not something you'd expect a RESTful API implementation itself to expose or indicate in a spec.
503 - Service Unavailable - Arises when the server cannot process the request and is indicative that the request cannot even be routed to any API implementation.
504 - Gateway Timeout - A response when the server is acting as a gateway and cannot get a response before timeout.
505,506,507,508,510,511 - all not worth discussing or WebDAV related (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses).

https://docs.platform.spscommerce.com/api-design/standards/request-response/#supported-status-codes

My initial interpretation is that status codes 501-504 all apply to situations you would not document in your OpenAPI specification, for example, as they are not product-related responses (by contrast, you would document a 409 Conflict on certain endpoints where, as a product, the conflict means something specific about the request). The focus in the REST API Standards is less on implementation and more on contract (what does the consumer care about).

GETTING TO THE POINT:
The concern is that API implementations will throw only 500 errors when they need to provide a server error response, leaving out the differentiation that a 504 could offer. The thought behind the 504 is that when your API implementation calls another service, it is considered a "Gateway". In the past at SPS, the 504 Gateway Timeout has been used and has been a very helpful status code, letting an operator know right away that a downstream dependency is timing out rather than there being a problem in the direct service request. This also shows up in logs and Sentry errors, giving a very quick operational understanding of where the problem is (or rather where it is not). However, differentiating a 500 from a 504 from an external consumer's perspective (either external to SPS or external to the team and operator) does not seem helpful. Both 500 and 504 are typically retriable status codes, and retrying is all a consumer can do anyhow.
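To make the past practice concrete, here is a minimal sketch of an implementation that surfaces a dependency timeout as a 504 and everything else as a 500. The `handle_request` function and its `call_dependency` parameter are hypothetical names for illustration, not anything from the standards or an SPS codebase.

```python
from typing import Any, Callable, Optional, Tuple


def handle_request(call_dependency: Callable[[], Any]) -> Tuple[int, Optional[Any]]:
    """Return (status, payload) for a request that relies on one downstream call.

    Assumption: the downstream client raises the built-in TimeoutError
    when the dependency does not answer in time.
    """
    try:
        return 200, call_dependency()
    except TimeoutError:
        # The dependency timed out: operators see 504 and look downstream.
        return 504, None
    except Exception:
        # Any other failure is reported as a generic server error.
        return 500, None
```

Under the current standards, the `TimeoutError` branch would simply return 500 as well, and the timeout would only be visible in logs.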

Should the REST API Standards call out the additional usage of the 504 status code (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504)? Should it call out the usage of other 5xx status codes?

Answered by mkokotovich

May 5, 2022

Personally, I've been trained to not rely on different 5XX status codes for much and use the response content instead. So just always using 500 with a good response seems like the easiest choice.


travisgosselin · 2022-05-05T01:08:45Z

travisgosselin
May 5, 2022
Maintainer Author

As a point of reference and an alternative perspective, CloudFoundry provides a distinct look at how it uses 5xx server errors in its REST API guidelines to indicate different types of dependency failures (external or internal), and it explicitly doesn't use the 504 status code at all. Its interpretations of the 502 and 503 status codes are quite different from my interpretations and past usage. This is a good example of why it is necessary to align on these across the organization, as I imagine our internal perspectives can be very different as well.

502 Bad Gateway - This status MUST be returned when an external service failure causes a request to fail.
503 Service Unavailable - This status MUST be returned when an internal service failure causes a request to fail.

https://github.com/cloudfoundry/cc-api-v3-style-guide#server-errors
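As a rough illustration of the two CloudFoundry rules quoted above, the mapping could look like the sketch below. `FailureSource` and `cloudfoundry_style_status` are hypothetical names; only the 502/503 assignment comes from the quoted guideline.

```python
from enum import Enum


class FailureSource(Enum):
    """Where the failing dependency lives (hypothetical classification)."""
    EXTERNAL = "external"
    INTERNAL = "internal"


def cloudfoundry_style_status(source: FailureSource) -> int:
    """Map a dependency failure to a status code per the quoted rules."""
    if source is FailureSource.EXTERNAL:
        return 502  # external service failure
    return 503  # internal service failure
```

The hard part is not the mapping but deciding which `FailureSource` applies to a given failure.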

Thoughts on the way 502 and 503 are used here?

2 replies

mkokotovich May 5, 2022

That doesn't seem like a very standard usage of those two codes. And while I see the advantage of knowing internal vs. external, if a call to identity fails because of a problem in EKS due to an AWS incident, is that really an internal failure? Seems like a "nice in theory" kind of idea.

travisgosselin May 5, 2022
Maintainer Author

It does seem like there might be times when it becomes hard to determine what exactly "internal" or "external" is. I wonder if this approach is used operationally to drive anything they have internally? That would be odd, as getting a 503 is no guarantee that it was used in that context rather than as a true service unavailable...

travisgosselin · 2022-05-05T01:09:44Z

travisgosselin
May 5, 2022
Maintainer Author

@jerelthompson I know this is a pattern your teams have used in the past with great success. Can you provide additional insight and perspective on what you think is the right approach here?

0 replies

mkokotovich · 2022-05-05T03:25:10Z

mkokotovich
May 5, 2022

Personally, I've been trained to not rely on different 5XX status codes for much and use the response content instead. So just always using 500 with a good response seems like the easiest choice.
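A minimal sketch of what "always 500 with a good response" might look like. The body fields (`title`, `detail`, `requestId`) are assumptions for illustration, not a shape defined by the API Standards.

```python
import json
from http import HTTPStatus
from typing import Tuple


def server_error_response(title: str, detail: str, request_id: str) -> Tuple[int, str]:
    """Build a 500 response whose body carries the useful information.

    The field names here are hypothetical; use whatever error shape
    your API contract already defines.
    """
    body = {"title": title, "detail": detail, "requestId": request_id}
    return HTTPStatus.INTERNAL_SERVER_ERROR.value, json.dumps(body)
```

The status code stays the same for every server error; anything a consumer or operator needs to correlate goes in the body.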

7 replies

ca0abinary Aug 25, 2022
Collaborator

Since using AWS CloudFront / Load Balancers I've become much much more sensitive to 502's. I wonder if looking at this through the lens of SRE makes sense. Here's a direct link to the AWS docs which I've found invaluable when hunting down problems in their cloud. load-balancer-troubleshooting -> load-balancer-http-error-codes

edit
For more clarity, I feel like there's a slightly unclear line when it comes to SaaS/PaaS gateways and APIs acting as gateways. In general, my recommendation for 500s would be to not have a body (for security reasons; 500s would be especially vulnerable to leaking information) and to pass along any downstream 500s. A non-gateway API probably doesn't need to worry about an exact 5xx unless we want to expose something to SRE (db connection down, etc.).

travisgosselin Aug 25, 2022
Maintainer Author

@ca0abinary very timely response, as we resolved this under a similar thought process in the working group discussion this morning. Much of the thinking was similar, looking at how infrastructure like AWS load balancers makes use of 500+ status codes and how our REST APIs should avoid that range. Additionally, avoiding complications around how those status codes are handled by infrastructure, especially within a service mesh and its interpretation of them, helps reduce conflict, confusion, and assumptions/expectations.

To be clear, this is not an indication that any API or deployable piece of code should not return such status codes, it simply indicates that from a REST-style API design and architecture and implementation perspective we do not use them... we would never expect them to appear in an OpenAPI spec for our REST APIs as an example. I'm sure we have many non-REST-style implementations that need to make good use of these status codes.

ca0abinary Aug 25, 2022
Collaborator

@travisgosselin very interesting! So what would be the mechanism for an API to communicate a specific internal failure such as lack of database connectivity that could arise from misconfiguration?

travisgosselin Aug 25, 2022
Maintainer Author

I don't believe a misconfiguration on an API, like the wrong database connection string, is something you'd communicate via a status code. If this happened and resulted in a 500 error, you may return information about it in the details of the 500 body, but as you say, that can be risky if it leaks information. Anything internal like that is likely not something whose details are returned through the REST API interface. As an API consumer, I don't really care to know the reasons for the 500 (you might think it would help you decide whether to retry, but it's often too ambiguous to make such fine-grained decisions; you should always retry a 500). You may of course have other mechanisms for reporting that information as an API producer or operator, such as internal endpoints in a cluster (like a Kubernetes readiness probe), logging context, etc.

Example: If I call an AWS Service API and it returns to me information that the database cannot connect, that would be odd - why would I be privy to such information? And would I do anything different knowing that?
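From the consumer's side, the point can be sketched as a retry loop that treats every 5xx the same way. `call_with_retry` and its `send` parameter are hypothetical names, and the backoff values are placeholders rather than recommendations.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> Tuple[int, str]:
    """Call send() up to `attempts` times, retrying on any 5xx status.

    The consumer never inspects which 5xx it received: 500 and 504
    lead to exactly the same decision.
    """
    status, body = send()
    for attempt in range(1, attempts):
        if status < 500:
            break
        # Exponential backoff between retries.
        time.sleep(base_delay * (2 ** (attempt - 1)))
        status, body = send()
    return status, body
```

A real client would also want to confirm the request is safe to repeat before retrying it.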

ca0abinary Aug 25, 2022
Collaborator

Great points! Maybe it's a concept better relegated to something like a health check endpoint?
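A health check along those lines might look like the following sketch. The `readiness` function, the check names, and the choice of 200/503 for this internal, non-REST endpoint are all assumptions for illustration.

```python
import json
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each named dependency check and summarize the results.

    This is an internal operator-facing endpoint, so reporting which
    dependency failed is acceptable here in a way it is not on the
    public REST interface.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed check.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"ready": status == 200, "checks": results})
```

For example, a `"database"` check that fails to connect would make the probe report not ready, without any of that detail reaching API consumers.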

nickclarity · 2022-05-12T15:15:51Z

nickclarity
May 12, 2022
Collaborator

I have a few thoughts.

502s, 504s

It seems like 502 and 504 go hand in hand. So if we are committed to using one or the other I think we should consider using both.

Context

The context in which we are using these error codes might be important. I would make a distinction between a data API that has dependencies and an API that we are actually using as a gateway or a proxy. An example of the latter might be something like the backend for frontend (BFF) pattern. We might use this in scenarios where we have a client-side application whose identity management we want to delegate to a backend service. In those cases, we can either proxy the API request to an actual data API or act as a gateway, getting data from several data APIs and aggregating it into a view model.

However, if we are making a call directly to a data API and that data API depends on another data API to complete its action, that may not constitute a 502 or 504 in my opinion. A scenario might be the DEX API depends on NCS service to create a new Document entity. If the NCS service is unavailable DEX can't create a Document. The DEX API isn't returning NCS data nor acting as a gateway or proxy for NCS data in any way. Despite the DEX API having dependencies, it is the intended target/authority of the request for a Document entity. In this case, I think it would make sense to return a 500.
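The distinction drawn here can be sketched as a decision based on the role the API plays for a given request. `Role`, `DependencyFailure`, and `status_for` are hypothetical names used only to illustrate the reasoning.

```python
from enum import Enum


class Role(Enum):
    TARGET = "target"    # the API is the intended authority (e.g. DEX creating a Document)
    GATEWAY = "gateway"  # the API proxies or aggregates other APIs (e.g. a BFF)


class DependencyFailure(Enum):
    INVALID_RESPONSE = "invalid_response"
    TIMEOUT = "timeout"


def status_for(role: Role, failure: DependencyFailure) -> int:
    """Pick a status code for a dependency failure based on the API's role."""
    if role is Role.GATEWAY:
        # 502 and 504 go hand in hand for a true gateway or proxy.
        return 502 if failure is DependencyFailure.INVALID_RESPONSE else 504
    # The intended target owns the failure, whatever its dependencies did.
    return 500
```

In the DEX example, an unavailable NCS service would map to `status_for(Role.TARGET, DependencyFailure.TIMEOUT)`, which is 500.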

Implementation Details

I think this also answers the question related to "exposing implementation details", in that it's fine to return a 504 from a BFF because we know that it isn't the intended target.

503s

For a 503, I feel like if your API is in a position to respond with a 503, that is, if it's a choice you are making in code at runtime, then it likely doesn't fit the definition of a 503. Returning a 503 might increase confusion rather than help to relieve it.

5 replies

travisgosselin May 16, 2022
Maintainer Author

Nice, thanks @nickclarity . To clarify your perspective (making some statements):

Context

A gateway is only something that proxies data through it for example. This means that any REST API should not return status codes that indicate it as a Gateway / Proxy. So your position is that we should not add anything beyond a 500 to the existing standards?

503

Well put!

nickclarity May 16, 2022
Collaborator

I am not familiar enough with all 5xx codes to make a blanket statement such as:

we should not add anything beyond a 500 to the existing standards

But yes, I think for our REST standards we should not include a 502 or 504. If there are instances where one of our APIs is acting as a gateway/proxy, then I think we should probably reconsider whether that is necessary. That might be something we should move up to the application layer somewhere.

That being said, we may want to add some guardrails around REST APIs vs. "Backend APIs" that only support a single consumer: their front end. In the BFF example I described above, there is no reason the backend can't proxy some requests and also be the intended target for others (implying that 502/504 might be a reasonable response on some endpoints but not others). A guardrail might be "if your API has use cases/business logic or exposes any entities, you should consider moving that to a dedicated REST API".

travisgosselin May 17, 2022
Maintainer Author

Let's zoom in on your example statement:

If your API has use cases/business logic or exposes any entities, you should consider moving that to a dedicated REST API.

I'm not sure I understand the connection (or perhaps terminology) that says "exposed entities" != REST API? Is the implication that an "exposed entity" is purely a representation of a data model rather than a REST-style resource?

nickclarity May 18, 2022
Collaborator

I think somewhere in my original answer is something that makes sense. However, I might have confused myself in the process 😆.

I think I was making an unnecessary distinction between different "types" of services despite them both being RESTful. So contradicting myself again (I think for the 3rd time):

But yes, I think for our REST standards we should not include a 502 or 504

☝️ this statement doesn't really make sense given my example. If both "types" are RESTful then clearly in my example I'm saying we should include 502, 504.

Reading back, I think my original motivation for making the distinction was in response to one of @travisgosselin's answers:

to indicate different types of dependency failures (external or internal)

I agree with @mkokotovich that this seems like a misuse of the codes. I guess the rest of my thought process might come down to implementation details. I would think, however, that it would be rare for any of our services to actually need to return a 502 or 504. If we use them as described in the Mozilla docs (for actual proxy or gateway requests), I would be surprised if this were common among any of our services.

travisgosselin May 19, 2022
Maintainer Author

Gotcha - ya it feels like it comes down to your interpretation of the terms gateway and proxy as indicators if we should use them.
