Ambiguity around which connection errors are retriable #171

paddybyers · 2023-11-04T08:28:14Z

A recent issue in ably-cocoa highlighted the problem that a specific TLS-related connection error response was being treated as fatal instead of retriable. If this error occurs in a connection attempt, the connection would then move to failed.

However, there are valid arguments for some TLS failures not being retried. Suppose for example that a platform only supports TLS 1.1, and the Ably endpoint only admits TLS 1.2+, then you're going to get a TLS handshake error. If you keep retrying in that situation, then it's still never going to succeed in connecting. So there might be an argument to detect that specific case, and not attempt retrying; however, the risk with that is that you also exclude things that you really do want to retry, which is the case we have in the issue linked. So we have to err on the side of retrying in any situation where the possible underlying cause is ambiguous, and then we have to try to eliminate the possibility of things like protocol incompatibility by other means.

Similar arbitrary decisions as to what is retried are made in ably-java. The spec is currently ambiguous about what errors should be retried. The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.

The text was updated successfully, but these errors were encountered:

sync-by-unito · 2023-11-04T08:31:57Z

➤ Automation for Jira commented:

The link to the corresponding Jira issue is https://ably.atlassian.net/browse/SDK-3927

paddybyers · 2023-11-04T08:38:43Z

Internal discussion: https://github.com/ably/realtime/issues/5283

paddybyers · 2023-11-04T08:39:06Z

cc @lawrence-forooghian

RTN14d says that a transport error should be considered recoverable if it is > a network failure, a timeout such as RTN14c, or a disconnected > response, other than a token failure RTN14b) However, it does not define what it means by a "network failure", leading to each platform having to provide its own interpretation. In particular, a client recently told us that they’ve been seeing lots of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate some sort of failure to perform an SSL handshake. We don’t understand the cause of this issue but we noticed that it was causing the client to transition to the FAILED state. Speaking to Paddy, he said that this error should be provoking a connection retry, not failure. And more broadly, he indicated that, basically, _all_ transport errors should be considered recoverable. So, that’s what we do here. He’s raised a specification issue [1] for us to properly specify this and to decide if there are any nuances to consider, but is keen for us to implement this broad behaviour in ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort errors. Resolves #1817. [1] ably/specification#171

SimonWoolf · 2023-11-06T14:32:40Z

The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.

This should definitely be improved, yes. The spec is clearer about what failures should be retried to a fallback host: RSC15l, including RSC15l1: "host unresolvable or unreachable". I'd have thought TLS issues definitely fall under that. (That's a spec item about the rest api but is incorporated into the realtime api by RTN17f, which adds DISCONNECTED with 5xx).

The set of failures that should be retried to a fallback host is by definition a subset of the set of failures that should be retried. So perhaps RTN14d should reference RTN17f, and additionally list any conditions that should result in a retry but without changing the endpoint endpoint (which I guess think is just 4xx disconnecteds?)

paddybyers · 2023-11-06T15:33:54Z

So we're clear that the philosophy intended in the spec, perhaps able to be improved, is correct:

If Ably specifically gives you an error response, and that error implies that you shouldn't retry, then you respect that. In every other situation, you retry. So:

any error response that is not from an Ably realtime endpoint (which is determined by having an Ably error code, either in the response body or a header, or is a ProtocolMessage containing an error), is retriable;
an error response that is from an Ably endpoint:
- is a token error if it has a code in the token range, and should be retried after a token renewal sequence;
- otherwise, if it has a 4xx status code, is non-retriable
- otherwise, is retriable.

This will mean that some connection problems (such as the TLS protocol version incompatibility example above) will give rise to indefinite doomed retries, but this is an accepted penalty for giving ourselves the best chance of connecting after transient errors.

There are also some responses from non-realtime elements of the infrastructure that we want clients to respect, such as rate-limiting responses from the WAF, but there might be others. We need to decide how to address these in the spec.

RTN14d says that a transport error should be considered recoverable if it is > a network failure, a timeout such as RTN14c, or a disconnected > response, other than a token failure RTN14b) However, it does not define what it means by a "network failure", leading to each platform having to provide its own interpretation. In particular, a client recently told us that they’ve been seeing lots of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate some sort of failure to perform an SSL handshake. We don’t understand the cause of this issue but we noticed that it was causing the client to transition to the FAILED state. Speaking to Paddy, he said that this error should be provoking a connection retry, not failure. And more broadly, he indicated that, basically, _all_ transport errors should be considered recoverable. So, that’s what we do here. He’s raised a specification issue [1] for us to properly specify this and to decide if there are any nuances to consider, but is keen for us to implement this broad behaviour in ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort errors. Resolves #1817. [1] ably/specification#171

lawrence-forooghian · 2023-11-06T15:56:05Z

@paddybyers I think I mentioned in our Slack chat that it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake?

lawrence-forooghian · 2023-11-06T15:57:08Z

(@paddybyers when I mentioned this to you, you rephrased the logic in terms of WebSocket API close and error events.)

paddybyers · 2023-11-06T16:30:49Z

it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake?

I attempted to reflect that in what I wrote above, "or is a ProtocolMessage containing an error".

If realtime is trying to tell you something, and you're a ws connection, then it does it by sending a ProtocolMessage.

If you get a ws-level event (close, error), without having received a prior ProtocolMessage, then you have to treat that as a retriable error.

A ProtocolMessage whose action is error, and that does not have a channel, is treated as a connection-level error indication, and the rules above apply, ie if msg.error.statusCode is 4xx then it's non-retriable.

A ProtocolMessage whose action is error and does have a channel, is treated as a channel-level error indication.

A ProtoclMessage whose action is disconnected is by definition retriable.

lawrence-forooghian · 2023-11-06T16:34:22Z

A ProtocolMessage whose action is error, and that does not have a channel, is treated as a connection-level error indication, and the rules above apply, ie if msg.error.statusCode is 4xx then it's non-retriable.

A ProtocolMessage whose action is error and does have a channel, is treated as a channel-level error indication.

A ProtoclMessage whose action is disconnected is by definition retriable.

We already have spec points that describe expected behaviour upon receiving ERROR and DISCONNECTED ProtocolMessages. Do you believe that the spec needs expanding in this sense beyond what's already there?

paddybyers · 2023-11-06T17:12:52Z

We already have spec points that describe expected behaviour upon receiving ERROR and DISCONNECTED ProtocolMessages. Do you believe that the spec needs expanding in this sense beyond what's already there?

Without trawling through the detail now, I think the answer is:

the handling of disconnections and declined connections, in the case that there is a ProtocolMessage, is well defined;
the possible ambiguity of what do do with an error or close event, when there was no ProtocolMessage, is what we need to check

RTN14d says that a transport error should be considered recoverable if it is > a network failure, a timeout such as RTN14c, or a disconnected > response, other than a token failure RTN14b) However, it does not define what it means by a "network failure", leading to each platform having to provide its own interpretation. In particular, a client recently told us that they’ve been seeing lots of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate some sort of failure to perform an SSL handshake. We don’t understand the cause of this issue but we noticed that it was causing the client to transition to the FAILED state. Speaking to Paddy, he said that this error should be provoking a connection retry, not failure. And more broadly, he indicated that, basically, _all_ transport errors should be considered recoverable. So, that’s what we do here. He’s raised a specification issue [1] for us to properly specify this and to decide if there are any nuances to consider, but is keen for us to implement this broad behaviour in ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort errors. Resolves #1817. [1] ably/specification#171

paddybyers added the bug Something isn't working label Nov 4, 2023

paddybyers mentioned this issue Nov 6, 2023

[SDK-3926] Treat all transport errors as recoverable ably/ably-cocoa#1823

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ambiguity around which connection errors are retriable #171

Ambiguity around which connection errors are retriable #171

paddybyers commented Nov 4, 2023

sync-by-unito bot commented Nov 4, 2023

paddybyers commented Nov 4, 2023

paddybyers commented Nov 4, 2023

SimonWoolf commented Nov 6, 2023

paddybyers commented Nov 6, 2023

lawrence-forooghian commented Nov 6, 2023 •

edited

Loading

lawrence-forooghian commented Nov 6, 2023

paddybyers commented Nov 6, 2023

lawrence-forooghian commented Nov 6, 2023

paddybyers commented Nov 6, 2023

Ambiguity around which connection errors are retriable #171

Ambiguity around which connection errors are retriable #171

Comments

paddybyers commented Nov 4, 2023

sync-by-unito bot commented Nov 4, 2023

paddybyers commented Nov 4, 2023

paddybyers commented Nov 4, 2023

SimonWoolf commented Nov 6, 2023

paddybyers commented Nov 6, 2023

lawrence-forooghian commented Nov 6, 2023 • edited Loading

lawrence-forooghian commented Nov 6, 2023

paddybyers commented Nov 6, 2023

lawrence-forooghian commented Nov 6, 2023

paddybyers commented Nov 6, 2023

lawrence-forooghian commented Nov 6, 2023 •

edited

Loading