-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ambiguity around which connection errors are retriable #171
Comments
➤ Automation for Jira commented: The link to the corresponding Jira issue is https://ably.atlassian.net/browse/SDK-3927 |
Internal discussion: https://github.com/ably/realtime/issues/5283 |
RTN14d says that a transport error should be considered recoverable if it is > a network failure, a timeout such as RTN14c, or a disconnected > response, other than a token failure RTN14b) However, it does not define what it means by a "network failure", leading to each platform having to provide its own interpretation. In particular, a client recently told us that they’ve been seeing lots of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate some sort of failure to perform an SSL handshake. We don’t understand the cause of this issue but we noticed that it was causing the client to transition to the FAILED state. Speaking to Paddy, he said that this error should be provoking a connection retry, not failure. And more broadly, he indicated that, basically, _all_ transport errors should be considered recoverable. So, that’s what we do here. He’s raised a specification issue [1] for us to properly specify this and to decide if there are any nuances to consider, but is keen for us to implement this broad behaviour in ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort errors. Resolves #1817. [1] ably/specification#171
This should definitely be improved, yes. The spec is clearer about what failures should be retried to a fallback host: RSC15l, including RSC15l1: "host unresolvable or unreachable". I'd have thought TLS issues definitely fall under that. (That's a spec item about the rest api but is incorporated into the realtime api by RTN17f, which adds DISCONNECTED with 5xx). The set of failures that should be retried to a fallback host is by definition a subset of the set of failures that should be retried. So perhaps RTN14d should reference RTN17f, and additionally list any conditions that should result in a retry but without changing the endpoint endpoint (which I guess think is just 4xx disconnecteds?) |
So we're clear that the philosophy intended in the spec, perhaps able to be improved, is correct: If Ably specifically gives you an error response, and that error implies that you shouldn't retry, then you respect that. In every other situation, you retry. So:
This will mean that some connection problems (such as the TLS protocol version incompatibility example above) will give rise to indefinite doomed retries, but this is an accepted penalty for giving ourselves the best chance of connecting after transient errors. There are also some responses from non-realtime elements of the infrastructure that we want clients to respect, such as rate-limiting responses from the WAF, but there might be others. We need to decide how to address these in the spec. |
RTN14d says that a transport error should be considered recoverable if it is > a network failure, a timeout such as RTN14c, or a disconnected > response, other than a token failure RTN14b) However, it does not define what it means by a "network failure", leading to each platform having to provide its own interpretation. In particular, a client recently told us that they’ve been seeing lots of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate some sort of failure to perform an SSL handshake. We don’t understand the cause of this issue but we noticed that it was causing the client to transition to the FAILED state. Speaking to Paddy, he said that this error should be provoking a connection retry, not failure. And more broadly, he indicated that, basically, _all_ transport errors should be considered recoverable. So, that’s what we do here. He’s raised a specification issue [1] for us to properly specify this and to decide if there are any nuances to consider, but is keen for us to implement this broad behaviour in ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort errors. Resolves #1817. [1] ably/specification#171
@paddybyers I think I mentioned in our Slack chat that it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake? |
(@paddybyers when I mentioned this to you, you rephrased the logic in terms of WebSocket API |
I attempted to reflect that in what I wrote above, "or is a ProtocolMessage containing an error". If realtime is trying to tell you something, and you're a ws connection, then it does it by sending a ProtocolMessage. If you get a ws-level event ( A A A |
We already have spec points that describe expected behaviour upon receiving |
Without trawling through the detail now, I think the answer is:
|
RTN14d says that a transport error should be considered recoverable if it is > a network failure, a timeout such as RTN14c, or a disconnected > response, other than a token failure RTN14b) However, it does not define what it means by a "network failure", leading to each platform having to provide its own interpretation. In particular, a client recently told us that they’ve been seeing lots of OSStatus 9806 (errSSLClosedAbort) errors. This appears to indicate some sort of failure to perform an SSL handshake. We don’t understand the cause of this issue but we noticed that it was causing the client to transition to the FAILED state. Speaking to Paddy, he said that this error should be provoking a connection retry, not failure. And more broadly, he indicated that, basically, _all_ transport errors should be considered recoverable. So, that’s what we do here. He’s raised a specification issue [1] for us to properly specify this and to decide if there are any nuances to consider, but is keen for us to implement this broad behaviour in ably-cocoa ASAP to help the customer experiencing the errSSLClosedAbort errors. Resolves #1817. [1] ably/specification#171
A recent issue in ably-cocoa highlighted the problem that a specific TLS-related connection error response was being treated as fatal instead of retriable. If this error occurs in a connection attempt, the connection would then move to
failed
.However, there are valid arguments for some TLS failures not being retried. Suppose for example that a platform only supports TLS 1.1, and the Ably endpoint only admits TLS 1.2+, then you're going to get a TLS handshake error. If you keep retrying in that situation, then it's still never going to succeed in connecting. So there might be an argument to detect that specific case, and not attempt retrying; however, the risk with that is that you also exclude things that you really do want to retry, which is the case we have in the issue linked. So we have to err on the side of retrying in any situation where the possible underlying cause is ambiguous, and then we have to try to eliminate the possibility of things like protocol incompatibility by other means.
Similar arbitrary decisions as to what is retried are made in ably-java. The spec is currently ambiguous about what errors should be retried. The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.
The text was updated successfully, but these errors were encountered: