Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection resume/recover #167

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
21 changes: 11 additions & 10 deletions textile/features.textile
Original file line number Diff line number Diff line change
Expand Up @@ -502,16 +502,17 @@ h3(#realtime-connection). Connection
*** @(RTN15h2)@ If the @DISCONNECTED@ message contains a token error (@statusCode@ value of 401 and error @code@ value in the range @40140 <= code < 40150@) and the library has the means to renew the token, a single attempt to create a new token should be made and a new connection attempt initiated using the new token. If the token creation fails or the next connection attempt fails due to a token error, the connection will transition to the @DISCONNECTED@ state and the @Connection#errorReason@ will be set.
** @(RTN15i)@ If an @ERROR@ @ProtocolMessage@ is received, this indicates a fatal error in the connection. The server will close the transport immediately after. The client should transition to the @FAILED@ state triggering all attached channels to transition to the @FAILED@ state as well. Additionally the @Connection#errorReason@ should be set with the error received from Ably
** @(RTN15a)@ If a @Connection@ transport is disconnected unexpectedly or because a token has expired, then the @Connection@ manager will immediately attempt to reconnect and restore the connection state. Connection state recovery is provided by the Ably service and ensures that whilst the client is disconnected, all events are queued and channel state is retained on the Ably servers. When a new connection is made with the correct connection recovery key, the client is able to catch up by receiving the queued @ProtocolMessages@ from Ably.
** @(RTN15g)@ Connection state is only maintained server-side for a brief period, given by the @connectionStateTtl@ in the @connectionDetails@, see "CD2f":#CD2f. If a client has been disconnected for longer than the @connectionStateTtl@, it should not attempt to resume. Instead, it should clear the local connection state, and any connection attempts should be made as for a fresh connection
** @(RTN15g)@ Connection state is only maintained server-side for a brief period, given by the @connectionStateTtl@ in the @connectionDetails@, see "CD2f":#CD2f. If a client has been disconnected for longer than the @connectionStateTtl@, it should not attempt to resume. Instead, it should clear the local connection state (clear `connectionKey` and `Id`), and any connection attempts should be made as for a fresh connection.
*** @(RTN15g1)@ This check should be made before each connection attempt. It is generally not sufficient to merely clear the connection state when moving to @SUSPENDED@ state (though that may be done too), since the device may have been sleeping / suspended, in which case it may have been many hours since it was last actually connected, even though, having been in the @CONNECTED@ state when it was put to sleep, it has only moved out of that state very recently (after waking up and noticing it's no longer connected)
*** @(RTN15g2)@ Another consequence of that is that the measure of whether the client been disconnected for too long (for the purpose of this check) cannot just be whether the client left the @CONNECTED@ state more than @connectionStateTtl@ ago. Instead, it should be whether the difference between the current time and the last activity time is greater than the sum of the @connectionStateTtl@ and the @maxIdleInterval@, where the last activity time is the time of the last known actual sign of activity from Ably per "RTN23a":#RTN23a
*** @(RTN15g3)@ When a connection attempt succeeds after the connection state has been cleared in this way, channels that were previously @ATTACHED@, @ATTACHING@, or @SUSPENDED@ must be automatically reattached, just as if the connection was a resume attempt which failed per "RTN15c7":#RTN15c7
*** @(RTN15g2)@ Another consequence of that is that the measure of whether the client been disconnected for too long (for the purpose of this check) cannot just be whether the client left the @CONNECTED@ state more than @connectionStateTtl@ ago. Instead, it should be whether the difference between the current time and the last activity time (recorded timestamp for last received protocolMessage) is greater than the sum of the @connectionStateTtl@ and the @maxIdleInterval@, where the last activity time is the time of the last known actual sign of activity from Ably per "RTN23a":#RTN23a
sacOO7 marked this conversation as resolved.
Show resolved Hide resolved
*** @(RTN15g3)@ When a connection attempt succeeds after the connection state has been cleared in this way, channels that were previously @ATTACHED@, @ATTACHING@, or @SUSPENDED@ must be automatically reattached, just as if the connection was a resume attempt which failed per "RTN15c8":#RTN15c8
** @(RTN15b)@ In order for a connection to be resumed and connection state to be recovered, the client must have received a @CONNECTED@ ProtocolMessage which will include a private connection key. To resume that connection, the library reconnects to the "websocket":https://ably.com/topic/websockets endpoint with an additional querystring param:
*** @(RTN15b1)@ @resume@ is the @ProtocolMessage#connectionKey@ from the most recent @CONNECTED@ @ProtocolMessage@ received
*** @(RTN15b2)@ This clause has been deleted. It was valid up to and including specification version @1.2@.
** @(RTN15c)@ The system's response to a resume request will be one of the following:
**** @(RTN15c6)@ A @CONNECTED@ @ProtocolMessage@ with the same @connectionId@ as the current client (and no @error@ property). This indicates that the resume attempt was valid. The client library should move all channels that were in the @ATTACHING@, @ATTACHED@, or @SUSPENDED@ states to the @ATTACHING@ state, and initiate an @RTL4c@ attach sequence for each. The connection should also process any messages queued per @RTL6c2@ (there is no need to wait for the attaches to finish before processing queued messages).
**** @(RTN15c7)@ @CONNECTED@ @ProtocolMessage@ with a new @connectionId@ and an @ErrorInfo@ in the @error@ field. In this case, the resume was invalid, and the error indicates the cause. The @error@ should be set as the @reason@ in the @CONNECTED@ event, and as the @Connection#errorReason@. The internal @msgSerial@ counter should be reset so that the first message published to Ably will contain a @msgSerial@ of @0@. The rest of the process is the same as for @RTN16c6@: The client library should move all channels that were in the @ATTACHING@, @ATTACHED@, or @SUSPENDED@ states to the @ATTACHING@ state, and initiate an @RTL4c@ attach sequence for each. The connection should also process any messages queued per @RTL6c2@.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there was no resume attempt as a part of suspended retry, are we supposed to treat is as resume failed? That means we need to reset msgSerial in this case too 🤔

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we will get a new connectionId but there won't be an error in the errorField. I think this third case is coming for almost every other spec that mentions just a dual behavior for resume success/failure.

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resume failure should be new connectionId != prev connectionId. We shouldn't have extra checks for errors as such. So, if there is no resume attempt (null key ), it will be treated as a resume failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there was no resume attempt as a part of suspended retry, are we supposed to treat is as resume failed? That means we need to reset msgSerial in this case too 🤔

RTN15g: "If a client has been disconnected for longer than the @connectionStateTtl@, it should not attempt to resume. Instead, it should clear the local connection state, and any connection attempts should be made as for a fresh connection" -- the msgSerial is part of the connection state, so it will have been reset in this case. (Perhaps the spec should explicitly list the properties that should be cleared, for clarity?)

(see also RTN15g1 which explains why RTN15g is strictly stronger than just doing this on going into SUSPENDED)

Also, we will get a new connectionId but there won't be an error in the errorField. I think this third case is coming for almost every other spec that mentions just a dual behavior for resume success/failure.

There's two cases. Either connection state (including connectionKey and msgSerial) is still there, in which case the library will try and do a resume, and the possible cases are listed in RTN15c. Or the connection state has been cleared, including the msgSerial, in which case the library is just doing a clean connection. The sdk never stops trying to resume but keeps its msgSerial.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** @(RTN15c)@ The system's response to a resume request (with a non-empty connectionKey) will be one of the following:
**** @(RTN15c6)@ A @CONNECTED@ @ProtocolMessage@ with the same @connectionId@ as the current client (and no @error@ property). This indicates that the resume attempt was valid.
**** @(RTN15c7)@ @CONNECTED@ @ProtocolMessage@ with a new @connectionId@ and an @ErrorInfo@ in the @error@ field. In this case, the resume was invalid, and the error indicates the cause. The @error@ should be set as the @reason@ in the @CONNECTED@ event, and as the @Connection#errorReason@. The internal @msgSerial@ counter should be reset so that the first message published to Ably will contain a @msgSerial@ of @0@.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spec doesn't mention about value of channel Flag.resumed after channel attach @SimonWoolf ?

**** @(RTN15c8)@ Irrespective of success/failure for a resume request, the client library should move all channels that were in the @ATTACHING@, @ATTACHED@, or @SUSPENDED@ states to the @ATTACHING@ state, and initiate an @RTL4c@ attach sequence for each. The connection should also process any messages queued per @RTL6c2@ (there is no need to wait for the attaches to finish before processing queued messages).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this item is a child of a spec item saying "The system's response to a resume request (with a non-empty connectionKey) will be one of the following:". Every other child is one of the four things that can happen:

(RTN15c) The system’s response to a resume request will be one of the following:
↳ CONNECTED, same connId
↳ CONNECTED, new connId
↳ token ERROR
↳ other ERROR

Adding an item in the middle that's actually is a shared spec item for the first two of the items doesn't make sense

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, though the first 2 do exhibit a common behavior for attaching channels and processing pending messages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove this if not needed. Though I am not sure if resume failure applies here too as per #167 (comment)

**** @(RTN15c5)@ @ERROR@ @ProtocolMessage@ indicating a failure to authenticate as a result of a token error (see "RTN15h":#RTN15h). The transport will be closed by the server. The spec described in "RTN15h":#RTN15h must be followed for a connection being resumed with a token error
**** @(RTN15c4)@ Any other @ERROR@ @ProtocolMessage@ indicating a fatal error in the connection. The server will close the transport immediately after. The client should transition to the @FAILED@ state triggering all attached channels to transition to the @FAILED@ state as well. Additionally the @Connection#errorReason@ will be set should be set with the error received from Ably
**** @(RTN15c1)@ This clause has been replaced by "@RTN15c6@":#RTN15c6. It was valid up to and including specification version @1.2@.
Expand All @@ -535,7 +536,7 @@ h3(#realtime-connection). Connection
** @(RTN16d)@ The library may wish to test that after a connection has been successfully recovered, the @Connection#id@ should be identical to the @id@ of the connection that was recovered, and @Connection#key@ should have been updated to the @ConnectionDetails#connectionKey@ provided in the @CONNECTED@ @ProtocolMessage@.
** @(RTN16m)@ @Connection#recoveryKey@ is a deprecated property that, when read, contains the same value that would be returned by calling @Connection#createRecoveryKey@. If the implementation languague allows, it should be implemented in a way that does not require that it be recalculated on every message. If this is not possible, it should be recalculated on every incoming message.
*** @(RTN16m1)@ @Connection#recoveryKey@ must be removed on the next major release of the SDK.
** @(RTN16l)@ Recovery failures should be handled identically to resume failures, per "RTN15c7":#RTN15c7, "RTN15c5":#RTN15c5, and "RTN15c4":#RTN15c4.
** @(RTN16l)@ Recovery success/failures should be handled identically to resume, per "RTN15c8":#RTN15c8, "RTN15c5":#RTN15c5, and "RTN15c4":#RTN15c4.
** @(RTN16a)@ This clause has been replaced by "@RTN16i@":#RTN16i. It was valid up to and including specification version @1.2@.
** @(RTN16b)@ This clause has been replaced by "@RTN16g@":#RTN16g and "@RTN16m@":#RTN16m. It was valid up to and including specification version @1.2@.
** @(RTN16c)@ This clause has been replaced by "@RTN16g2@":#RTN16g2. It was valid up to and including specification version @1.2@.
Expand All @@ -552,9 +553,9 @@ h3(#realtime-connection). Connection
*** @(RTN17f1)@ a @DISCONNECTED@ response with an @error.statusCode@ in the range @500 <= code <= 504@
** @(RTN17e)@ If the realtime client is connected to a fallback host endpoint, then for the duration that the transport is connected to that host, all HTTP requests, such as history or token requests, should be first attempted to the same datacenter the realtime connection is established with i.e. the same fallback host must be used as the default HTTP request host. If however the HTTP request against that fallback host fails, then the normal fallback host behavior should be followed attempting the request against another fallback host as described in "RSC15":#RSC15
* @(RTN19)@ Transport state side effects - when a transport is disconnected for any reason:
** @(RTN19a)@ Any @ProtocolMessage@ that is awaiting an @ACK@/@NACK@ on the old transport will not receive the @ACK@/@NACK@ on the new transport. The client library must therefore resend any @ProtocolMessage@ that is awaiting a @ACK@/@NACK@ to Ably in order to receive the expected @ACK@/@NACK@ for that message (subject to @RTN7c@/@RTN7d@). The Ably service is responsible for keeping track of messages, ignoring duplicates and responding with suitable @ACK@/@NACK@ messages
*** @(RTN19a1)@ One possible implementation of this requirement would be to add any in-flight messages to the @RTL6c2@ connection-wide queue of messages that will be sent once the connection next becomes @CONNECTED@
*** @(RTN19a2)@ In the case of an @RTN15c6@ successful resume, the @msgSerial@ of the reattempted @ProtocolMessage@s should remain the same as for the original attempt. In the case of an @RTN15c7@ failed resume, the message must be assigned a new @msgSerial@ from the SDK's internal counter.
** @(RTN19a)@ Any @ProtocolMessage@ that is awaiting an @ACK@/@NACK@ on the old transport will not receive the @ACK@/@NACK@ on the new transport (failed resume as per @RTN15c7@). The client library must therefore resend any @ProtocolMessage@ that is awaiting a @ACK@/@NACK@ to Ably in order to receive the expected @ACK@/@NACK@ for that message (subject to @RTN7c@/@RTN7d@). The Ably service is responsible for keeping track of messages, ignoring duplicates and responding with suitable @ACK@/@NACK@ messages
sacOO7 marked this conversation as resolved.
Show resolved Hide resolved
*** @(RTN19a1)@ One possible implementation of this requirement would be to add those @ACK@/@NACK@ messages to the @RTL6c2@ connection-wide queue of messages that will be sent once the connection becomes @CONNECTED@.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we're not adding ack/nacks to the queue, the sdk never generates or sends an ack or nack

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant Ack.Nack contained messages, not ack itself

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able to understand in-flight messages at all : (
It should explicitly mention type of mesages we are adding to the queue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant Ack.Nack contained messages

I don't understand what you mean by this

The types of messages that would be requeued are ones that the library is waiting to hear an ACK or NACK from the server for, which is MESSAGE and PRESENCE protocolmessages per RTN7a

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we should mention sending MESSAGE and PRESENCE type messages associated with acks/nacks.

*** @(RTN19a2)@ In the case of an @RTN15c6@ successful resume, the @msgSerial@ of the reattempted @ProtocolMessage@s remains unchanged. In the case of an @RTN15c7@ failed resume, each message must be assigned a new incremental @msgSerial@ from the Connection#MessageSerial.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remains unchanged

why the change to the present tense? that sentence is saying "don't change it in this case (only in the other case)"

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we should change it to

@(RTN19a2)@ Only in case of new connection Id, each message must be assigned a new incremental @msgSerial@ from the Connection#MessageSerial. This should automatically get covered, when pending connection queue messages (added at RTN19a1) are processed.

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, Correct me if I am wrong, server connectionState doesn't maintain pending ack/nack on the server. It just maintains pending messages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only in case of new connection Id

I'd really prefer it refer back to the spec item that defines the case. The relevant principle is DRY (Don't Repeat Yourself): if you define the same requirement in multiple places, with slightly different wording each time, the same logic gets implemented multiple times, and then if that needs changing it's way too easy to miss a place it needs to be changed in.

Okay, Correct me if I am wrong, server connectionState doesn't maintain pending ack/nack on the server. It just maintains pending messages.

yes

** @(RTN19b)@ If there are any pending channels i.e. in the @ATTACHING@ or @DETACHING@ state, the respective @ATTACH@ or @DETACH@ message should be resent to Ably
* @(RTN23)@ Heartbeats
** @(RTN23a)@ If a transport does not receive any indication of activity on a transport for a period greater than the sum of the @maxIdleInterval@ (which will be sent in the @connectionDetails@ of the most recent @CONNECTED@ message received on that transport) and the @realtimeRequestTimeout@, that transport should be disconnected. Any message (or non-message indicator, see @RTN23b@) received counts as an indication of activity and should reset the timer, not merely heartbeat messages. However, it must be received (that is, sent from the server to the client); client-sent data does not count.
Expand Down