XEP-0198 Stream Management #14

Open
rufferson opened this issue Oct 11, 2020 · 5 comments · May be fixed by #16
Labels
enhancement New feature or request

Comments

@rufferson
Collaborator

Some design ideas for SM, based on a review of an early draft implementation (https://github.com/noonien-d/wocky/compare/0.18_XEP-0198_XEP-0352) and my own (unpublished) early attempt.

SM needs to live in the porter - as this is the place where stanzas are being queued/dequeued, sent and handled(!). "Handled" is the key word behind the SM h attribute:

Definition: Acknowledging a previously-received ack element indicates that the stanza(s) sent since then have been "handled" by the server. By "handled" we mean that the server has accepted responsibility for a stanza or stanzas (e.g., to process the stanza(s) directly, deliver the stanza(s) to a local entity such as another connected client on the same server, or route the stanza(s) to a remote entity at a different server); until a stanza has been affirmed as handled by the server, that stanza is the responsibility of the sender (e.g., to resend it or generate an error if it is never affirmed as handled by the server).

Therefore neither the XML Reader nor the XMPP Connection fits this definition for received stanzas.

The Connector initiates SM (sends enable) but then hands it over to the porter, which starts with zero counters. When the porter receives the enabled response it starts actively participating in SM (requesting/answering acks). The porter does need to know whether enable has been sent, though, so that it starts filling the sent queue.

The porter will need to move sent stanzas (not nonzas!) from the outgoing queue into a (new) sent queue, head-dropping the queue when it receives an ack.
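
A minimal sketch of that ack handling, assuming a GQueue-based sent queue; the sm_sent/sm_acked field names are purely illustrative, not final API:

```c
/* Sketch only: head-drop the sent queue when an SM <a h='N'/> ack arrives.
 * 'sm_sent' (a GQueue of WockyStanza*) holds stanzas in send order,
 * 'sm_acked' counts stanzas the server has already handled. */
static void
sm_handle_ack (WockyC2SPorter *self,
    guint64 handled)    /* the 'h' attribute from the <a/> nonza */
{
  while (self->sm_acked < handled && !g_queue_is_empty (self->sm_sent))
    {
      WockyStanza *sent = g_queue_pop_head (self->sm_sent);

      g_object_unref (sent);
      self->sm_acked++;
    }
}
```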

The Connector will also need to understand the resume vector (perhaps via reg_op). The resumption vector needs to be driven by the porter to avoid creating a new one. In case of stream termination (not connection interruption) the porter will just follow the standard teardown vector, but for read/write errors it will discard the connection and initiate the resumption vector.
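
Roughly how the two error paths could split in the porter's receive callback - a sketch only: terminate_stream()/start_resumption() are hypothetical helpers and the exact error codes may differ:

```c
static void
stanza_received_cb (GObject *source, GAsyncResult *res, gpointer user_data)
{
  WockyC2SPorter *self = WOCKY_C2S_PORTER (user_data);
  GError *error = NULL;
  WockyStanza *stanza = wocky_xmpp_connection_recv_stanza_finish (
      WOCKY_XMPP_CONNECTION (source), res, &error);

  if (stanza == NULL)
    {
      if (g_error_matches (error, WOCKY_XMPP_CONNECTION_ERROR,
              WOCKY_XMPP_CONNECTION_ERROR_CLOSED))
        terminate_stream (self);   /* orderly </stream:stream>: standard teardown */
      else
        start_resumption (self);   /* read/write error: discard and resume */

      g_clear_error (&error);
      return;
    }

  /* ... normal stanza dispatch ... */
}
```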

@rufferson
Collaborator Author

Keepalives

I've measured a blackout scenario (iptables -I OUTPUT -j DROP): after 17 min of pinging into the blackhole (whitespace or SM) the connection was finally terminated by the other side. That is unacceptable when we have a 5-10 min SM resumption window.

On the other hand, if the ban is lifted (rule removed) within that 17 min interval, TCP recovers and retransmits the SM or whitespace pings (and receives responses), so that is faster than resumption. Therefore we do not impose any timeout on heartbeats - if the network changes, the socket will receive an RST and close the connection actively, which will be a case for resumption.

Either way, when SM is enabled we need to use an SM ping instead of whitespace (it is still smaller than an iq), because we are doing selective acks and hence need to catch up on the latest acks when idle.
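
Sketched heartbeat selection; send_nonza() and the sm_enabled/connection fields are assumptions for illustration, and wocky_stanza_new is used here simply as a plain element constructor:

```c
static void
send_heartbeat (WockyC2SPorter *self)
{
  if (self->sm_enabled)
    {
      /* SM ping: an <r/> request also catches up outstanding acks */
      WockyStanza *r = wocky_stanza_new ("r", "urn:xmpp:sm:3");

      send_nonza (self, r);   /* nonzas bypass the SM counters/queues */
      g_object_unref (r);
    }
  else
    {
      wocky_xmpp_connection_send_whitespace_ping_async (self->connection,
          NULL, NULL, NULL);
    }
}
```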

@rufferson
Collaborator Author

Resumption

This is the trickiest part - not in terms of technical implementation but in terms of logical consistency. Ideally it should be managed from gabble - and most likely that's how it will be in the end - but for now I'm trying to localise it in wocky, and maybe it will remain a non-default option. The following challenges have been identified so far:

  • Resumption must happen after authentication but before binding: that means we need a properly initialised Connector (with Auth Registry, credentials, TLS handler, etc.),
  • Resumption failure should be non-fatal and the stream should continue with binding and new session allocation.

The former could be solved by making WockyConnector reusable and passing it to the porter as part of an SM context. The latter may be achieved by making the Porter reusable; however, that does not solve the next-tier problem - Gabble will also need to reset its state to reuse the porter and re-start initialisation (roster, etc.). Without interfering with Gabble (until it understands SM), the only way is to make resumption failure fatal.
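
A possible shape for that SM context (field names are purely illustrative):

```c
typedef struct {
  WockyConnector *connector;  /* kept alive so the resume vector can reuse it */
  gchar *sm_id;               /* id from <enabled/>, needed for <resume/> */
  guint64 rx_count;           /* stanzas we have handled (the 'h' we report) */
  guint64 tx_count;           /* stanzas we have sent since <enable/> */
  GQueue *sent;               /* sent-but-unacked stanzas, in order */
} WockySmContext;
```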

@rufferson rufferson self-assigned this Oct 19, 2020
@rufferson rufferson added the enhancement New feature or request label Oct 19, 2020
@rufferson
Collaborator Author

rufferson commented Nov 1, 2020

Resumption Timeout

During my experiments with a broken connection I realised that the timeout (specified by max) does not make any sense on the client side. It is of course needed on the server side (for cleanup), but it is totally useless at the client, hence communicating it back gives no benefit. The reason is that the moment at which the client or the server detects a connection problem is totally unpredictable, and that time is usually longer (15-20 min) than a conventional resume max (5-10 min). The client may detect the problem and start retrying while the server is still in TCP retransmit state; the opposite is also true. Hence if we start timing out after we detect stream EOF, we may run out of the resumption window on our side while the server may still be within its window. Moreover, for the client side a resumption failure is non-fatal, so it may simply proceed with bind, thus not losing much time on the resumption attempt.

On the other hand there's a UI problem: when the connection is suspended and trying to resume, it does not expose this crippled state to the higher level. I.e. to gabble the connection is still connected and you can still send stanzas - but those stanzas will be queued and will eventually time out. Hence it must sooner or later give up resuming and go to the disconnected state, and gabble should use its own stanza timeout value for that, rather than the communicated max value.

To summarise - for the client it does not make sense to honour the SM timeout for resumption: it will continuously attempt to re-connect and resume the connection when it runs in auto-resume (autonomous) mode. When the mode is managed (by gabble) it will time out whenever the upper layer thinks it's sensible.
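
In other words, the client-side retry policy boils down to something like this (resume_auto is a hypothetical flag distinguishing autonomous from managed mode):

```c
/* Sketch: the server-advertised 'max' is ignored on the client side. */
static gboolean
should_retry_resumption (WockyC2SPorter *self)
{
  if (self->resume_auto)
    return TRUE;    /* autonomous mode: keep retrying on every heartbeat */

  /* managed mode: gabble gives up via its own stanza timeout instead */
  return FALSE;
}
```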

The idea behind leaving auto mode in place (currently it's only used for testing until gabble becomes SM-aware) is for it to be used in headless/non-interactive applications.

@rufferson
Collaborator Author

rufferson commented Nov 14, 2020

The final draft

By final I mean that it looks complete enough to start building unit tests and gabble integration around it.

Instead of explicit flow control I've decided to implement assisted flow control via (interactive) signals. When the porter detects a connection failure it prepares for resumption, but right before calling the connector's resume vector it fires the resuming signal with the <resume/> nonza as a payload. If the signal returns from the handler with TRUE (the default) the porter calls wocky_connector_resume_async to start the resumption vector and unrefs the nonza; otherwise it just unrefs and returns. The expected behaviour here is that a signal handler returning FALSE will invoke the resumption vector on the connector itself (perhaps creating a new connector, retrieving new credentials, certs, etc.). To hand control back to the porter a new API call is added - wocky_c2s_porter_resume - which unlocks sending and pulls a stanza from the connection; that stanza is expected to be the <resumed/> nonza, which triggers connection resynchronisation (flushing the queues).
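
For illustration, a consumer taking over the resumption vector might look roughly like this; start_own_resumption() is a hypothetical application helper that eventually calls wocky_connector_resume_async and then wocky_c2s_porter_resume:

```c
static gboolean
resuming_cb (WockyC2SPorter *porter,
    WockyStanza *resume_nonza,
    gpointer user_data)
{
  /* Build a fresh connector, fetch new credentials/certs, drive the
   * resume vector ourselves and hand control back to the porter via
   * wocky_c2s_porter_resume() once <resumed/> is expected. */
  start_own_resumption (user_data, resume_nonza);

  return FALSE;   /* tell the porter not to run its built-in resume vector */
}

/* wired up once the porter exists (app is the consumer's context): */
g_signal_connect (porter, "resuming", G_CALLBACK (resuming_cb), app);
```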

The outcome of wocky_connector_resume_async has 3 states: connection failure, resumption failure and success. Success, as mentioned above, ends up in a wocky_c2s_porter_resume call. The resumption failure state means the connector has connected, encrypted, authenticated and sent <resume/>, but got a <failed><item-not-found/></failed> response. As per XEP-0198 this is a recoverable soft-fail state and the client may continue the normal connection handshake flow (resource bind and session establishment). At this point the porter (if it owns the resumption vector and was not interrupted by the resuming signal handler) will emit the resume-failed signal. This signal is also interactive and the handler may return FALSE to make the porter simply return from the callback without any action. The expected behaviour here is that the signal handler will discard the existing porter and build a new one; the callback does not do the clean-ups necessary to leave the porter in a reusable state, and there is no call to reuse the porter from this state anyway. If the intention is to reuse the porter - just leave it be and return TRUE from the handler. Once reconnected, the porter will emit notify:: events for the changed full-jid and connection properties and then emit the reconnected signal, which carries the new full JID and stream id as a payload.
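
Reusing the existing porter after such a soft failure would then look roughly like this; the signal signatures are assumed from the description above:

```c
static gboolean
resume_failed_cb (WockyC2SPorter *porter, gpointer user_data)
{
  /* keep the porter: it continues with bind/session and later emits
   * "reconnected" with the new full JID and stream id */
  return TRUE;
}

static void
reconnected_cb (WockyC2SPorter *porter,
    const gchar *full_jid,
    const gchar *stream_id,
    gpointer user_data)
{
  /* refresh any cached JID and re-run post-bind initialisation (roster, ...) */
}

g_signal_connect (porter, "resume-failed", G_CALLBACK (resume_failed_cb), NULL);
g_signal_connect (porter, "reconnected", G_CALLBACK (reconnected_cb), NULL);
```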

Since wocky_session also holds a copy of the full JID and a reference to the connection, it is made notify::-aware as well and updates its properties on that signal to stay in sync with the porter.

Finally, if wocky_connector_resume_async returns any other error (i.e. not NOT_FOUND, which indicates resume failure), the expected outcome is to keep [re]trying to connect. That's exactly what the porter does if it is left in auto-resume mode. To avoid a re-connection loop/storm, the wocky_c2s_porter_send_whitespace_ping_async call is modified to trigger a resumption retry if the connection is NULL, hence it will try to reconnect on each heartbeat (unless a reconnect is already pending). The same is expected from the upper level if a signal handler aborts the auto-resumption vector.
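
The heartbeat hook then degenerates to something like this, reusing the hypothetical helpers and fields from the sketches above:

```c
static void
heartbeat_cb (WockyC2SPorter *self)
{
  if (self->connection == NULL)
    {
      /* no connection: retry resumption once per heartbeat, no storm */
      if (!self->resume_pending)
        start_resumption (self);
      return;
    }

  send_heartbeat (self);   /* SM <r/> or whitespace ping, as above */
}
```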

There are two more signals which are emitted when wocky_c2s_porter begins and finishes resynchronisation.

The resumed signal is emitted from wocky_c2s_porter_resume after it triggers resynchronisation. The intended use is to update the UI and return it to the normal connected state, i.e. the user may start sending interactive queries and messages. Before that the user may also send messages, but the UI should make it clear that stanzas won't be delivered yet (only queued for delivery - which is expected to happen on the resuming signal).

The resume-done signal is emitted when all queues are fully flushed and there are no stale stanzas left to send. The intended (planned) use for this signal is to reset the stanza sending timeout to its normal (default) value - which also means that on the resuming signal this timeout will likely be raised to accommodate the longer delivery delay.
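
On the consumer side that maps to roughly the following; ui_set_state() and set_stanza_timeout() are hypothetical application helpers:

```c
static void
resumed_cb (WockyC2SPorter *porter, gpointer user_data)
{
  /* stream is back: interactive queries/messages are fine again */
  ui_set_state (user_data, UI_STATE_CONNECTED);
}

static void
resume_done_cb (WockyC2SPorter *porter, gpointer user_data)
{
  /* backlog flushed: restore the default stanza send timeout */
  set_stanza_timeout (user_data, DEFAULT_STANZA_TIMEOUT);
}

g_signal_connect (porter, "resumed", G_CALLBACK (resumed_cb), app);
g_signal_connect (porter, "resume-done", G_CALLBACK (resume_done_cb), app);
```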

@rufferson
Collaborator Author

A couple of comments after completing the gabble integration:

First, gabble uses channel managers (which are wrappers around WockyTLSHandler and WockySaslAuth) and those hold a back-reference to the GabbleConnection. That means reusing the original handlers from gabble is not viable - once the gabble connection is destroyed (due to reconnection) the old handlers stop working. To circumvent this, WockyConnector is modified to make the handlers CONSTRUCT instead of CONSTRUCT_ONLY properties, thus allowing their replacement from a new connection.
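
The GObject-level change amounts to installing the handler property with G_PARAM_CONSTRUCT rather than G_PARAM_CONSTRUCT_ONLY, e.g. for the TLS handler (nick/blurb strings are illustrative):

```c
g_object_class_install_property (object_class, PROP_TLS_HANDLER,
    g_param_spec_object ("tls-handler", "TLS handler",
        "Handler for TLS certificate verification",
        WOCKY_TYPE_TLS_HANDLER,
        G_PARAM_READWRITE | G_PARAM_CONSTRUCT | G_PARAM_STATIC_STRINGS));

/* ...which lets gabble swap it later:
 * g_object_set (connector, "tls-handler", new_handler, NULL); */
```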

Second, the resumed signal is moved from wocky_c2s_porter_resume to the wocky_porter_sm_handle call (the resumed handler section). This is because signal processing may (and actually does right now) trigger injection of additional stanzas into the stream (before the unacked queue is moved into the backlog), which then violates the in-order stanza processing requirement of RFC 6120. To maintain that requirement the signal is moved to after the queue is re-populated (but before it is flushed).
