Attempt reconnect via the FB channel on init sync stall #10

poltak · 2020-04-07T06:30:30Z

TODOs:

get re-connection reproducible in test setup
- current implementation does not work!
write test for case: "One side detects the stall, the other one doesn't for whatever reason"
- implement timeout for each reconnect attempt
write test for case: "One side detects the stall, the other one does so later"
- implement message queue for signal channel
write test for case: "Both sides detect the stall and both want to initiate the signalling"
- figure out if defined behaviours for sender and receiver roles covers this
rename the CB passed in to setup fast sync signal channel

Discussion points:

I had to update the init sync to be able to pass in an optional callback to be able to do post-instantiation setup on the FB channel, so in the integration tests I could set custom stall timeouts, etc. This is outlined in 61ac5f0 . I personally feel this is less elegant than it could be. I feel like you might have had a better solution in mind when we discussed this yesterday, which I wasn't able to work out?
Currently this only attempts to reconnect once, when a stall is encountered. The reconnect attempt can fail.
1. Do we want to be able to try and do it multiple times?
2. I added a new reconnected event in 51e3582 that is fired from the FB channel after stalled occurs and the re-connection is successful. In the case it's not successful, it fires stalled a second time. This is just a suggestion. Do you have any preferences for how we handle this?
I just realized that the receiveMessage method does not exist on SignalChannel. It's commented out in the interface definition and the implementations in simple-signalling do not seem to have it:
https://github.com/WorldBrain/simple-signalling/blob/master/ts/firebase.ts#L39
From what I understanding, this stops me from actually running my test and means the reconnection won't actually work. Did you have plans for this already?

We'll also need to update at least the mobile init sync flow which currently listens and reacts to the stalled event. Assuming the current flow, described in 2.ii., we'll need to at least update it to not move to the stall error screen unless it emits 2 stalled. I'm not sure how the ext's sync handles stalls, but that may need to be updated too.

ShishKabab · 2020-04-07T10:15:44Z

I'd say the callback is no problem, but I'd like to see a more descriptive name there, like onFastSyncChannelCreated.

Yes, we do want to retry to connect multiple times, maybe keeping it at 3 max?

With regards to the events, I'd do the following:

When we detect the connection is stalled, we fire an event reconnect with { attempt: number }
When we keep firing this event with the attempt number increasing
On successful reconnect, we fire reconnected
After the retry limit exceeds, we fire stalled

We need to decide whether we want to display this in the UI.

Indeed, the receiveMessage event was removed from the SignalChannel in favor of listening to events. This prevents the class from needing to buffer events internally, which we can instead just do externally if we want.

One bigger problem I see here is that there might be a race condition if I see it correctly. If peer A detects a stall sooner than peer B, peer A will initiate reconnecting, but peer B won't be listening for messages yet. Am I correct? We need to account for a few scenarios here off the top of my head:

One side detects the stall, the other one doesn't for whatever reason
One side detects the stall, the other one does so later
Both sides detect the stall and both want to initiate the signalling

ShishKabab · 2020-04-07T10:18:40Z

Also, am I right the signalling channel gets released after the first connection is set up? This should now happen after the whole sync is complete.

poltak · 2020-04-07T12:05:03Z

there might be a race condition if I see it correctly. If peer A detects a stall sooner than peer B, peer A will initiate reconnecting, but peer B won't be listening for messages yet. Am I correct?

I was going by the (perhaps overly confident) assumptions that if one side stalled, the other side would always eventually stall, and that there would be some kind of message queue on the signalling channels. Though they are not right.

RE each of those scenarios:

One side detects the stall, the other one doesn't for whatever reason

This could be solved with a timeout on each reconnect attempt.

One side detects the stall, the other one does so later

This could be solved with some event queue.

Both sides detect the stall and both want to initiate the signalling

I need to think about this some more, though I feel it will be clearer once I've updated how the reconnection attempts work and the events that get emitted.

Also, am I right the signalling channel gets released after the first connection is set up? This should now happen after the whole sync is complete.

I see, so the channel remains open throughout the entire sync so it can attempt reconnects over it. Makes sense. Cheers

poltak · 2020-04-08T12:56:25Z

I am having a fair amount of trouble making my test work. Every time I set it up to get into a stall and attempt reconnect, it exceeds the 2000ms timeout for tests. I got to the point, after putting a few console.logs around the place, where I thought it was because one side isn't yet ready to listen for the reconnect signal when it gets sent from the other side. I thought this would be solved by implementing some kind of buffer to put all incoming signals on, which I did an overly complicated test implementation in 810feff, although it still did not manage to reach communication between the two sides. Will think about more possible solutions to try.

I am feeling like things are starting to get to the point where it's really complicated adding these extra things in, like the buffer, and I'm feeling fairly demoralized after spending 3 days now on this task, which was I originally expected to talk maybe half of my Monday, and I am still not seeing the end in sight. @ShishKabab what do you think about doing a bit of a read through and review of the work up until this point? I feel like I may have been going in the wrong direction at some point or am missing something obvious, and that it would help having another brain to critique the progress up until now. If you do have a look, I would recommend looking from 4af8dd1 as the following latest commit (810feff) was more of an experiment.

- tests and iterations to come, once I figure out how to test this

- needed someway to say the attempt to reestablish a connection has passed or failed - now if passed it emits 'reconnected', else if fails it emits 'stalled' again - we may want to discuss this more; mainly wanted to handle reconnection errors

- untested; need to move over to storex-workspace to get tests running

- also add the ability to re-attempt reconnection, now set up to 3 times

- previously was using old `receiveMessage` method

- also ensured signal channel messages get JSON serialized

- prev. resolved before reconnect attempts do

- this is just an experiment, not intended to be code we roll-out (it doesn't even properly) - the "SignalBuffer" essentially acts as an abstraction on top of the signal channel for the stall code to use to check for new incoming signals, and not miss any that might have been sent before they knew when to listen - I feel "SignalBuffer" is overly complicated, though it is that way because of the code that uses it - that may indicate that code is overly complicated

poltak requested a review from ShishKabab April 7, 2020 06:31

poltak and others added 10 commits April 17, 2020 11:44

Write initial implementation

7566410

- tests and iterations to come, once I figure out how to test this

Add reconnect event to fast-sync channel

7965e7f

- needed someway to say the attempt to reestablish a connection has passed or failed - now if passed it emits 'reconnected', else if fails it emits 'stalled' again - we may want to discuss this more; mainly wanted to handle reconnection errors

Write test to attempt to cover reconnect on stalled syncs

55706d7

- untested; need to move over to storex-workspace to get tests running

Fix some type issues breaking tsc

328a3f8

Update event emissions surrounding reconnections

fe6ea70

- also add the ability to re-attempt reconnection, now set up to 3 times

Move reconnect signalling over to event based API

7b20087

- previously was using old `receiveMessage` method

Pass in max reconnect attempts to init sync constructor

84964ae

- also ensured signal channel messages get JSON serialized

Ensure stalling detection waits for reconnect attempts

7544f42

- prev. resolved before reconnect attempts do

Progress on re-connect during initial sync

1522ab6

ShishKabab force-pushed the feature/init-sync-reconnect branch from 7caf32a to 1522ab6 Compare April 17, 2020 09:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt reconnect via the FB channel on init sync stall #10

Attempt reconnect via the FB channel on init sync stall #10

poltak commented Apr 7, 2020 •

edited

Loading

ShishKabab commented Apr 7, 2020 •

edited

Loading

ShishKabab commented Apr 7, 2020

poltak commented Apr 7, 2020

poltak commented Apr 8, 2020

Attempt reconnect via the FB channel on init sync stall #10

Are you sure you want to change the base?

Attempt reconnect via the FB channel on init sync stall #10

Conversation

poltak commented Apr 7, 2020 • edited Loading

ShishKabab commented Apr 7, 2020 • edited Loading

ShishKabab commented Apr 7, 2020

poltak commented Apr 7, 2020

poltak commented Apr 8, 2020

poltak commented Apr 7, 2020 •

edited

Loading

ShishKabab commented Apr 7, 2020 •

edited

Loading