Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt reconnect via the FB channel on init sync stall #10

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

poltak
Copy link
Member

@poltak poltak commented Apr 7, 2020

TODOs:

  • get re-connection reproducible in test setup
    • current implementation does not work!
  • write test for case: "One side detects the stall, the other one doesn't for whatever reason"
    • implement timeout for each reconnect attempt
  • write test for case: "One side detects the stall, the other one does so later"
    • implement message queue for signal channel
  • write test for case: "Both sides detect the stall and both want to initiate the signalling"
    • figure out if defined behaviours for sender and receiver roles covers this
  • rename the CB passed in to setup fast sync signal channel

Discussion points:

  1. I had to update the init sync to be able to pass in an optional callback to be able to do post-instantiation setup on the FB channel, so in the integration tests I could set custom stall timeouts, etc. This is outlined in 61ac5f0 . I personally feel this is less elegant than it could be. I feel like you might have had a better solution in mind when we discussed this yesterday, which I wasn't able to work out?
  2. Currently this only attempts to reconnect once, when a stall is encountered. The reconnect attempt can fail.
    1. Do we want to be able to try and do it multiple times?
    2. I added a new reconnected event in 51e3582 that is fired from the FB channel after stalled occurs and the re-connection is successful. In the case it's not successful, it fires stalled a second time. This is just a suggestion. Do you have any preferences for how we handle this?
  3. I just realized that the receiveMessage method does not exist on SignalChannel. It's commented out in the interface definition and the implementations in simple-signalling do not seem to have it:
    https://github.com/WorldBrain/simple-signalling/blob/master/ts/firebase.ts#L39
    From what I understanding, this stops me from actually running my test and means the reconnection won't actually work. Did you have plans for this already?

We'll also need to update at least the mobile init sync flow which currently listens and reacts to the stalled event. Assuming the current flow, described in 2.ii., we'll need to at least update it to not move to the stall error screen unless it emits 2 stalled. I'm not sure how the ext's sync handles stalls, but that may need to be updated too.

@poltak poltak requested a review from ShishKabab April 7, 2020 06:31
@ShishKabab
Copy link
Member

ShishKabab commented Apr 7, 2020

I'd say the callback is no problem, but I'd like to see a more descriptive name there, like onFastSyncChannelCreated.

Yes, we do want to retry to connect multiple times, maybe keeping it at 3 max?

With regards to the events, I'd do the following:

  • When we detect the connection is stalled, we fire an event reconnect with { attempt: number }
  • When we keep firing this event with the attempt number increasing
  • On successful reconnect, we fire reconnected
  • After the retry limit exceeds, we fire stalled

We need to decide whether we want to display this in the UI.

Indeed, the receiveMessage event was removed from the SignalChannel in favor of listening to events. This prevents the class from needing to buffer events internally, which we can instead just do externally if we want.

One bigger problem I see here is that there might be a race condition if I see it correctly. If peer A detects a stall sooner than peer B, peer A will initiate reconnecting, but peer B won't be listening for messages yet. Am I correct? We need to account for a few scenarios here off the top of my head:

  • One side detects the stall, the other one doesn't for whatever reason
  • One side detects the stall, the other one does so later
  • Both sides detect the stall and both want to initiate the signalling

@ShishKabab
Copy link
Member

Also, am I right the signalling channel gets released after the first connection is set up? This should now happen after the whole sync is complete.

@poltak
Copy link
Member Author

poltak commented Apr 7, 2020

there might be a race condition if I see it correctly. If peer A detects a stall sooner than peer B, peer A will initiate reconnecting, but peer B won't be listening for messages yet. Am I correct?

I was going by the (perhaps overly confident) assumptions that if one side stalled, the other side would always eventually stall, and that there would be some kind of message queue on the signalling channels. Though they are not right.

RE each of those scenarios:

  1. One side detects the stall, the other one doesn't for whatever reason

This could be solved with a timeout on each reconnect attempt.

  1. One side detects the stall, the other one does so later

This could be solved with some event queue.

  1. Both sides detect the stall and both want to initiate the signalling

I need to think about this some more, though I feel it will be clearer once I've updated how the reconnection attempts work and the events that get emitted.

Also, am I right the signalling channel gets released after the first connection is set up? This should now happen after the whole sync is complete.

I see, so the channel remains open throughout the entire sync so it can attempt reconnects over it. Makes sense. Cheers

@poltak
Copy link
Member Author

poltak commented Apr 8, 2020

I am having a fair amount of trouble making my test work. Every time I set it up to get into a stall and attempt reconnect, it exceeds the 2000ms timeout for tests. I got to the point, after putting a few console.logs around the place, where I thought it was because one side isn't yet ready to listen for the reconnect signal when it gets sent from the other side. I thought this would be solved by implementing some kind of buffer to put all incoming signals on, which I did an overly complicated test implementation in 810feff, although it still did not manage to reach communication between the two sides. Will think about more possible solutions to try.

I am feeling like things are starting to get to the point where it's really complicated adding these extra things in, like the buffer, and I'm feeling fairly demoralized after spending 3 days now on this task, which was I originally expected to talk maybe half of my Monday, and I am still not seeing the end in sight. @ShishKabab what do you think about doing a bit of a read through and review of the work up until this point? I feel like I may have been going in the wrong direction at some point or am missing something obvious, and that it would help having another brain to critique the progress up until now. If you do have a look, I would recommend looking from 4af8dd1 as the following latest commit (810feff) was more of an experiment.

poltak and others added 10 commits April 17, 2020 11:44
- tests and iterations to come, once I figure out how to test this
- needed someway to say the attempt to reestablish a connection has passed or failed
- now if passed it emits 'reconnected', else if fails it emits 'stalled' again
- we may want to discuss this more; mainly wanted to handle reconnection errors
- untested; need to move over to storex-workspace to get tests running
- also add the ability to re-attempt reconnection, now set up to 3 times
- previously was using old `receiveMessage` method
- also ensured signal channel messages get JSON serialized
- prev. resolved before reconnect attempts do
- this is just an experiment, not intended to be code we roll-out (it doesn't even properly)
- the "SignalBuffer" essentially acts as an abstraction on top of the signal channel for the stall code to use to check for new incoming signals, and not miss any that might have been sent before they knew when to listen
- I feel "SignalBuffer" is overly complicated, though it is that way because of the code that uses it - that may indicate that code is overly complicated
@ShishKabab ShishKabab force-pushed the feature/init-sync-reconnect branch from 7caf32a to 1522ab6 Compare April 17, 2020 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants