Improve node connection resiliency #159
Merged
This PR makes several changes to the way our connections are established and maintained.
On node start, we try to connect to known peers first. Known peers are boot nodes and dial-back peers.
I introduced a config option/CLI flag that treats failure to connect to a boot node as a critical error and halts the node. This applies only to the initial connection - if we're starting up and can't reach our boot node, something might be wrong. However, if our boot node goes down at some later point, that should not be a cause for panic IMO. This setting is off by default, so you need to explicitly request the stricter behavior.
Besides connecting to known peers on node start, we spin up a goroutine that periodically checks whether we're still connected to our boot nodes. If we're not, it tries to reestablish the connection. This ties into the previous point - if we were connected to a boot node and it went offline, we try to reach it occasionally so that we reconnect when it comes back up. The default check interval is one minute.
We do not treat failure to connect to dial-back peers as seriously. We simply log those errors and do not retry dialing them. I think these peers are often more ephemeral, and they might just be gone. In any case, if they're still around, we could pick them up again through peer discovery.
The notifiee implementation had a bug where, on peer connect, we saved peer information by retrieving its address info from the libp2p peerstore. However, at that point the peerstore does not have this info yet, so we erroneously saved the peer in our DB without it. It was not a huge problem because we had the node's multiaddress, but now this is fixed.
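The PR description does not say how the fix was made. One plausible approach, sketched below under that assumption, is to take the address from the connection itself instead of relying only on the peerstore. In go-libp2p the connection passed to the notifiee exposes `RemotePeer()` and `RemoteMultiaddr()`; here `conn`, `peerstore`, `peerRecord`, and `recordOnConnect` are simplified stand-ins so the example has no external dependencies.

```go
package main

// conn is a hypothetical stand-in for a live connection; the fields mirror
// what RemotePeer() and RemoteMultiaddr() would return in go-libp2p.
type conn struct {
	remotePeer string
	remoteAddr string
}

// peerstore is a stand-in mapping peer ID to known addresses. At connect
// time it may not yet contain an entry for the remote peer.
type peerstore map[string][]string

// peerRecord is what would be persisted to the DB.
type peerRecord struct {
	ID    string
	Addrs []string
}

// recordOnConnect builds the record from whatever the peerstore already
// knows plus the address of the connection itself, so the record is never
// saved without an address.
func recordOnConnect(ps peerstore, c conn) peerRecord {
	addrs := append([]string(nil), ps[c.remotePeer]...)
	known := false
	for _, a := range addrs {
		if a == c.remoteAddr {
			known = true
			break
		}
	}
	if !known {
		addrs = append(addrs, c.remoteAddr)
	}
	return peerRecord{ID: c.remotePeer, Addrs: addrs}
}
```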
Peer discovery is now decoupled from connecting to known peers, as boot nodes and dial-back peers are semantically different things - we discover peers per-topic, while a boot node is a more general and direct connection.