Improve node connection resiliency #159

Maelkum · 2024-07-12T14:25:25Z

This PR makes several changes to the way our connections are established and maintained.

On node start, we try to connect to known peers first. Known peers are boot nodes and dial-back peers.

I introduced a config option/CLI flag to treat failure to connect to a boot node as a critical error, halting the node. This goes only for the initial connection - if we're starting and we can't connect to our boot node - something might be wrong. However, if sometime in the future our boot node goes down, that should not be a cause for panic IMO. Also, by default this setting is off, so you need to explicitly request this stricter behavior.

Besides connecting to known peers on node start, we spin up a goroutine that will periodically check if we're still connected to our boot nodes. If not, it will try to reestablish a connection. This ties into the previous point - if we were connected to a node and it went offline, try to reach it occasionally so we reconnect when it comes back up. Default interval at which this is done is one minute.

We do not treat failure to connect to dial-back peers this seriously, we simply log those errors, and we do not retry dialing back to them. I think it's often the case that these can be more ephemeral, and they might just be gone. In any case, if they're around, we could pick them up on peer discovery.

Notifiee implementation had a bug where we saved peer information on peer connect by retrieving it's address info from the libp2p peerstore. However, at this moment in time the peerstore does not have this info yet, which is why we erroneously saved the peer in our DB without it. It was not a huge problem because we had the node multiaddress, but now this is fixed.

Peer discovery is decoupled from connecting to known peers, as boot nodes and dialback peers are semantically different things - we discover peers per-topic, while boot node is a more general and direct connection.

… on the initial connect

- allow treating failure to connect to boot nodes as a halting error - retry connection to boot nodes periodically in the background - decouple boot nodes/dialback peers from peer discovery

Maelkum added 4 commits July 11, 2024 23:38

Notifiee - on connection, don't use the peerstore as it will be empty…

4b68035

… on the initial connect

Slight changes for peer discovery loop

ddfb5f8

Improve mechanism for connecting to boot nodes

6ccb948

- allow treating failure to connect to boot nodes as a halting error - retry connection to boot nodes periodically in the background - decouple boot nodes/dialback peers from peer discovery

Decouple boot node reachability interval from peer discovery interval

51b9b88

Maelkum requested a review from dmikey July 12, 2024 14:25

Maelkum self-assigned this Jul 12, 2024

On node start subscribe to topics first

129eeed

dmikey approved these changes Jul 22, 2024

View reviewed changes

dmikey merged commit 01dc8a2 into main Jul 22, 2024
5 checks passed

dmikey deleted the host-discovery-tweaks branch July 22, 2024 12:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve node connection resiliency #159

Improve node connection resiliency #159

Maelkum commented Jul 12, 2024

Improve node connection resiliency #159

Improve node connection resiliency #159

Conversation

Maelkum commented Jul 12, 2024