Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve node connection resiliency #159

Merged
merged 5 commits into from
Jul 22, 2024
Merged

Improve node connection resiliency #159

merged 5 commits into from
Jul 22, 2024

Conversation

Maelkum
Copy link
Collaborator

@Maelkum Maelkum commented Jul 12, 2024

This PR makes several changes to the way our connections are established and maintained.

On node start, we try to connect to known peers first. Known peers are boot nodes and dial-back peers.

I introduced a config option/CLI flag to treat failure to connect to a boot node as a critical error, halting the node. This goes only for the initial connection - if we're starting and we can't connect to our boot node - something might be wrong. However, if sometime in the future our boot node goes down, that should not be a cause for panic IMO. Also, by default this setting is off, so you need to explicitly request this stricter behavior.

Besides connecting to known peers on node start, we spin up a goroutine that will periodically check if we're still connected to our boot nodes. If not, it will try to reestablish a connection. This ties into the previous point - if we were connected to a node and it went offline, try to reach it occasionally so we reconnect when it comes back up. Default interval at which this is done is one minute.

We do not treat failure to connect to dial-back peers this seriously, we simply log those errors, and we do not retry dialing back to them. I think it's often the case that these can be more ephemeral, and they might just be gone. In any case, if they're around, we could pick them up on peer discovery.

Notifiee implementation had a bug where we saved peer information on peer connect by retrieving it's address info from the libp2p peerstore. However, at this moment in time the peerstore does not have this info yet, which is why we erroneously saved the peer in our DB without it. It was not a huge problem because we had the node multiaddress, but now this is fixed.

Peer discovery is decoupled from connecting to known peers, as boot nodes and dialback peers are semantically different things - we discover peers per-topic, while boot node is a more general and direct connection.

Maelkum added 4 commits July 11, 2024 23:38
- allow treating failure to connect to boot nodes as a halting error
- retry connection to boot nodes periodically in the background
- decouple boot nodes/dialback peers from peer discovery
@Maelkum Maelkum requested a review from dmikey July 12, 2024 14:25
@Maelkum Maelkum self-assigned this Jul 12, 2024
@dmikey dmikey merged commit 01dc8a2 into main Jul 22, 2024
5 checks passed
@dmikey dmikey deleted the host-discovery-tweaks branch July 22, 2024 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants