Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: out connections leak #3077

Merged
merged 13 commits into from
Oct 3, 2024
Merged

fix: out connections leak #3077

merged 13 commits into from
Oct 3, 2024

Conversation

gabrielmer
Copy link
Contributor

@gabrielmer gabrielmer commented Oct 1, 2024

Description

Once we started promptly disconnecting from excess in connections, we began seeing our nodes significantly exceeding their out connections targets.

The root cause was a race condition in our keep alive loop

nwaku/waku/node/waku_node.nim

Lines 1241 to 1258 in 643ab20

proc keepaliveLoop(node: WakuNode, keepalive: chronos.Duration) {.async.} =
while node.started:
# Keep all connected peers alive while running
trace "Running keepalive"
# First get a list of connected peer infos
let peers =
node.peerManager.wakuPeerStore.peers().filterIt(it.connectedness == Connected)
for peer in peers:
try:
let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
let pingDelay = await node.libp2pPing.ping(conn)
await conn.close()
except CatchableError as exc:
waku_node_errors.inc(labelValues = ["keep_alive_failure"])
await sleepAsync(keepalive)

The case is the following:

  1. A node receives incoming connections beyond its target, and it takes a small amount of time from the moment nim-libp2p accepts the connection until our peer manager notices that it's beyond our in target and disconnects
  2. In the time while we're connected to this in connection, we start running the keep alive loop and have that peer in the list of connected peers that we should ping
  3. While we're pinging other peers, we disconnect from the in connection as we noticed it's beyond our target
  4. Because the list of the nodes to ping was generated before we disconnected from the node, we ping the node. As there's no existing connection, we end up creating a new out connection towards the node

The proposed change to avoid this race condition is to delegate the responsibility of the periodic ping to the node that originally initiated the connection. Or in other words, whoever initiated a connection is the one responsible to ping periodically to maintain it open - there's no need to have both nodes pinging each other.

Changes

  • extended connectedPeers() to allow to get connected peers from all protocols
  • modified keepaliveLoop so that we only ping nodes in our out connections list

Issue

closes #3063

Copy link

github-actions bot commented Oct 1, 2024

This PR may contain changes to database schema of one of the drivers.

If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues.

Please make sure the label release-notes is added to make sure upgrade instructions properly highlight this change.

@gabrielmer gabrielmer force-pushed the chore-debug-excess-connections branch from 3f0c865 to 2656819 Compare October 1, 2024 11:35
Copy link

github-actions bot commented Oct 1, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:3077

Built from 3deaacf

@gabrielmer gabrielmer changed the title chore: [DEBUG] investigate excess connections fix: out connections leak Oct 2, 2024
Copy link
Contributor

@SionoiS SionoiS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Do we need to revisit how missed pings are handled?
If only one side pings maybe we should be more lenient before disconnecting.

It may not be a problem in practice, IDK.

Copy link
Collaborator

@Ivansete-status Ivansete-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for it! 💯

@gabrielmer
Copy link
Contributor Author

LGTM

Do we need to revisit how missed pings are handled? If only one side pings maybe we should be more lenient before disconnecting.

It may not be a problem in practice, IDK.

Great point! I see that the connection should timeout after 4-5 missed pings (~10 minutes without being reachable)

let defaultKeepalive = 2.minutes # 20% of the default chronosstream timeout duration

I think it looks reasonable? Don't think it should give issues, lmk what you think :)

@gabrielmer gabrielmer self-assigned this Oct 2, 2024
Copy link
Contributor

@NagyZoltanPeter NagyZoltanPeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insightful! Thank you!

@gabrielmer gabrielmer merged commit eb2bbae into master Oct 3, 2024
10 of 11 checks passed
@gabrielmer gabrielmer deleted the chore-debug-excess-connections branch October 3, 2024 09:37
gabrielmer added a commit that referenced this pull request Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: excess relay connections are not being properly pruned
4 participants