fix: out connections leak #3077

gabrielmer · 2024-10-01T11:32:37Z

Description

Once we started promptly disconnecting excess in connections, we began seeing our nodes significantly exceed their out connection targets.

The root cause was a race condition in our keep alive loop:

nwaku/waku/node/waku_node.nim

Lines 1241 to 1258 in 643ab20

    
    proc keepaliveLoop(node: WakuNode, keepalive: chronos.Duration) {.async.} =
      while node.started:
        # Keep all connected peers alive while running
        trace "Running keepalive"

        # First get a list of connected peer infos
        let peers =
          node.peerManager.wakuPeerStore.peers().filterIt(it.connectedness == Connected)

        for peer in peers:
          try:
            let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
            let pingDelay = await node.libp2pPing.ping(conn)
            await conn.close()
          except CatchableError as exc:
            waku_node_errors.inc(labelValues = ["keep_alive_failure"])

        await sleepAsync(keepalive)

The case is the following:

A node receives incoming connections beyond its target. There is a short delay between nim-libp2p accepting a connection and our peer manager noticing that it exceeds our in target and disconnecting it.
While that in connection is still open, the keep alive loop starts running and includes the peer in the list of connected peers that we should ping.
While we're pinging other peers, the peer manager disconnects the in connection because it exceeds our target.
Because the list of peers to ping was generated before the disconnect, we still ping that peer. As there is no longer an existing connection, the dial ends up creating a new out connection towards it.
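
The sequence above can be reproduced in a small standalone model. This is only a sketch using the Nim standard library: `ToyNode`, `connectedPeers` and `ping` are hypothetical names for illustration, not nwaku or nim-libp2p APIs, and the "dial opens a new out connection when none exists" behaviour is modelled by hand.

```nim
import std/[tables, sequtils]

type
  Direction = enum
    Inbound, Outbound

  ToyNode = object
    # peer id -> direction of the live connection (hypothetical model)
    conns: Table[string, Direction]

proc connectedPeers(node: ToyNode): seq[string] =
  ## Snapshot of the currently connected peer ids.
  toSeq(node.conns.keys)

proc ping(node: var ToyNode, peer: string) =
  ## Models dial-then-ping: if no connection exists, dialing opens a new
  ## outbound one (assumption made for this sketch).
  if peer notin node.conns:
    node.conns[peer] = Outbound

var node = ToyNode(conns: {"A": Outbound, "B": Inbound}.toTable)

# 1. The keep alive round snapshots the connected peers; "B" is an excess in connection.
let snapshot = node.connectedPeers()

# 2. While the round is in progress, the peer manager prunes the excess in connection.
node.conns.del("B")

# 3. The round continues with the stale snapshot and re-dials "B".
for peer in snapshot:
  node.ping(peer)

# "B" is back, this time as an out connection that nobody asked for.
doAssert node.conns["B"] == Outbound
```

The leak comes from acting on the stale snapshot: nothing checks whether the peer is still connected before dialing it.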

The proposed change to avoid this race condition is to delegate the responsibility of the periodic ping to the node that originally initiated the connection. In other words, whoever initiated a connection is responsible for pinging periodically to keep it open; there's no need to have both nodes pinging each other.

Changes

extended connectedPeers() so that it can return connected peers across all protocols
modified keepaliveLoop so that we only ping nodes in our out connections list
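
As a rough illustration of the second change, the selection of ping targets could look like the sketch below. It reuses a toy connection table rather than the real peer manager; `keepaliveTargets` is a hypothetical helper, and the actual `connectedPeers()` signature is not shown here.

```nim
import std/tables

type
  Direction = enum
    Inbound, Outbound

proc keepaliveTargets(conns: Table[string, Direction]): seq[string] =
  ## Peers this node is responsible for pinging: only the ones it dialed itself.
  for peer, dir in conns:
    if dir == Outbound:
      result.add(peer)

let conns = {"A": Outbound, "B": Inbound, "C": Outbound}.toTable
let targets = keepaliveTargets(conns)

# The in connection "B" is never pinged by us, so a stale entry for it
# can no longer trigger a fresh outbound dial.
doAssert "B" notin targets
doAssert targets.len == 2
```

With this split, an excess in connection that gets pruned mid-round is simply left alone; keeping it alive is the remote dialer's job.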

Issue

closes #3063

github-actions · 2024-10-01T11:32:53Z

This PR may contain changes to database schema of one of the drivers.

If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues.

Please make sure the label release-notes is added to make sure upgrade instructions properly highlight this change.

github-actions · 2024-10-01T11:40:22Z

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:3077

Built from 3deaacf

SionoiS

LGTM

Do we need to revisit how missed pings are handled?
If only one side pings maybe we should be more lenient before disconnecting.

It may not be a problem in practice, IDK.

Ivansete-status

LGTM! Thanks for it! 💯

gabrielmer · 2024-10-02T13:13:56Z

> LGTM
>
> Do we need to revisit how missed pings are handled? If only one side pings maybe we should be more lenient before disconnecting.
>
> It may not be a problem in practice, IDK.

Great point! I see that the connection should time out after 4-5 missed pings (~10 minutes without being reachable):

nwaku/waku/node/waku_node.nim

Line 1261 in e406673

    
    let defaultKeepalive = 2.minutes # 20% of the default chronosstream timeout duration

I think it looks reasonable? Don't think it should give issues, lmk what you think :)
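
For reference, the arithmetic behind that estimate, as a quick sketch. The 10-minute stream timeout is an assumption inferred from the "20%" code comment and the "~10 minutes" figure above, not a value read from chronos.

```nim
import std/times

let keepalive = initDuration(minutes = 2)      # defaultKeepalive from the snippet above
let streamTimeout = initDuration(minutes = 10) # assumed: 2 minutes is 20% of this

# Number of keep alive intervals that fit inside one stream timeout.
let intervals = streamTimeout.inSeconds div keepalive.inSeconds

doAssert keepalive.inSeconds * 100 div streamTimeout.inSeconds == 20
doAssert intervals == 5
```

So even with only one side pinging, roughly five consecutive pings have to fail before the connection idles out.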

NagyZoltanPeter

Insightful! Thank you!

adding debug logs

2656819

gabrielmer force-pushed the chore-debug-excess-connections branch from 3f0c865 to 2656819 Compare October 1, 2024 11:35

gabrielmer added 9 commits October 1, 2024 15:07

more logs

a0b23fb

more logs

1ca9970

logs

3a1f0a8

updating nim-libp2p to debug branch

75ab0e1

logs

98ce230

ping log

5c9bc68

adding potential fix

f3b533f

removing debug logs

41ffa4b

Revert "updating nim-libp2p to debug branch"

5d40f73

This reverts commit 75ab0e1.

gabrielmer changed the title ~~chore: [DEBUG] investigate excess connections~~ fix: out connections leak Oct 2, 2024

Merge branch 'master' into chore-debug-excess-connections

f1b12ca

gabrielmer requested review from Ivansete-status, jm-clius, SionoiS, alrevuelta, darshankabariya, richard-ramos and NagyZoltanPeter and removed request for Ivansete-status October 2, 2024 11:36

gabrielmer marked this pull request as ready for review October 2, 2024 11:36

SionoiS approved these changes Oct 2, 2024

Ivansete-status approved these changes Oct 2, 2024

gabrielmer self-assigned this Oct 2, 2024

Merge branch 'master' into chore-debug-excess-connections

860d1a4

NagyZoltanPeter approved these changes Oct 3, 2024

Merge branch 'master' into chore-debug-excess-connections

5f5e9f7

gabrielmer merged commit eb2bbae into master Oct 3, 2024
10 of 11 checks passed

gabrielmer deleted the chore-debug-excess-connections branch October 3, 2024 09:37

gabrielmer added a commit that referenced this pull request Oct 3, 2024

fix: out connections leak (#3077)

45c6d89

gabrielmer added a commit that referenced this pull request Oct 3, 2024

fix: out connections leak (#3077)

843a080
