Description
To set the stage, our LSP (running v0.1.3) was getting progressively more and more unhealthy. After investigating some issues with high CPU from gossip, it turns out our `NetworkGraph` was not actually getting pruned--or not quickly enough. The LSP had accumulated a 185 MiB `NetworkGraph` with 47k nodes and 309k channels, over 3.5x the size of the live network graph.
For context, we've been using `utxo_lookup: None` for `P2PGossipSync`. I think when the code was written, there was no async `UtxoLookup`. We've been relying on `NetworkGraph::remove_stale_channels_and_tracking` to prune the network graph.
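For reference, the staleness behavior we've been relying on can be sketched as follows. This is a simplified, hypothetical model, not rust-lightning's actual implementation; the field names mirror the graph fields discussed below, and the two-week window is the assumed default:

```rust
// Hypothetical, simplified model of a channel entry in the graph.
struct ChannelInfo {
    one_to_two: Option<u64>, // last directional update timestamps (unix secs)
    two_to_one: Option<u64>,
    announcement_received_timestamp: u64,
}

const STALE_SECS: u64 = 60 * 60 * 24 * 14; // assumed two-week staleness window

/// Drops stale directional updates, then returns true once the channel
/// itself should be removed from the graph.
fn prune(chan: &mut ChannelInfo, now: u64) -> bool {
    let min_fresh = now.saturating_sub(STALE_SECS);
    if chan.one_to_two.map_or(false, |t| t < min_fresh) {
        chan.one_to_two = None;
    }
    if chan.two_to_one.map_or(false, |t| t < min_fresh) {
        chan.two_to_one = None;
    }
    // A channel with no updates left is still kept around until its
    // announcement has also aged out of the window.
    chan.one_to_two.is_none()
        && chan.two_to_one.is_none()
        && chan.announcement_received_timestamp < min_fresh
}
```

Under this model, a channel whose updates have both been dropped but whose announcement timestamp keeps getting refreshed is never removed, which is consistent with the bloat described below.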
There might also be something degenerate going on when our user nodes pull their graph from the LSP and then proceed to run `P2PGossipSync` with the LSP, or when the LSP tries to gossip its ancient, bloated graph to other peers. Turning off user-node gossip sync is definitely a priority, since it seems to be causing problems.
Curious what was going on, I pulled the raw network graph off the node and poked around (lexe-lsp.network_graph.20250912.bin.zip if you're curious). 250k channels already had their channel updates pruned (i.e., both `one_to_two: None` and `two_to_one: None`), but their `announcement_received_timestamp`s were still too fresh for the channels to be prunable.
Of these channels with pruned updates but a fresh `announcement_received_timestamp`, there was an odd distribution: most channels last received an announcement around the same time, about 8 days ago. Not sure what that's about...
Anyway, desperate to get prod healthy again, I cooked this up: graph: reduce time-to-prune for chans w/ no recent announce to 5d. The LSP is also now forced to prune immediately at startup. The diff is hacky, but it at least got the LSP healthy again. The first prune took quite a while; later prunes went much faster:
```
Pruned network graph in (161.474806812 s) nodes=18373 pruned_nodes=28540 channels=57560 pruned_channels=251810
Pruned network graph in (39.365067000 ms) nodes=18381 pruned_nodes=0 channels=57578 pruned_channels=2
```
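The prune-immediately-at-startup change can be sketched as a background task that invokes the prune function once before entering its sleep loop. This is a generic sketch, not our actual diff; `prune` stands in for a call to `NetworkGraph::remove_stale_channels_and_tracking`:

```rust
use std::thread;
use std::time::Duration;

/// Spawn a background task that prunes once immediately at startup,
/// then again on every `interval` tick.
fn spawn_prune_task<F>(interval: Duration, prune: F) -> thread::JoinHandle<()>
where
    F: Fn() + Send + 'static,
{
    thread::spawn(move || loop {
        prune(); // runs before the first sleep, so startup prunes right away
        thread::sleep(interval);
    })
}
```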
If anyone has any thoughts on what might be causing the graph to get so bloated, that would be appreciated. If necessary, we can also impl `UtxoLookup` now that it's async. The 160 s prune also looks worth optimizing.