gossip: bloated network graph, slow to prune #4070

@phlip9

Description

To set the stage, our LSP (running v0.1.3) was getting progressively less healthy. After investigating some high-CPU issues in gossip handling, it turned out our NetworkGraph was not actually getting pruned--or at least not quickly enough. The LSP had accumulated a 185 MiB NetworkGraph with 47k nodes and 309k channels, over 3.5x the size of the live network graph.

For context, we've been using utxo_lookup: None for P2PGossipSync. I think when the code was written, there was no async UtxoLookup. We've been relying on NetworkGraph::remove_stale_channels_and_tracking to prune the network graph.
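
For concreteness, the relevant wiring looks roughly like this (a simplified sketch, not our actual code; `setup_gossip` and the generics are just illustrative):

```rust
use std::ops::Deref;
use std::sync::Arc;

use bitcoin::Network;
use lightning::routing::gossip::{NetworkGraph, P2PGossipSync};
use lightning::routing::utxo::UtxoLookup;
use lightning::util::logger::Logger;

/// Gossip sync with no UTXO validation, plus a periodic prune that is the
/// only thing removing channels from the graph in our deployment.
fn setup_gossip<L: Deref + Clone>(network: Network, logger: L)
where
    L::Target: Logger,
{
    let graph = Arc::new(NetworkGraph::new(network, logger.clone()));

    // utxo_lookup: None -- channel announcements are never checked on-chain.
    let _gossip_sync = P2PGossipSync::new(
        Arc::clone(&graph),
        None::<Arc<dyn UtxoLookup>>,
        logger,
    );

    // Called periodically from a background task.
    graph.remove_stale_channels_and_tracking();
}
```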

There might also be something degenerate going on when our user nodes pull their graph from the LSP and then run P2PGossipSync against the LSP. Or when the LSP tries to gossip its ancient, bloated graph to other peers. Turning off user node gossip sync is definitely a priority, since it seems to be causing problems.

Curious what was going on, I pulled the raw network graph off the node and poked around (lexe-lsp.network_graph.20250912.bin.zip if you're curious). 250k channels already had their channel updates pruned (i.e., both one_to_two: None and two_to_one: None), but their announcement_received_timestamps were still too recent for the channels themselves to be pruned.
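
Something like this sketch is enough to reproduce the count (illustrative, not my exact script; it just deserializes the graph and counts channels with both directional updates gone):

```rust
use std::io::Cursor;
use std::ops::Deref;

use lightning::routing::gossip::NetworkGraph;
use lightning::util::logger::Logger;
use lightning::util::ser::ReadableArgs;

/// Count channels whose directional updates are already gone (both
/// `one_to_two` and `two_to_one` are `None`) but which are still in the graph.
fn count_update_pruned_channels<L: Deref>(bytes: &[u8], logger: L) -> usize
where
    L::Target: Logger,
{
    let graph = NetworkGraph::<L>::read(&mut Cursor::new(bytes), logger)
        .expect("failed to deserialize NetworkGraph");
    graph
        .read_only()
        .channels()
        .unordered_iter()
        .filter(|(_scid, chan)| chan.one_to_two.is_none() && chan.two_to_one.is_none())
        .count()
}
```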

Of these channels with pruned updates but a fresh announcement_received_timestamp, there was kind of a weird distribution: most of them last received an announcement around the same time, about 8 days ago. Not sure what that's about...

lexe-lsp.network_graph.prunable_channels_by_last_announcement_recvd.jpeg

Anyway, desperate to get prod healthy again, I cooked this up: graph: reduce time-to-prune for chans w/ no recent announce to 5d. The LSP is also now forced to prune immediately at startup. The diff is hacky, but it at least got the LSP healthy again. The first prune took quite a while, with later prunes going much faster:

Pruned network graph in (161.474806812 s)  nodes=18373 pruned_nodes=28540 channels=57560 pruned_channels=251810
Pruned network graph in  (39.365067000 ms) nodes=18381 pruned_nodes=0     channels=57578 pruned_channels=2
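
For anyone else hitting this before a proper fix: instead of patching LDK like my diff does, a similar one-shot effect at startup can probably be had by calling remove_stale_channels_and_tracking_with_time with a timestamp shifted into the future (hacky sketch, assuming the default staleness window is still two weeks; `force_startup_prune` is an illustrative name):

```rust
use std::ops::Deref;
use std::time::{SystemTime, UNIX_EPOCH};

use lightning::routing::gossip::NetworkGraph;
use lightning::util::logger::Logger;

/// Prune aggressively once at startup by pretending "now" is ~9 days in the
/// future, so a ~2-week staleness window behaves like ~5 days for this call.
/// Note this also shortens the separate removed-entries tracking window.
fn force_startup_prune<L: Deref>(graph: &NetworkGraph<L>)
where
    L::Target: Logger,
{
    const NINE_DAYS_SECS: u64 = 9 * 24 * 60 * 60;
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before unix epoch")
        .as_secs();
    graph.remove_stale_channels_and_tracking_with_time(now + NINE_DAYS_SECS);
}
```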

If anyone has any thoughts on what might be causing the graph to get so bloated, that would be appreciated. If necessary, we can also impl UtxoLookup now that it's async. The 160s prune also looks worth optimizing.
