Misc routing optimization #2803

TheBlueMatt · 2023-12-21T06:00:28Z

During routing, we spend most of our time doing hashmap lookups. It turns out we can drop two of them. The first requires a good bit of work: by assigning each node in memory an arbitrary u32 "node counter", we can drop the main per-node routefinding state map and replace it with a vec. Once we do that, we can also drop the first-hop hashmap lookup that we do on a per-node basis as we walk the network graph, replacing it with a check in the same vec.
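For readers skimming the thread, the core idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `NodeId`, `NodeState`, `lookup_by_hash`, `lookup_by_counter` and `new_dist` are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a 33-byte node public key.
type NodeId = [u8; 33];

/// Hypothetical per-node routefinding state; a real router tracks far more.
#[derive(Clone, Debug, PartialEq)]
struct NodeState {
    best_cost_msat: u64,
}

/// Before: every visit to a node hashes its 33-byte ID.
fn lookup_by_hash<'a>(
    dist: &'a HashMap<NodeId, NodeState>,
    id: &NodeId,
) -> Option<&'a NodeState> {
    dist.get(id)
}

/// After: each node carries a small integer counter, so the state lives in a
/// `Vec` and a visit is a bounds-checked index instead of a hash.
fn lookup_by_counter(dist: &[Option<NodeState>], node_counter: u32) -> Option<&NodeState> {
    dist.get(node_counter as usize).and_then(|entry| entry.as_ref())
}

/// Builds an empty counter-indexed table covering counters `0..=max_node_counter`.
fn new_dist(max_node_counter: u32) -> Vec<Option<NodeState>> {
    vec![None; max_node_counter as usize + 1]
}
```

The trade-off is that the table is sized by the largest counter in the graph rather than by the number of nodes actually visited, which is why reusing the counters of removed nodes matters.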

This is the first in a series of PRs that, in total, substantially more than double our routefinding performance with real data. This first step optimizes the route-finder itself, with later steps more focused on the scorer.

~~Based on #2802.~~

The bulk of this PR was landed in #3103 and #3104. This PR now includes a grab-bag of misc optimizations to get_route which should speed the router up a smidge.

shaavan · 2023-12-22T12:51:31Z

CI's unhappy.
Looks like there's some error in the code

TheBlueMatt · 2024-01-08T21:35:15Z

Fixed.

TheBlueMatt · 2024-01-18T23:56:45Z

Rebased.

codecov-commenter · 2024-01-18T23:56:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.78%. Comparing base (78c0eaa) to head (f689e01).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2803      +/-   ##
==========================================
- Coverage   89.80%   89.78%   -0.02%     
==========================================
  Files         121      121              
  Lines      100045   100094      +49     
  Branches   100045   100094      +49     
==========================================
+ Hits        89845    89869      +24     
- Misses       7533     7555      +22     
- Partials     2667     2670       +3


coderabbitai · 2024-01-18T23:56:57Z

Walkthrough

The project has undergone a significant update, focusing on efficiency and data integrity. The .github/workflows/build.yml file reflects updated paths and keys for network graph and scorer binaries, ensuring the latest versions are used. In the lightning source code, there's a refactoring for feature flag checks, structural optimizations for network graph storage, and scoring logic revisions to enhance performance. Additionally, new counters for node tracking in routing have been introduced, suggesting a move towards more detailed network analysis.

Changes

| File Path | Change Summary |
| --- | --- |
| `.github/workflows/build.yml` | Updated paths and keys for net graph and scorer binaries; new SHA sum checks added. |
| `.../src/ln/features.rs` | Refactored `requires_unknown_bits` method for efficient flag comparison. |
| `.../src/routing/gossip.rs` | Added node counters and restructured fields for cache optimization and consistency checks. |
| `.../src/routing/scoring.rs` | Altered `decay_100k_channel_bounds` function to use graph scorer and current time update. |
| `.../src/util/test_utils.rs` | Introduced node counters to routing structs for enhanced route tracking. |

🐇✨
To code we hop, with every commit,
A graph update, a refactor bit.
With scores and nodes, we weave the net,
Our binary tales, in silicon set. 🌐🔍

coderabbitai

Review Status

Actionable comments generated: 40

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 5592378 and bcbf56d.

Files selected for processing (6)

.github/workflows/build.yml (1 hunks)
lightning/src/ln/features.rs (1 hunks)
lightning/src/routing/gossip.rs (33 hunks)
lightning/src/routing/router.rs (38 hunks)
lightning/src/routing/scoring.rs (1 hunks)
lightning/src/util/test_utils.rs (2 hunks)

Files not summarized due to errors (1)

lightning/src/routing/router.rs: Error: Message exceeds token limit

Additional comments: 40

.github/workflows/build.yml (4)

86-87: Updated binary file paths and keys for the network graph to reflect new versioning. Ensure that the new binary files are correctly placed and accessible at the specified URLs.

91-93: The download and hash verification steps for the network graph binary have been updated. Verify that the hash matches the expected value to ensure integrity of the downloaded file.

98-98: The environment variable for the expected network graph snapshot SHA sum has been updated. Confirm that this new SHA sum is correct and corresponds to the new binary file.

99-115: New steps have been added for caching, fetching, and verifying the scorer binary file. Ensure that the scorer binary is correctly integrated into the CI process and that the SHA sum verification step is accurate.

lightning/src/ln/features.rs (1)

778-794: The refactoring of the requires_unknown_bits method to use chunk iteration for flag comparison is a significant improvement in terms of efficiency. By processing 64 bits at a time instead of 8, the method reduces the number of iterations needed for feature flag checks, which can be beneficial for performance, especially when dealing with a large number of feature flags.
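As a rough illustration of the chunked comparison described in this comment (not the code from the PR), the sketch below checks for unknown required feature bits eight bytes at a time. It assumes the even bit of each pair is the "required" one and that `known` is a same-layout bitmask of understood features; the function and constant names are hypothetical.

```rust
/// Mask selecting the even ("required") bit of each feature pair. This is an
/// assumption modelled on the usual even/odd feature-bit convention.
const REQUIRED_MASK_U8: u8 = 0b0101_0101;
const REQUIRED_MASK_U64: u64 = 0x5555_5555_5555_5555;

/// Byte-at-a-time reference version. Bytes missing from `known` count as unknown.
fn requires_unknown_bits_bytewise(flags: &[u8], known: &[u8]) -> bool {
    flags.iter().enumerate().any(|(i, &byte)| {
        let known_byte = known.get(i).copied().unwrap_or(0);
        (byte & !known_byte & REQUIRED_MASK_U8) != 0
    })
}

/// Chunked version: compare eight bytes per iteration, then finish the tail bytewise.
fn requires_unknown_bits_chunked(flags: &[u8], known: &[u8]) -> bool {
    let mut offset = 0;
    for chunk in flags.chunks_exact(8) {
        let mut flag_bytes = [0u8; 8];
        flag_bytes.copy_from_slice(chunk);
        // Zero-pad `known` if it is shorter than `flags`.
        let mut known_bytes = [0u8; 8];
        for (dst, src) in known_bytes.iter_mut().zip(known.iter().skip(offset)) {
            *dst = *src;
        }
        let flag_word = u64::from_le_bytes(flag_bytes);
        let known_word = u64::from_le_bytes(known_bytes);
        if (flag_word & !known_word & REQUIRED_MASK_U64) != 0 {
            return true;
        }
        offset += 8;
    }
    let tail_known = known.get(offset..).unwrap_or(&[]);
    requires_unknown_bits_bytewise(&flags[offset..], tail_known)
}
```

Because the required-bit mask repeats identically in every byte, the byte order used to assemble each `u64` does not affect the result.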

lightning/src/routing/gossip.rs (34)

40-45: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [1-6]

Imports and uses are modified, ensure that all the newly added imports (AtomicUsize, Ordering, etc.) are used in the code and that there are no unused imports which can lead to warnings or bloat.

67-67: The NodeId struct is introduced or modified. Ensure that the changes to this struct are consistent with the rest of the codebase, especially with respect to serialization and deserialization, as these are common areas where issues arise when modifying data structures.

168-169: New fields removed_node_counters and next_node_counter are added to the NetworkGraph struct. Verify that the logic for managing these counters is correctly implemented throughout the codebase, especially in the context of node removal and addition.

197-197: The max_node_counter field is added to the ReadOnlyNetworkGraph. Ensure that this field is properly maintained and represents the correct maximum value of node_counter across all nodes.

754-761: The ChannelUpdateInfo struct is annotated with repr(C, align(32)). Confirm that the alignment and representation directives are appropriate and that they do not cause any unforeseen issues on different architectures or with FFI boundaries.

847-858: The ChannelInfo struct is annotated with repr(align(128), C). Similar to the previous comment, verify that the alignment and representation directives are appropriate and do not cause issues on different architectures or with FFI boundaries.

898-907: The PartialEq implementation for ChannelInfo is modified. Ensure that all fields that should be compared are included and that this change does not introduce any regressions in areas where ChannelInfo equality checks are performed.

1024-1025: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.

1036-1037: The DirectedChannelInfo struct now includes source_counter and target_counter fields. Verify that these fields are correctly updated and used in routing decisions.

1046-1051: The new method for DirectedChannelInfo is modified to set source_counter and target_counter. Ensure that the logic for determining these values is correct and that it aligns with the intended use of these counters in routing.

1094-1099: The source_counter and target_counter methods are added to DirectedChannelInfo. Verify that these methods are used consistently and correctly throughout the routing logic.

1290-1295: The node_counter field is added to NodeInfo. Ensure that this field is correctly managed throughout the node's lifecycle and that it is consistent with the new vector-based lookup system.

1359-1359: The node_counter field in NodeInfo is initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where NodeInfo is used.

1369-1370: The write method for NetworkGraph now includes a call to test_node_counter_consistency. Verify that this method is correctly implemented and that it does not introduce performance regressions.

1405-1415: The deserialization logic for ChannelInfo and NodeInfo is modified to set node_counter. Ensure that the deserialization process is correct and that the node_counter values are consistent with the serialized data.

1437-1438: The NetworkGraph constructor is modified to initialize removed_node_counters and next_node_counter. Verify that these fields are initialized to the correct values and that the constructor's logic is consistent with the rest of the codebase.

1484-1485: The NetworkGraph constructor is modified to initialize next_node_counter to 0 and removed_node_counters to an empty vector. Confirm that these initial values are correct and that they are handled properly throughout the graph's lifecycle.

1493-1521: The test_node_counter_consistency method is added to NetworkGraph. Verify that this method is correctly implemented and that it is called in appropriate places to ensure the consistency of node_counter values.

1680-1681: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.

1696-1697: The logic for adding a channel between nodes is modified. Verify that the changes are correct and that they do not introduce any regressions in channel management.

1711-1713: The remove_channel_in_nodes method is called within a match arm. Verify that the logic for removing and updating channel information is correct and that it does not introduce any inconsistencies in the network graph.

1723-1727: The node_counter_id array is introduced to manage node_counter values for channels. Verify that this logic is correct and that it properly updates the node_counter values for both nodes associated with a channel.

1832-1833: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.

1862-1862: The remove_channel_in_nodes method is called. Verify that the logic for removing a channel from nodes is correct and that it does not introduce any inconsistencies in the network graph.

1881-1881: The remove_channel_in_nodes method is called within a loop. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.

1890-1890: The logic for removing nodes and updating removed_node_counters is modified. Verify that the changes are correct and that they do not introduce any regressions in node management.

1973-1973: The remove_channel_in_nodes method is called within a loop. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.

2152-2160: The remove_channel_in_nodes method is modified to update removed_node_counters. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.

2218-2222: The max_node_counter method is added to ReadOnlyNetworkGraph. Verify that this method returns the correct maximum value for node_counter and that it is used appropriately throughout the code.

3519-3520: The node_one_counter and node_two_counter fields in ChannelInfo are set to specific values in a test. Verify that these values are appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.

3539-3540: The node_one_counter and node_two_counter fields in ChannelInfo are set to specific values in a test. Verify that these values are appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.

3595-3595: The node_counter field in NodeInfo is set to a specific value in a test. Verify that this value is appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.

3632-3632: The read_network_graph benchmark function is modified. Verify that the benchmark is correctly set up and that it accurately measures the performance of reading the network graph.

3642-3642: The write_network_graph benchmark function is modified. Verify that the benchmark is correctly set up and that it accurately measures the performance of writing the network graph.

lightning/src/routing/scoring.rs (1)

3687-3687: The logic within decay_100k_channel_bounds has been fundamentally changed. Ensure that the new implementation aligns with the intended behavior and that all dependent features are tested with the updated logic.

coderabbitai

Review Status

Actionable comments generated: 28

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 51d9ee3 and 8d80f82.

Files selected for processing (6)

.github/workflows/build.yml (1 hunks)
lightning/src/ln/features.rs (1 hunks)
lightning/src/routing/gossip.rs (33 hunks)
lightning/src/routing/router.rs (38 hunks)
lightning/src/routing/scoring.rs (1 hunks)
lightning/src/util/test_utils.rs (2 hunks)

Files not summarized due to errors (1)

lightning/src/routing/router.rs: Error: Message exceeds token limit

Files skipped from review as they are similar to previous changes (4)

.github/workflows/build.yml
lightning/src/ln/features.rs
lightning/src/routing/gossip.rs
lightning/src/util/test_utils.rs

TheBlueMatt · 2024-03-05T18:16:44Z

Rebased.

coderabbitai

Review Status

Actionable comments generated: 7

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 7a35bf8 and c5e0599.

Files selected for processing (6)

.github/workflows/build.yml (1 hunks)
lightning/src/ln/features.rs (1 hunks)
lightning/src/routing/gossip.rs (33 hunks)
lightning/src/routing/router.rs (38 hunks)
lightning/src/routing/scoring.rs (1 hunks)
lightning/src/util/test_utils.rs (2 hunks)

Files not summarized due to errors (1)

lightning/src/routing/router.rs: Error: Message exceeds token limit

Files not reviewed due to errors (1)

(no review received)

Additional comments: 19

.github/workflows/build.yml (6)

86-87: The path and key for caching the routing graph snapshot have been updated to lightning/net_graph-2023-12-10.bin and ldk-net_graph-v0.0.118-2023-12-10.bin, respectively. Ensure that the new binary file is correctly generated and available at the specified URL to prevent CI failures.

91-93: The fetching process for the routing graph snapshot includes a SHA-256 checksum verification step. It's crucial to verify that the EXPECTED_ROUTING_GRAPH_SNAPSHOT_SHASUM matches the actual checksum of the file at https://bitcoin.ninja/ldk-net_graph-v0.0.118-2023-12-10.bin to ensure integrity and prevent potential security issues.

98-98: The EXPECTED_ROUTING_GRAPH_SNAPSHOT_SHASUM is updated. Double-check that this SHA-256 checksum accurately corresponds to the new routing graph snapshot file to ensure the integrity of the downloaded file.

103-104: The path and key for caching the scorer snapshot have been updated to lightning/scorer-2023-12-10.bin and ldk-scorer-v0.0.118-2023-12-10.bin, respectively. Confirm that the new binary file is correctly generated and accessible at the provided URL to avoid CI disruptions.

108-110: The fetching process for the scorer snapshot includes a SHA-256 checksum verification step. It's essential to ensure that the EXPECTED_SCORER_SNAPSHOT_SHASUM matches the actual checksum of the file at https://bitcoin.ninja/ldk-scorer-v0.0.118-2023-12-10.bin to maintain integrity and avert potential security risks.

115-115: The EXPECTED_SCORER_SNAPSHOT_SHASUM is updated. Verify that this SHA-256 checksum correctly matches the new scorer snapshot file to guarantee the integrity of the downloaded file.

lightning/src/routing/gossip.rs (8)

67-67: The NodeId struct is correctly annotated with #[derive(Clone, Copy, PartialEq, Eq)] to ensure it can be easily copied and compared.

168-169: The addition of removed_node_counters and next_node_counter fields to the NetworkGraph struct is consistent with the PR's objective to optimize routing performance by using unique counters for nodes.

197-197: The max_node_counter field in ReadOnlyNetworkGraph struct is a good addition for tracking the maximum node counter value, which is likely used for performance optimizations in routing.

753-761: The use of #[repr(C, align(32))] for ChannelUpdateInfo struct is a smart optimization to ensure that the struct's layout is predictable and aligned for cache performance. This is particularly important for structures that are frequently accessed and modified in performance-critical paths.

846-852: Similarly, the ChannelInfo struct's alignment with #[repr(align(128), C)] is a thoughtful optimization for cache performance. Ensuring that frequently accessed fields are likely to be on the same or adjacent cache lines can significantly impact performance in routing.

1024-1025: The initialization of node_one_counter and node_two_counter to u32::max_value() in ChannelInfo struct's Readable implementation seems to be a placeholder. It's crucial to ensure that these counters are correctly set elsewhere in the code to meaningful values, as using u32::max_value() directly could lead to incorrect behavior or performance issues.

Verification successful

The search results indicate that node_one_counter and node_two_counter are indeed updated from their initial u32::max_value() state in various parts of the code. There are assertions and explicit updates that suggest these counters are managed and set to meaningful values before being used in a way that could affect behavior or performance. This evidence supports the idea that there is a mechanism in place to ensure the counters are not left at their placeholder values throughout the code's execution.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Verify that node counters are correctly assigned and not left at their initial max value state.
rg "node_(one|two)_counter" --type rust
Length of output: 1526
1381-1382: The test_node_counter_consistency method in NetworkGraph is a good practice for ensuring the internal consistency of node counters. This kind of invariant checking is crucial for debugging and maintaining the integrity of the routing graph.

1537-1537: The read_only method in NetworkGraph provides a safe way to access a read-only view of the network graph. This is useful for operations that need to read network graph data without the risk of modifying it.

lightning/src/routing/scoring.rs (1)

3687-3687: Consider evaluating the performance impact of the time_passed method within decay_100k_channel_bounds. Frequent updates to channel scores could lead to performance bottlenecks, especially with a large number of channels.

Explore optimizing the decay calculation to reduce computational overhead, possibly by batching updates or adjusting the decay frequency based on network size and activity.

lightning/src/routing/router.rs (4)

1126-1132: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [3-6]

The constants _GRAPH_NODE_SMALL and _GRAPH_NODE_FIXED_SIZE are used in static assertions to enforce a specific layout for RouteGraphNode, as clarified in previous discussions. Ensure that these assertions are present and effectively enforce the intended layout for performance optimization.

1181-1189: The addition of source_node_counter and target_node_counter fields in this context further supports the PR's goal of optimizing routing performance by using unique u32 counters. Good consistency across different structs.

1205-1209: The introduction of source_node_counter for blinded paths aligns with the PR's optimization strategy. Ensure that the assumptions regarding the introduction point's visibility as a public node are valid and clearly documented.

1354-1375: The use of #[inline(always)] and #[inline] attributes is based on careful benchmarking, as previously clarified. Consider adding comments to document the benchmarking results and rationale behind these decisions to aid future maintainers.

tnull

Firstly, let me say sorry for taking this long to have a look at this; I believe I self-requested review a while back.

I did a first high-level pass and added an initial round of questions. I have to say that I'm close to a concept NACK on this one: the router logic is hard to reason about as it is and we keep discovering bugs here. It seems to me that this PR significantly increases the code complexity and introduces several new ways things can go wrong. While this seems to work just fine for now, I fear that we'll see more breakage in the router code as a consequence in the future. If we really want to go ahead with this, it would be great if we could find a better abstraction for our newly created data structure that would offer a foolproof API, e.g., so we don't forget to insert/remove reused counters in the corresponding list.

I have yet to run the benchmarks myself to see how much speedup this PR would gain us, but from my first impression I'm not convinced it's worth the increased risks and maintenance costs. Also, it seems that a good chunk of the performance improvements might come from the last few commits alone, which are optimizations that could be applied independently from switching to node counters?

tnull · 2024-03-06T13:58:04Z

lightning/src/routing/router.rs


 	/// Tries to open a network graph file, or panics with a URL to fetch it.
-	pub(crate) fn get_route_file() -> Result<std::fs::File, &'static str> {


Re: "This means future changes to the scorer's data may be harder to benchmark": Would an alternative be to keep both versions as separate benchmarks, so we could still benchmark updates impacting the Scorer's data model with synthetic data?

We could, but I think the random failures model we had before is borderline useless for benchmarking route performance changes. We're somewhat better off doing some kind of conversion from old to new scoring data and then benchmarking from there.

tnull · 2024-03-06T14:11:51Z

lightning/src/routing/gossip.rs

+	///
+	/// These IDs allow the router to avoid a `HashMap` lookup by simply using this value as an
+	/// index in a `Vec`, skipping a big step in some of the hottest code when routing.
+	pub(crate) node_counter: u32,


It seems to me that storing the counter in here could be pretty dangerous if we or users were to clone the NodeInfo, do something with it and come back, at which point we could have dropped the entry and recreated another NodeInfo with the same counter.

Is this an issue? Do we want to store the node_counter as part of a wrapper struct holding both the NodeInfo and the counter? Alternatively, we could make this an Option<u32> and make sure that clone() would reset it to None, asserting we'd have to re-insert/lookup it as a 'fresh' info?

Hmm, I'm not convinced it's an issue. The network graph is publicly read-only - it has several internal consistency requirements (e.g. each channel has both side nodes in the nodes map) which imply users can't freely edit bits of it (though they're welcome to take parts of it and copy them locally to build their own graphs).

tnull · 2024-03-06T14:24:11Z

lightning/src/routing/gossip.rs

@@ -1409,15 +1429,42 @@ impl<L: Deref> NetworkGraph<L> where L::Target: Logger {
 			logger,
 			channels: RwLock::new(IndexedMap::new()),
 			nodes: RwLock::new(IndexedMap::new()),
+			next_node_counter: AtomicUsize::new(0),
+			removed_node_counters: Mutex::new(Vec::new()),


Rather than adding these just side-by-side, could we create a new data structure wrapping them and exposing adequate insert/remove API methods so that we'd never forget to, e.g., call removed_node_counters.push(..) whenever we remove a node?

Hmm, sadly that doesn't improve things much, we end up with a bunch of places where we just replace one removed_node_counters.push(..) with a node_counters.removed_node(...). We can't super easily use a method on the graph to mark a node removed, as we're often doing it with hash map entries already in place from a previous lookup.

Still, test_node_counter_consistency is pretty thorough, so if we have any obvious bugs fuzzing or tests should easily hit assertions in that.
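To make the suggestion above concrete, a wrapper along these lines could look roughly like the sketch below. It is only an illustration: `NodeCounters`, `allocate`, `release` and `table_len` are hypothetical names, and it uses plain fields rather than the `AtomicUsize` and `Mutex` shown in the diff, so it sidesteps the locking and hash-map-entry issues raised in the reply.

```rust
/// Hypothetical wrapper bundling the next-counter value with the free list,
/// so callers cannot update one without going through the other.
#[derive(Debug, Default)]
struct NodeCounters {
    next_node_counter: u32,
    removed_node_counters: Vec<u32>,
}

impl NodeCounters {
    /// Hands out a counter for a newly inserted node, reusing a freed one when possible.
    fn allocate(&mut self) -> u32 {
        if let Some(reused) = self.removed_node_counters.pop() {
            return reused;
        }
        let counter = self.next_node_counter;
        self.next_node_counter += 1;
        counter
    }

    /// Records that the node owning `counter` has been fully removed from the graph.
    fn release(&mut self, counter: u32) {
        debug_assert!(counter < self.next_node_counter);
        debug_assert!(!self.removed_node_counters.contains(&counter));
        self.removed_node_counters.push(counter);
    }

    /// Number of slots a counter-indexed `Vec` needs to cover every counter handed out.
    fn table_len(&self) -> usize {
        self.next_node_counter as usize
    }
}
```

Reusing released counters keeps `table_len` bounded by the peak number of simultaneously known nodes rather than by the total number of nodes ever seen.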

tnull · 2024-03-06T14:26:38Z

lightning/src/routing/gossip.rs

@@ -865,6 +857,24 @@ pub struct ChannelInfo {
 	/// (which we can probably assume we are - no-std environments probably won't have a full
 	/// network graph in memory!).
 	announcement_received_time: u64,
+
+	/// The [`NodeInfo::node_counter`] of the node pointed to by [`Self::node_one`].
+	pub(crate) node_one_counter: u32,


Given that these counters will be reused, how can we be sure that they won't get outdated, especially for cloned ChannelInfos as mentioned above regarding NodeInfo?

They should only get reused when a node was fully removed. The network graph already has internal consistency requirements that any channel has both the source and sink nodes in the graph as well. This just relies on that consistency requirement by adding an additional pointer. In terms of ChannelInfos copied and used outside of a specific graph, indeed, they could point nowhere, but that's kinda by definition - the counters are specific to a NetworkGraph, they aren't global in any other sense, and each route finding operation only cares about a single graph and its contained infos.

tnull · 2024-03-06T14:32:31Z

lightning/src/routing/router.rs

+	/// public node.
+	pub(crate) payer_node_counter: u32,
+	/// A unique ID which describes the first hop counterparty. It will not conflict with any
+	/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the counterparty is


nit: may be equal to one is ambiguous in this context (here and below).

Not sure how this is ambiguous? It's saying that this won't step on the toes of any data in our graph, but it may be equal to some data in our graph if the node is public.

TheBlueMatt · 2024-03-06T15:56:18Z

> I did a first high-level pass and added an initial round of questions. I have to say that I'm close to a concept NACK on this one: the router logic is hard to reason about as it is and we keep discovering bugs here. It seems to me that this PR significantly increases the code complexity and introduces several new ways things can go wrong. While this seems to work just fine for now, I fear that we'll see more breakage in the router code as a consequence in the future. If we really want to go ahead with this, it would be great if we could find a better abstraction for our newly created data structure that would offer a foolproof API, e.g., so we don't forget to insert/remove reused counters in the corresponding list.

Fair, let me encapsulate the node counter logic and remove it from get_route and then we can see how we feel about it.

TheBlueMatt · 2024-03-06T16:00:34Z

> I have yet to run the benchmarks myself to see how much speedup this PR would gain us, but from my first impression I'm not convinced it's worth the increased risks and maintenance costs. Also, it seems that a good chunk of the performance improvements might come from the last few commits alone, which are optimizations that could be applied independently from switching to node counters?

Sadly not. The last few commits reduce the pressure we put on the branch predictor, and improve things a bit on the edges, but the vast majority of the gain here is dropping the hash table lookups. A very large portion of our total routing time is spent just doing hash table lookups directly (we have like 3 or 4 of them we index into in routing - the network graph, gossip data, dist, etc), so dropping one entirely is a huge win.

TheBlueMatt · 2024-06-02T00:42:04Z

Okay, rebased on main. With the new struct I think it's not that messy, and now it also lets us simplify some of the blinded path stuff too, which I think is nice.

valentinewallace

Conceptually, I think get_route isn't too much less readable than before now that the encapsulation has been added. I agree with the concerns about complexity generally, though the speedup seems pretty worthwhile.

lightning/src/blinded_path/mod.rs

lightning/src/routing/router.rs

lightning/src/routing/gossip.rs

valentinewallace · 2024-06-04T20:17:47Z

IMO, the node_counter changes could be split off to make the PR more focused. At the moment there's a lot bundled in here with the cache updates, more minor get_route optimizations, benchmarking updates and feature bit parsing.

TheBlueMatt · 2024-06-06T16:00:48Z

Pulled smaller changes into #3103 and #3104.

TheBlueMatt · 2024-07-10T19:17:33Z

Rebased.

valentinewallace · 2024-07-10T19:19:36Z

CI is sad.

When processing the main loop during routefinding, for each node, we check whether it happens to be our peer in one of our channels. This ensures we never fail to find a route that takes a hop through a private channel of ours, to a private node, then through invoice-provided route hints to reach the ultimate payee. Because this is incredibly hot code, doing a full `HashMap` lookup to check if each node is a first-hop target ends up eating a good chunk of time during routing. Luckily, we can trivially avoid this cost. Because we're already looking up the per-node state in the `dist` map, we can store a bool in each first-hop target's state, avoiding the lookup unless we know it's going to succeed. This requires storing a dummy entry in `dist`, which feels somewhat strange, but is ultimately fine as we should never be looking at per-node state unless we've already found a path to that node, updating the fields in doing so.
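A rough sketch of the dummy-entry idea from this commit message follows. The field names `was_processed` and `is_first_hop_target` come from the PR's code comments; everything else (`DistEntry`, `seed_first_hops`, `needs_first_hop_lookup`, and modelling `dist` as a slice indexed by node counter) is an assumption made for illustration.

```rust
/// Hypothetical per-node state; only the fields needed for the illustration.
#[derive(Clone, Debug)]
struct DistEntry {
    total_cost_msat: u64,
    was_processed: bool,
    is_first_hop_target: bool,
}

/// Before the main loop: store a dummy entry for every first-hop peer so the
/// flag is available wherever the per-node state is read anyway.
fn seed_first_hops(dist: &mut [Option<DistEntry>], first_hop_counters: &[u32]) {
    for &counter in first_hop_counters {
        dist[counter as usize] = Some(DistEntry {
            // Placeholder cost; real values are written once a path is found.
            total_cost_msat: u64::MAX,
            was_processed: false,
            is_first_hop_target: true,
        });
    }
}

/// In the hot loop: decide whether the first-hop map needs to be consulted at
/// all, using only the entry that is being loaded regardless.
fn needs_first_hop_lookup(dist: &[Option<DistEntry>], node_counter: u32) -> bool {
    match &dist[node_counter as usize] {
        Some(entry) => entry.is_first_hop_target,
        None => false,
    }
}
```

With this shape, the expensive first-hop `HashMap` lookup only runs for nodes where `needs_first_hop_lookup` returns `true`.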

While LLVM should inline and elide the redundant calls, because the router is rather large LLVM can decide against inlining in some cases where it would be a nice win. Thus, it's worth DRY'ing the redundant calls explicitly.

Because we now have some slack space in `PathBuildingHop`, we can use it to cache some additional hot values. Here we use it to cache the source and target `node_counter`s for public channels, effectively prefetching the values from the channel state.

It turns out we spend several percent of our routefinding time just checking if nodes and channels require unknown features byte-by-byte. While the cost is almost certainly dominated by the memory read latency, avoiding doing the checks byte-by-byte should reduce the branch count slightly, which may reduce the overhead.
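As a sketch of what replacing a byte-by-byte check with a word-at-a-time check can look like, the functions below test whether any required (even-numbered) feature bit is set that is not in a known set. This is a simplified illustration, not LDK's implementation: the function names are hypothetical, and "known" is modelled as a plain bitmask with the same layout as the flags.

```rust
/// Even-numbered bits are the "required" ones in the feature-bit encoding.
const REQUIRED_MASK_BYTE: u8 = 0b0101_0101;
const REQUIRED_MASK_WORD: u64 = 0x5555_5555_5555_5555;

/// Baseline: one comparison (and one branch) per byte.
fn requires_unknown_bits_bytewise(flags: &[u8], known: &[u8]) -> bool {
    flags.iter().enumerate().any(|(i, &byte)| {
        let known_byte = known.get(i).copied().unwrap_or(0);
        (byte & REQUIRED_MASK_BYTE & !known_byte) != 0
    })
}

/// Loads up to eight bytes starting at `offset`, zero-padding past the end.
fn load_word(bytes: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    if offset < bytes.len() {
        let end = (offset + 8).min(bytes.len());
        buf[..end - offset].copy_from_slice(&bytes[offset..end]);
    }
    u64::from_le_bytes(buf)
}

/// Same check, eight bytes per iteration, so far fewer branches are taken.
fn requires_unknown_bits_wordwise(flags: &[u8], known: &[u8]) -> bool {
    let mut offset = 0;
    while offset < flags.len() {
        let word = load_word(flags, offset);
        let known_word = load_word(known, offset);
        if (word & REQUIRED_MASK_WORD & !known_word) != 0 {
            return true;
        }
        offset += 8;
    }
    false
}
```

Both functions should agree on every input; the word-wise version only changes how many comparisons are needed for a given flags length.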

Because fetching fields from the `$candidate` often implies an indirect read, grouping them together may result in one or two fewer memory loads, so we do so here.

Because we scan per-channel information in the hot inner loop of our routefinding immediately after looking a channel up in a `HashMap`, we end up spending a nontrivial portion of our routefinding time waiting on memory to be read in. While there is only so much we can do about that, ensuring the channel information that we care about is sitting on one or adjacent cache lines avoids paying that penalty twice. Thus, here we manually lay out `ChannelInfo` and `ChannelUpdateInfo` and set them to 128b and 32b alignment, respectively. This wastes some space in memory in our network graph, but improves routing performance in return.
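The manual layout described here can be sketched with Rust's `repr` attributes. The structs below are simplified, hypothetical stand-ins (`UpdateInfoSketch`, `ChannelInfoSketch`, and their field lists are assumptions, not the real `ChannelUpdateInfo` and `ChannelInfo` definitions); they only show how `repr(C)` fixes field order and `align(N)` forces the alignment in bytes.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical, simplified per-direction update data. `repr(C)` keeps the
/// fields in declaration order; `align(32)` rounds size and alignment up to 32.
#[repr(C, align(32))]
#[derive(Clone, Copy, Debug)]
struct UpdateInfoSketch {
    htlc_minimum_msat: u64,
    htlc_maximum_msat: u64,
    fee_base_msat: u32,
    fee_proportional_millionths: u32,
    cltv_expiry_delta: u16,
    enabled: bool,
}

/// Hypothetical, simplified channel entry. With `align(128)` the struct always
/// starts on a 128-byte boundary, so the fields read in the hot loop sit on
/// one cache line or on adjacent ones.
#[repr(C, align(128))]
#[derive(Clone, Copy, Debug)]
struct ChannelInfoSketch {
    node_one_counter: u32,
    node_two_counter: u32,
    capacity_sats: u64,
    one_to_two: Option<UpdateInfoSketch>,
    two_to_one: Option<UpdateInfoSketch>,
}

// Compile-time checks so a later field addition cannot silently change the layout.
const _: () = assert!(size_of::<UpdateInfoSketch>() == 32);
const _: () = assert!(align_of::<ChannelInfoSketch>() == 128);
```

The compile-time assertions are one way to keep such a layout from regressing when fields are added; the cost is the padding that the commit message mentions.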

TheBlueMatt · 2024-07-10T19:38:08Z

Fixed

valentinewallace

LGTM. On the fence about whether this needs a second reviewer so up to you!

lightning/src/routing/router.rs

TheBlueMatt · 2024-07-11T18:16:38Z

It is all pretty trivial, but at least eg the feature optimization and the first hop cache thing could probably use another pair of eyes.

tnull

LGTM, mod one question.

Feel free to ignore nits, as they aren't important.

tnull · 2024-07-15T08:22:10Z

lightning/src/routing/router.rs

+			// dummy entry in dist for each first-hop target, allowing us to do this lookup for
+			// free since we're already looking at the `was_processed` flag.
+			//
+			// Note that all the fields (except `is_first_hop_target`) will be overwritten whenever


Would it be worth adding debug_asserts or similar checks to make sure we don't deviate from this assumption?

Hmm, it would be, but I'm not sure I know how to write such an assertion?

I guess we could check that if one field is updated, all others are too? But maybe not worth it?

Hmm, I'm not sure how to check on a per-field basis. The likely failure case is we add another field and fail to update it when relevant, but I'm not aware of a way to iterate over the fields of a struct?
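One pattern that may help with the "new field was added but never updated" failure mode is exhaustive destructuring: Rust has no built-in way to iterate over struct fields, but a pattern that names every field (without `..`) stops compiling when a field is added. The sketch below is only an illustration with hypothetical names (`HopState`, `overwrite_all_but_first_hop_flag`), not something proposed in the PR itself.

```rust
/// Hypothetical per-node state, reduced to a few fields.
#[derive(Debug, PartialEq)]
struct HopState {
    fee_msat: u64,
    total_cost_msat: u64,
    was_processed: bool,
    is_first_hop_target: bool,
}

/// Overwrites every field except `is_first_hop_target`.
fn overwrite_all_but_first_hop_flag(entry: &mut HopState, fee_msat: u64, total_cost_msat: u64) {
    // No `..` in the pattern: adding a field to `HopState` makes this line a
    // compile error until the new field is handled (or explicitly ignored).
    let HopState {
        fee_msat: fee,
        total_cost_msat: cost,
        was_processed,
        is_first_hop_target: _,
    } = entry;
    *fee = fee_msat;
    *cost = total_cost_msat;
    *was_processed = false;
}
```

This moves the check from a runtime `debug_assert` to compile time, at the cost of a slightly more verbose update site.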

lightning/src/routing/gossip.rs

tnull · 2024-07-15T08:34:14Z

lightning/src/routing/gossip.rs

+//
+// Sadly, this is not possible, however we can still do okay - all of the fields before
+// `one_to_two` and `two_to_one` are just under 128 bytes long, so we can ensure they sit on
+// adjacent cache lines (which are generally fetched together in x86_64 processors).


nit:

Suggested change

-// adjacent cache lines (which are generally fetched together in x86_64 processors).
+// adjacent cache lines (which are generally fetched together in x86-64 processors).

tnull

Feel free to land

TheBlueMatt · 2024-07-17T14:06:03Z

Gonna go ahead and land to get this done, but will tackle nits in a quick followup.

#2803 nits

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch 2 times, most recently from 22f6f4b to 06bcbf6 Compare January 8, 2024 21:35

tnull self-requested a review January 10, 2024 08:40

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from 06bcbf6 to bcbf56d Compare January 18, 2024 23:56

coderabbitai bot reviewed Jan 19, 2024

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from bcbf56d to 8d80f82 Compare January 25, 2024 20:17

coderabbitai bot reviewed Jan 25, 2024

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from 8d80f82 to c5e0599 Compare March 5, 2024 18:16

TheBlueMatt mentioned this pull request Mar 5, 2024

Move code out of add_entry! in get_route #2920

Closed

coderabbitai bot reviewed Mar 5, 2024

tnull reviewed Mar 6, 2024

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch 3 times, most recently from 75d0c2e to f53cc4c Compare March 20, 2024 00:51

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from f53cc4c to 799dc75 Compare June 2, 2024 00:39

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from 799dc75 to fb2d61e Compare June 2, 2024 00:44

valentinewallace self-requested a review June 3, 2024 14:52

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from fb2d61e to db4c369 Compare June 3, 2024 14:53

valentinewallace reviewed Jun 4, 2024

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from db4c369 to 62ffd78 Compare June 6, 2024 15:56

TheBlueMatt mentioned this pull request Jun 6, 2024

Use unique per-node "node_counter"s rather than a node hashmap in routing #3104

Merged

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from 62ffd78 to 5647247 Compare July 10, 2024 19:16

TheBlueMatt changed the title ~~Use unique per-node "node_counter"s rather than a node hashmap in routing~~ Misc routing optimization Jul 10, 2024

TheBlueMatt added 6 commits July 10, 2024 19:38

Consolidate candidate access in add_entry during routing

bed1fb0

TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch from 5647247 to f689e01 Compare July 10, 2024 19:38

valentinewallace approved these changes Jul 11, 2024

TheBlueMatt added the Seeking Code Review label Jul 11, 2024

tnull reviewed Jul 15, 2024

tnull removed the Seeking Code Review label Jul 15, 2024

tnull approved these changes Jul 17, 2024

TheBlueMatt merged commit ac1463b into lightningdevkit:main Jul 17, 2024
12 of 17 checks passed

TheBlueMatt added a commit that referenced this pull request Jul 17, 2024

Merge pull request #3187 from TheBlueMatt/2024-07-routing-nits

012bc50

#2803 nits
