Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc routing optimization #2803

Merged

Conversation

TheBlueMatt
Copy link
Collaborator

@TheBlueMatt TheBlueMatt commented Dec 21, 2023

During routing, we spend most of our time doing hashmap lookups. It turns out, we can drop two of them, the first requires a good bit of work - assigning each node in memory a random u32 "node counter", we can then drop the main per-node routefinding state map and replace it with a vec. Once we do that, we can also drop the first-hop hashmap lookup that we do on a per node basis as we walk the network graph, replacing it with a check in the same vec.

This is the first in a series of PRs that, in total, substantially more than double our routefinding performance with real data. This first step optimizes the route-finder itself, with later steps more focused on the scorer.

Based on #2802.

The bulk of this PR was landed in #3103 and #3104. This PR now includes a grab-bag of misc optimizations to get_route which should speed the router up a smidge.

@shaavan
Copy link
Contributor

shaavan commented Dec 22, 2023

CI's unhappy.
Looks like there's some error in the code

@TheBlueMatt TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch 2 times, most recently from 22f6f4b to 06bcbf6 Compare January 8, 2024 21:35
@TheBlueMatt
Copy link
Collaborator Author

Fixed.

@TheBlueMatt
Copy link
Collaborator Author

Rebased.

@codecov-commenter
Copy link

codecov-commenter commented Jan 18, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.78%. Comparing base (78c0eaa) to head (f689e01).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2803      +/-   ##
==========================================
- Coverage   89.80%   89.78%   -0.02%     
==========================================
  Files         121      121              
  Lines      100045   100094      +49     
  Branches   100045   100094      +49     
==========================================
+ Hits        89845    89869      +24     
- Misses       7533     7555      +22     
- Partials     2667     2670       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

coderabbitai bot commented Jan 18, 2024

Walkthrough

The project has undergone a significant update, focusing on efficiency and data integrity. The .github/workflows/build.yml file reflects updated paths and keys for network graph and scorer binaries, ensuring the latest versions are used. In the lightning source code, there's a refactoring for feature flag checks, structural optimizations for network graph storage, and scoring logic revisions to enhance performance. Additionally, new counters for node tracking in routing have been introduced, suggesting a move towards more detailed network analysis.

Changes

File Path Change Summary
.github/workflows/build.yml Updated paths and keys for net graph and scorer binaries; new SHA sum checks added.
.../src/ln/features.rs Refactored requires_unknown_bits method for efficient flag comparison.
.../src/routing/gossip.rs Added node counters and restructured fields for cache optimization and consistency checks.
.../src/routing/scoring.rs Altered decay_100k_channel_bounds function to use graph scorer and current time update.
.../src/util/test_utils.rs Introduced node counters to routing structs for enhanced route tracking.

🐇✨
To code we hop, with every commit,
A graph update, a refactor bit.
With scores and nodes, we weave the net,
Our binary tales, in silicon set. 🌐🔍


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 40

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 5592378 and bcbf56d.
Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • lightning/src/ln/features.rs (1 hunks)
  • lightning/src/routing/gossip.rs (33 hunks)
  • lightning/src/routing/router.rs (38 hunks)
  • lightning/src/routing/scoring.rs (1 hunks)
  • lightning/src/util/test_utils.rs (2 hunks)
Files not summarized due to errors (1)
  • lightning/src/routing/router.rs: Error: Message exceeds token limit
Additional comments: 40
.github/workflows/build.yml (4)
  • 86-87: Updated binary file paths and keys for the network graph to reflect new versioning. Ensure that the new binary files are correctly placed and accessible at the specified URLs.
  • 91-93: The download and hash verification steps for the network graph binary have been updated. Verify that the hash matches the expected value to ensure integrity of the downloaded file.
  • 98-98: The environment variable for the expected network graph snapshot SHA sum has been updated. Confirm that this new SHA sum is correct and corresponds to the new binary file.
  • 99-115: New steps have been added for caching, fetching, and verifying the scorer binary file. Ensure that the scorer binary is correctly integrated into the CI process and that the SHA sum verification step is accurate.
lightning/src/ln/features.rs (1)
  • 778-794: The refactoring of the requires_unknown_bits method to use chunk iteration for flag comparison is a significant improvement in terms of efficiency. By processing 64 bits at a time instead of 8, the method reduces the number of iterations needed for feature flag checks, which can be beneficial for performance, especially when dealing with a large number of feature flags.
lightning/src/routing/gossip.rs (34)
  • 40-45: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [1-6]

Imports and uses are modified, ensure that all the newly added imports (AtomicUsize, Ordering, etc.) are used in the code and that there are no unused imports which can lead to warnings or bloat.

  • 67-67: The NodeId struct is introduced or modified. Ensure that the changes to this struct are consistent with the rest of the codebase, especially with respect to serialization and deserialization, as these are common areas where issues arise when modifying data structures.
  • 168-169: New fields removed_node_counters and next_node_counter are added to the NetworkGraph struct. Verify that the logic for managing these counters is correctly implemented throughout the codebase, especially in the context of node removal and addition.
  • 197-197: The max_node_counter field is added to the ReadOnlyNetworkGraph. Ensure that this field is properly maintained and represents the correct maximum value of node_counter across all nodes.
  • 754-761: The ChannelUpdateInfo struct is annotated with repr(C, align(32)). Confirm that the alignment and representation directives are appropriate and that they do not cause any unforeseen issues on different architectures or with FFI boundaries.
  • 847-858: The ChannelInfo struct is annotated with repr(align(128), C). Similar to the previous comment, verify that the alignment and representation directives are appropriate and do not cause issues on different architectures or with FFI boundaries.
  • 898-907: The PartialEq implementation for ChannelInfo is modified. Ensure that all fields that should be compared are included and that this change does not introduce any regressions in areas where ChannelInfo equality checks are performed.
  • 1024-1025: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.
  • 1036-1037: The DirectedChannelInfo struct now includes source_counter and target_counter fields. Verify that these fields are correctly updated and used in routing decisions.
  • 1046-1051: The new method for DirectedChannelInfo is modified to set source_counter and target_counter. Ensure that the logic for determining these values is correct and that it aligns with the intended use of these counters in routing.
  • 1094-1099: The source_counter and target_counter methods are added to DirectedChannelInfo. Verify that these methods are used consistently and correctly throughout the routing logic.
  • 1290-1295: The node_counter field is added to NodeInfo. Ensure that this field is correctly managed throughout the node's lifecycle and that it is consistent with the new vector-based lookup system.
  • 1359-1359: The node_counter field in NodeInfo is initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where NodeInfo is used.
  • 1369-1370: The write method for NetworkGraph now includes a call to test_node_counter_consistency. Verify that this method is correctly implemented and that it does not introduce performance regressions.
  • 1405-1415: The deserialization logic for ChannelInfo and NodeInfo is modified to set node_counter. Ensure that the deserialization process is correct and that the node_counter values are consistent with the serialized data.
  • 1437-1438: The NetworkGraph constructor is modified to initialize removed_node_counters and next_node_counter. Verify that these fields are initialized to the correct values and that the constructor's logic is consistent with the rest of the codebase.
  • 1484-1485: The NetworkGraph constructor is modified to initialize next_node_counter to 0 and removed_node_counters to an empty vector. Confirm that these initial values are correct and that they are handled properly throughout the graph's lifecycle.
  • 1493-1521: The test_node_counter_consistency method is added to NetworkGraph. Verify that this method is correctly implemented and that it is called in appropriate places to ensure the consistency of node_counter values.
  • 1680-1681: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.
  • 1696-1697: The logic for adding a channel between nodes is modified. Verify that the changes are correct and that they do not introduce any regressions in channel management.
  • 1711-1713: The remove_channel_in_nodes method is called within a match arm. Verify that the logic for removing and updating channel information is correct and that it does not introduce any inconsistencies in the network graph.
  • 1723-1727: The node_counter_id array is introduced to manage node_counter values for channels. Verify that this logic is correct and that it properly updates the node_counter values for both nodes associated with a channel.
  • 1832-1833: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.
  • 1862-1862: The remove_channel_in_nodes method is called. Verify that the logic for removing a channel from nodes is correct and that it does not introduce any inconsistencies in the network graph.
  • 1881-1881: The remove_channel_in_nodes method is called within a loop. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.
  • 1890-1890: The logic for removing nodes and updating removed_node_counters is modified. Verify that the changes are correct and that they do not introduce any regressions in node management.
  • 1973-1973: The remove_channel_in_nodes method is called within a loop. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.
  • 2152-2160: The remove_channel_in_nodes method is modified to update removed_node_counters. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.
  • 2218-2222: The max_node_counter method is added to ReadOnlyNetworkGraph. Verify that this method returns the correct maximum value for node_counter and that it is used appropriately throughout the code.
  • 3519-3520: The node_one_counter and node_two_counter fields in ChannelInfo are set to specific values in a test. Verify that these values are appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.
  • 3539-3540: The node_one_counter and node_two_counter fields in ChannelInfo are set to specific values in a test. Verify that these values are appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.
  • 3595-3595: The node_counter field in NodeInfo is set to a specific value in a test. Verify that this value is appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.
  • 3632-3632: The read_network_graph benchmark function is modified. Verify that the benchmark is correctly set up and that it accurately measures the performance of reading the network graph.
  • 3642-3642: The write_network_graph benchmark function is modified. Verify that the benchmark is correctly set up and that it accurately measures the performance of writing the network graph.
lightning/src/routing/scoring.rs (1)
  • 3687-3687: The logic within decay_100k_channel_bounds has been fundamentally changed. Ensure that the new implementation aligns with the intended behavior and that all dependent features are tested with the updated logic.

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 28

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 51d9ee3 and 8d80f82.
Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • lightning/src/ln/features.rs (1 hunks)
  • lightning/src/routing/gossip.rs (33 hunks)
  • lightning/src/routing/router.rs (38 hunks)
  • lightning/src/routing/scoring.rs (1 hunks)
  • lightning/src/util/test_utils.rs (2 hunks)
Files not summarized due to errors (1)
  • lightning/src/routing/router.rs: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/build.yml
  • lightning/src/ln/features.rs
  • lightning/src/routing/gossip.rs
  • lightning/src/util/test_utils.rs

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
@TheBlueMatt
Copy link
Collaborator Author

Rebased.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 7

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 7a35bf8 and c5e0599.
Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • lightning/src/ln/features.rs (1 hunks)
  • lightning/src/routing/gossip.rs (33 hunks)
  • lightning/src/routing/router.rs (38 hunks)
  • lightning/src/routing/scoring.rs (1 hunks)
  • lightning/src/util/test_utils.rs (2 hunks)
Files not summarized due to errors (1)
  • lightning/src/routing/router.rs: Error: Message exceeds token limit
Files not reviewed due to errors (1)
  • (no review received)
Additional comments: 19
.github/workflows/build.yml (6)
  • 86-87: The path and key for caching the routing graph snapshot have been updated to lightning/net_graph-2023-12-10.bin and ldk-net_graph-v0.0.118-2023-12-10.bin, respectively. Ensure that the new binary file is correctly generated and available at the specified URL to prevent CI failures.
  • 91-93: The fetching process for the routing graph snapshot includes a SHA-256 checksum verification step. It's crucial to verify that the EXPECTED_ROUTING_GRAPH_SNAPSHOT_SHASUM matches the actual checksum of the file at https://bitcoin.ninja/ldk-net_graph-v0.0.118-2023-12-10.bin to ensure integrity and prevent potential security issues.
  • 98-98: The EXPECTED_ROUTING_GRAPH_SNAPSHOT_SHASUM is updated. Double-check that this SHA-256 checksum accurately corresponds to the new routing graph snapshot file to ensure the integrity of the downloaded file.
  • 103-104: The path and key for caching the scorer snapshot have been updated to lightning/scorer-2023-12-10.bin and ldk-scorer-v0.0.118-2023-12-10.bin, respectively. Confirm that the new binary file is correctly generated and accessible at the provided URL to avoid CI disruptions.
  • 108-110: The fetching process for the scorer snapshot includes a SHA-256 checksum verification step. It's essential to ensure that the EXPECTED_SCORER_SNAPSHOT_SHASUM matches the actual checksum of the file at https://bitcoin.ninja/ldk-scorer-v0.0.118-2023-12-10.bin to maintain integrity and avert potential security risks.
  • 115-115: The EXPECTED_SCORER_SNAPSHOT_SHASUM is updated. Verify that this SHA-256 checksum correctly matches the new scorer snapshot file to guarantee the integrity of the downloaded file.
lightning/src/routing/gossip.rs (8)
  • 67-67: The NodeId struct is correctly annotated with #[derive(Clone, Copy, PartialEq, Eq)] to ensure it can be easily copied and compared.
  • 168-169: The addition of removed_node_counters and next_node_counter fields to the NetworkGraph struct is consistent with the PR's objective to optimize routing performance by using unique counters for nodes.
  • 197-197: The max_node_counter field in ReadOnlyNetworkGraph struct is a good addition for tracking the maximum node counter value, which is likely used for performance optimizations in routing.
  • 753-761: The use of #[repr(C, align(32))] for ChannelUpdateInfo struct is a smart optimization to ensure that the struct's layout is predictable and aligned for cache performance. This is particularly important for structures that are frequently accessed and modified in performance-critical paths.
  • 846-852: Similarly, the ChannelInfo struct's alignment with #[repr(align(128), C)] is a thoughtful optimization for cache performance. Ensuring that frequently accessed fields are likely to be on the same or adjacent cache lines can significantly impact performance in routing.
  • 1024-1025: The initialization of node_one_counter and node_two_counter to u32::max_value() in ChannelInfo struct's Readable implementation seems to be a placeholder. It's crucial to ensure that these counters are correctly set elsewhere in the code to meaningful values, as using u32::max_value() directly could lead to incorrect behavior or performance issues.
Verification successful

The search results indicate that node_one_counter and node_two_counter are indeed updated from their initial u32::max_value() state in various parts of the code. There are assertions and explicit updates that suggest these counters are managed and set to meaningful values before being used in a way that could affect behavior or performance. This evidence supports the idea that there is a mechanism in place to ensure the counters are not left at their placeholder values throughout the code's execution.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that node counters are correctly assigned and not left at their initial max value state.
rg "node_(one|two)_counter" --type rust

Length of output: 1526

* 1381-1382: The `test_node_counter_consistency` method in `NetworkGraph` is a good practice for ensuring the internal consistency of node counters. This kind of invariant checking is crucial for debugging and maintaining the integrity of the routing graph. * 1537-1537: The `read_only` method in `NetworkGraph` provides a safe way to access a read-only view of the network graph. This is useful for operations that need to read network graph data without the risk of modifying it.
lightning/src/routing/scoring.rs (1)
  • 3687-3687: Consider evaluating the performance impact of the time_passed method within decay_100k_channel_bounds. Frequent updates to channel scores could lead to performance bottlenecks, especially with a large number of channels.

Explore optimizing the decay calculation to reduce computational overhead, possibly by batching updates or adjusting the decay frequency based on network size and activity.

lightning/src/routing/router.rs (4)
  • 1126-1132: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [3-6]

The constants _GRAPH_NODE_SMALL and _GRAPH_NODE_FIXED_SIZE are used in static assertions to enforce a specific layout for RouteGraphNode, as clarified in previous discussions. Ensure that these assertions are present and effectively enforce the intended layout for performance optimization.

  • 1181-1189: The addition of source_node_counter and target_node_counter fields in this context further supports the PR's goal of optimizing routing performance by using unique u32 counters. Good consistency across different structs.
  • 1205-1209: The introduction of source_node_counter for blinded paths aligns with the PR's optimization strategy. Ensure that the assumptions regarding the introduction point's visibility as a public node are valid and clearly documented.
  • 1354-1375: The use of #[inline(always)] and #[inline] attributes is based on careful benchmarking, as previously clarified. Consider adding comments to document the benchmarking results and rationale behind these decisions to aid future maintainers.

lightning/src/util/test_utils.rs Outdated Show resolved Hide resolved
lightning/src/util/test_utils.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@tnull tnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, let me say sorry for taking this long to have a look this, I believe I self-requested review a while back.

I did a first high-level pass and added an initial round of questions. I have to say that I'm close to a concept NACK on this one: the router logic is hard to reason about as it is and we keep discovering bugs here. It seems to me that this PR significantly increases the code complexity and introduces several new angles how things can go wrong. While this seems to work just fine for now, I fear that we'll see more breakage in the router code as a consequence in the future. If we really want to go ahead with this, it would be great if we could find a better abstraction for our newly created data structure that would offer a foolproof API, e.g., so we don't for get to insert/remove reused counters in the corresponding list.

I have yet to run the benchmarks myself to see how much speedup this PR would gain us, but from my first impression I'm not convinced it's worth the increased risks and maintenance costs. Also, it seems that a good chunk of the performance improvements might come from the last few commits alone, which are optimizations that could be applied independently from switching to node counters?


/// Tries to open a network graph file, or panics with a URL to fetch it.
pub(crate) fn get_route_file() -> Result<std::fs::File, &'static str> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: "This means future changes to the scorer's data may
be harder to benchmark": Would an alternative be to keep both versions as separate benchmarks, so we could still benchmark updates impacting the Scorer's data model with synthetic data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I think the random failures model we had before is borderline useless for benchmarking route performance changes. We're somewhat better off doing some kind of conversion from old to new scoring data and then bencharking from there.

///
/// These IDs allow the router to avoid a `HashMap` lookup by simply using this value as an
/// index in a `Vec`, skipping a big step in some of the hottest code when routing.
pub(crate) node_counter: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that storing the counter in here could be pretty dangerous if we or users were to clone the NodeInfo, do something with it and come back, at which point we could have dropped the entry and recreated another NodeInfo with the same counter.

Is this an issue? Do we want store the node_counter as part of a wrapper struct holding both the NodeInfo and the counter? Alternatively, we could make this an Option<u32> and make sure that clone() would reset it to None, asserting we'd have to re-insert/lookup it as a 'fresh' info?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not convinced its an issue. The network graph is publicly read-only - it has several internal consistency requirements (eg each channel has both side nodes in the nodes map) which imply users can't freely edit bits of it (though they're welcome to take parts of it and copy them locally to build their own graphs).

lightning/src/routing/gossip.rs Outdated Show resolved Hide resolved
@@ -1409,15 +1429,42 @@ impl<L: Deref> NetworkGraph<L> where L::Target: Logger {
logger,
channels: RwLock::new(IndexedMap::new()),
nodes: RwLock::new(IndexedMap::new()),
next_node_counter: AtomicUsize::new(0),
removed_node_counters: Mutex::new(Vec::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding these just side-by-side, could we create a new data structure wrapping them and exposing adequate insert/remove API methods so that we'd never forget to, e.g., call removed_node_counters.push(..) whenever we remove a node?

Copy link
Collaborator Author

@TheBlueMatt TheBlueMatt Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, sadly that doesn't improve things much, we end up with a bunch of places where we just replace one removed_node_counters.push(..) with a node_counters.removed_node(...). We can't super easily use a method on the graph to mark a node removed, as we're often doing it with hash map entries already in place from a previous lookup.

Still, test_node_counter_consistency is pretty thorough, so if we have any obvious bugs fuzzing or tests should easily hit assertions in that.

@@ -865,6 +857,24 @@ pub struct ChannelInfo {
/// (which we can probably assume we are - no-std environments probably won't have a full
/// network graph in memory!).
announcement_received_time: u64,

/// The [`NodeInfo::node_counter`] of the node pointed to by [`Self::node_one`].
pub(crate) node_one_counter: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that these counters will be reused, how can we be sure that they won't get outdated, especially for cloned ChannelInfos as mentioned above regarding NodeInfo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should only get reused when a node was fully removed. The network graph already has internal consistency requirements that any channel has both the source and sink nodes in the graph as well. This just relies on that consistency requirement by adding an additional pointer. In terms of ChannelInfos copied and used outside of a specific graph, indeed, they could point nowhere, but that's kinda by definition - the counters are specific to a NetworkGraph, they aren't global in any other sense, and each route finding operation only cares about a single graph and its contained infos.

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
/// public node.
pub(crate) payer_node_counter: u32,
/// A unique ID which describes the first hop counterparty. It will not conflict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the counterparty is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be equal to one is ambiguous in this context (here and below).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this is ambiguous? Its saying that this won't step on the toes of any data in our graph, but it may be equal to some data in our graph if the node is public.

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
@TheBlueMatt
Copy link
Collaborator Author

I did a first high-level pass and added an initial round of questions. I have to say that I'm close to a concept NACK on this one: the router logic is hard to reason about as it is and we keep discovering bugs here. It seems to me that this PR significantly increases the code complexity and introduces several new angles how things can go wrong. While this seems to work just fine for now, I fear that we'll see more breakage in the router code as a consequence in the future. If we really want to go ahead with this, it would be great if we could find a better abstraction for our newly created data structure that would offer a foolproof API, e.g., so we don't for get to insert/remove reused counters in the corresponding list.

Fair, let me encapsulate the node counter logic and remove it from get_route and then we can see how we feel about it.

@TheBlueMatt
Copy link
Collaborator Author

I have yet to run the benchmarks myself to see how much speedup this PR would gain us, but from my first impression I'm not convinced it's worth the increased risks and maintenance costs. Also, it seems that a good chunk of the performance improvements might come from the last few commits alone, which are optimizations that could be applied independently from switching to node counters?

Sadly not. The last few commits reduce the pressure we put on the branch predictor, and improve things a bit on the edges, but the vast majority of the gain here is dropping the hash table lookups. A very large portion of our total routing time is spent just doing hash table lookups directly (we have like 3 or 4 of them we index into in routing - the network graph, gossip data, dist, etc), so dropping one entirely is a huge win.

@TheBlueMatt TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch 3 times, most recently from 75d0c2e to f53cc4c Compare March 20, 2024 00:51
@TheBlueMatt
Copy link
Collaborator Author

Okay, rebased on main. With the new struct I think its not that messy, and now it also lets us simplify some of the blinded path stuff too which I think is nice.

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, I think get_route isn't too much less readable than before now that the encapsulation has been added. I agree with the concerns about complexity generally, though the speedup seems pretty worthwhile.

lightning/src/blinded_path/mod.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/gossip.rs Outdated Show resolved Hide resolved
lightning/src/routing/gossip.rs Outdated Show resolved Hide resolved
@valentinewallace
Copy link
Contributor

IMO, the node_counter changes could be split off to make the PR more focused. At the moment there's a lot bundled in here with the cache updates, more minor get_route optimizations, benchmarking updates and feature bit parsing.

@TheBlueMatt
Copy link
Collaborator Author

Pulled smaller changes into #3103 and #3104.

@TheBlueMatt TheBlueMatt changed the title Use unique per-node "node_counter"s rather than a node hashmap in routing Misc routing optimization Jul 10, 2024
@TheBlueMatt
Copy link
Collaborator Author

Rebased.

@valentinewallace
Copy link
Contributor

CI is sad.

When processing the main loop during routefinding, for each node,
we check whether it happens to be our peer in one of our channels.
This ensures we never fail to find a route that takes a hop through
a private channel of ours, to a private node, then through
invoice-provided route hints to reach the ultimate payee.

Because this is incredibly hot code, doing a full `HashMap` lookup
to check if each node is a first-hop target ends up eating a good
chunk of time during routing. Luckily, we can trivially avoid this
cost.

Because we're already looking up the per-node state in the `dist`
map, we can store a bool in each first-hop target's state, avoiding
the lookup unless we know its going to succeed.

This requires storing a dummy entry in `dist`, which feels somewhat
strange, but is ultimately fine as we should never be looking at
per-node state unless we've already found a path to that node,
updating the fields in doign so.
While LLVM should inline and elide the redundant calls, because the
router is rather large LLVM can decide against inlining in some
cases where it would be an nice win.

Thus, its worth DRY'ing the redundant calls explicitly.
Because we now have some slack space in `PathBuildingHop`, we can
use it to cache some additional hot values. Here we use it to
cache the source and target `node_counter`s for public channels,
effectively prefetching the values from the channel state.
It turns out we spend several percent of our routefinding time just
checking if nodes and channels require unknown features
byte-by-byte. While the cost is almost certainly dominated by the
memory read latency, avoiding doing the checks byte-by-byte should
reduce the branch count slightly, which may reduce the overhead.
Because fetching fields from the `$candidate` often implies an
indirect read, grouping them together may result in one or two
fewer memory loads, so we do so here.
Because we scan per-channel information in the hot inner loop of
our routefinding immediately after looking a channel up in a
`HashMap`, we end up spending a nontrivial portion of our
routefinding time waiting on memory to be read in.

While there is only so much we can do about that, ensuring the
channel information that we care about is sitting on one or
adjacent cache lines avoids paying that penalty twice. Thus, here
we manually lay out `ChannelInfo` and `ChannelUpdateInfo` and set
them to 128b and 32b alignment, respectively. This wastes some
space in memory in our network graph, but improves routing
performance in return.
@TheBlueMatt
Copy link
Collaborator Author

Fixed

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. On the fence about whether this needs a second reviewer so up to you!

lightning/src/routing/router.rs Show resolved Hide resolved
@TheBlueMatt
Copy link
Collaborator Author

It is all pretty trivial, but at least eg the feature optimization and the first hop cache thing could probably use another pair of eyes.

Copy link
Contributor

@tnull tnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, mod one question.

Feel free to ignore nits, as they aren't important.

// dummy entry in dist for each first-hop target, allowing us to do this lookup for
// free since we're already looking at the `was_processed` flag.
//
// Note that all the fields (except `is_first_hop_target`) will be overwritten whenever
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth adding debug_asserts or similar checks to make sure we don't deviate from this assumption?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it would be, but I'm not sure I know how to write such an assertion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could check that if one field is updated, all others are too? But maybe not worth it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure how to check on a per-field basis. The likely failure case is we add another field and fail to update it when relevant, but I'm not aware of a way to iterate over the fields of a struct?

lightning/src/routing/gossip.rs Show resolved Hide resolved
//
// Sadly, this is not possible, however we can still do okay - all of the fields before
// `one_to_two` and `two_to_one` are just under 128 bytes long, so we can ensure they sit on
// adjacent cache lines (which are generally fetched together in x86_64 processors).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// adjacent cache lines (which are generally fetched together in x86_64 processors).
// adjacent cache lines (which are generally fetched together in x86-64 processors).

Copy link
Contributor

@tnull tnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to land

@TheBlueMatt
Copy link
Collaborator Author

Gonna go ahead and land to get this done, but will tackle nits in a quick followup.

@TheBlueMatt TheBlueMatt merged commit ac1463b into lightningdevkit:main Jul 17, 2024
12 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants