Ensure successful message propagation in case of disconnection mid-handshake #2725
This PR makes the following interpretation of the issue and follows the solution accordingly:
If my interpretation of the problem is wrong, do let me know and I shall be glad to correct it! :)
Also, this PR has been set to draft because the tests are incomplete and only partially test the added code.
What's the status here? You have it marked draft, do you want feedback? What kind of feedback/review is this ready for?
Hi, @TheBlueMatt. I am having trouble preparing a test for the case where the peer reconnects in time and the open channel message is therefore sent to it, because that conflicts with how the rest of the test codebase is set up. So I wanted to get a general Approach ACK before I go about hacking the tests to make them work.
Updated from pr2725.01 -> pr2725.02 (diff) Changes:
Note:
Sorry for the delay here, busy with thanksgiving travel and other stuff.
Updated from pr2725.03 -> pr2725.04 (diff) Changes:
Thank you very much, @TheBlueMatt, for this new idea to solve this problem.
We still need to resend |
You are right! That was an oversight from my side. Thank you for pointing it out.
You are right. However, the goal of the PR is to ensure proper execution of the |
I don't think we necessarily care about that. What we really want is to end up with a funded channel (within a reasonable timeout) if a user requests one while being able to handle the counterparty disconnecting mid-handshake.
Typically nodes forget all about channels before sending/receiving |
Thanks, @wpaulino, for the details about the message transmissions! Seems like it's worth considering extending the PR from fixing the original issue to not failing channel creation mid-handshake due to channel disconnection. I am tinkering with an approach, and I shall update the PR very soon!
Update: Okay, so I have figured out an approach, but this depends on the behavior changes introduced in #2760. We can track the list of msg_events we have sent during the handshake process, which can be used in case a peer disconnects midway. Once the funding is signed, we graduate the channel from However, currently, in the main, we graduate the channel as soon as we have created the funding.

let (chan: Channel<SP>, msg_opt) = match peer_state.channel_by_id.remove(temporary_channel_id) {
	...
},

Since #2760 is already getting approval and will soon be merged, I shall build this new approach over the changes introduced there.
I don't think we need to explicitly track which message we're ready to send to our counterparty - if we are disconnected from a peer, then reconnect prior to funding, we have to restart from the |
Updated from pr2725.04 -> pr2725.05 (diff) Updates:
Logic:
-> Follow the standard handshake routine.
-> If we disconnect mid-handshake from our peer (that is, OutboundV1Channel is not resolved to a funded channel), we don't immediately close the OutboundV1Channel.
-> Instead, we track how long it has been since we disconnected from peers.
-> If we connect back within time, we rebroadcast SendOpenChannel corresponding to OutboundV1Channel to the peer.
-> If we do not connect back within N (=2) timer ticks, we force close and remove the channel.
Note:
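The logic listed above can be sketched as a small state machine. This is only an illustration, not the code in this PR: `UnfundedOutbound`, `Action`, and `DISCONNECT_TICK_LIMIT` are hypothetical names, and closing exactly on the Nth tick is an assumption about how "within N timer ticks" is counted.

```rust
/// Hypothetical tick limit; the description above uses N = 2.
const DISCONNECT_TICK_LIMIT: u32 = 2;

#[derive(Debug, PartialEq)]
enum Action {
    /// Nothing to do yet.
    Wait,
    /// The peer came back in time: send the open-channel message again.
    ResendOpenChannel,
    /// The peer stayed away too long: force close and remove the channel.
    CloseChannel,
}

/// Toy stand-in for an outbound channel that has not been funded yet.
struct UnfundedOutbound {
    peer_connected: bool,
    ticks_disconnected: u32,
}

impl UnfundedOutbound {
    fn new() -> Self {
        Self { peer_connected: true, ticks_disconnected: 0 }
    }

    /// Do not close on disconnect; only start counting ticks.
    fn on_peer_disconnected(&mut self) {
        self.peer_connected = false;
        self.ticks_disconnected = 0;
    }

    /// Called once per timer tick.
    fn on_timer_tick(&mut self) -> Action {
        if self.peer_connected {
            return Action::Wait;
        }
        self.ticks_disconnected += 1;
        if self.ticks_disconnected >= DISCONNECT_TICK_LIMIT {
            Action::CloseChannel
        } else {
            Action::Wait
        }
    }

    /// Called when the peer connects; restarts the handshake if it had dropped.
    fn on_peer_connected(&mut self) -> Action {
        let was_disconnected = !self.peer_connected;
        self.peer_connected = true;
        self.ticks_disconnected = 0;
        if was_disconnected {
            Action::ResendOpenChannel
        } else {
            Action::Wait
        }
    }
}
```

In this sketch a disconnect followed by one tick and a reconnect yields `ResendOpenChannel`, while a disconnect followed by two ticks yields `CloseChannel`.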
Basically LGTM. One comment. Note that the test fixes in the last commit will need to get squashed into the commit that broke tests. We require (but don't actually check in CI) that each individual commit builds and passes tests.
Updated from pr2725.06 -> pr2725.07 (diff) Update:
Codecov Report
Attention:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2725      +/-   ##
==========================================
- Coverage   89.14%   89.14%   -0.01%
==========================================
  Files         116      116
  Lines       93205    93186      -19
  Branches    93205    93186      -19
==========================================
- Hits        83089    83066      -23
- Misses       7583     7587       +4
  Partials     2533     2533
Updated from pr2725.10 -> pr2725.11 (diff)
@shaavan, the updates you've made in response to the comments seem to be well-detailed and focused on improving the PR's clarity and functionality. It's good to see that you've expanded the tests to cover the new behavior thoroughly. Regarding the mention of (\( ⁰⊖⁰)/) |
Review Status
Actionable comments generated: 4
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Additional comments: 6
lightning/src/ln/functional_tests.rs (3)
- 3727-3737: The logic for handling disconnection before funding transaction broadcast is clear. However, the test seems to be disabled. Confirm if this is intentional and if so, provide a reason or a TODO comment for future enablement.
- 10513-10561: The new test `test_channel_close_when_not_timely_accepted` is well-structured and seems to cover the scenario it's designed for. However, ensure that the test is enabled and verify that it passes in the test suite.
- 10563-10607: The test `test_rebroadcast_open_channel_when_reconnect_mid_handshake` appears to correctly simulate the scenario of a peer disconnecting and reconnecting mid-handshake. Verify that the test is enabled and that it passes in the test suite.
lightning/src/ln/channelmanager.rs (3)
- 895-897: The logic here checks if any channel is in the `Funded` or `UnfundedOutboundV1` phase. Ensure that this logic aligns with the intended behavior of the `is_live` function, especially considering the new `UnfundedOutboundV1` state.
- 8876-8877: The `UnfundedOutboundV1` channel phase is set to always return true, which implies that these channels are considered live even if the peer is disconnected. Confirm that this behavior is consistent with the overall system logic and that it won't lead to any unexpected side effects.
- 9028-9032: The addition of logic to push a `SendOpenChannel` message for `UnfundedOutboundV1` channels is consistent with the PR's objective to allow rebroadcasting if the peer reconnects. Ensure that the `get_open_channel` function generates the correct message and that this behavior is tested.
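To make the two `channelmanager.rs` notes concrete, here is a rough sketch of per-phase handling on disconnect and reconnect. It is not the actual `ChannelManager` code: `Phase`, `MsgEvent`, `retain_on_disconnect`, and `events_on_reconnect` are hypothetical stand-ins, and dropping unfunded inbound channels on disconnect is an assumption made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified channel phases.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Funded,
    UnfundedOutboundV1,
    UnfundedInboundV1,
}

/// Hypothetical message event queued for the peer.
#[derive(Debug, PartialEq)]
enum MsgEvent {
    SendOpenChannel { channel_id: u32 },
}

/// On disconnect, keep funded channels and unfunded outbound channels.
/// Dropping unfunded inbound channels is an assumption of this sketch.
fn retain_on_disconnect(channels: &mut BTreeMap<u32, Phase>) {
    channels.retain(|_, phase| match phase {
        Phase::Funded => true,
        Phase::UnfundedOutboundV1 => true,
        Phase::UnfundedInboundV1 => false,
    });
}

/// On reconnect, queue an open-channel message for every unfunded outbound
/// channel that survived the disconnect. Funded channels are out of scope here.
fn events_on_reconnect(channels: &BTreeMap<u32, Phase>) -> Vec<MsgEvent> {
    channels
        .iter()
        .filter_map(|(id, phase)| match phase {
            Phase::UnfundedOutboundV1 => {
                Some(MsgEvent::SendOpenChannel { channel_id: *id })
            }
            _ => None,
        })
        .collect()
}
```

With one channel in each phase, the disconnect step leaves two entries and the reconnect step yields a single `SendOpenChannel` for the unfunded outbound channel.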
Looks good other than some comment re-phrasing.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Files skipped from review as they are similar to previous changes (2)
- lightning/src/ln/channelmanager.rs
- lightning/src/ln/functional_tests.rs
@TheBlueMatt |
Updated from pr2725.12 -> pr2725.13 (diff)
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- lightning/src/ln/channelmanager.rs
Additional comments: 6
lightning/src/ln/functional_tests.rs (6)
- 3727-3739: The logic for handling peer disconnection before funding is broadcasted seems to correctly simulate the disconnection and checks for the expected closure reasons. However, it's important to ensure that the `UNFUNDED_CHANNEL_AGE_LIMIT_TICKS` constant is appropriately defined and used across the test to simulate the timeout accurately.
  Ensure `UNFUNDED_CHANNEL_AGE_LIMIT_TICKS` is defined with a value that accurately represents the intended timeout duration for the test scenario.
- 10515-10559: The test `test_channel_close_when_not_timely_accepted` simulates a scenario where peers disconnect mid-handshake, and the channel is not timely accepted. The test setup and the disconnection simulation are correctly implemented. However, the assertion that checks the channel's state after disconnection (line 10534) and the assertion for the channel's closure (line 10550) are critical to validate the intended behavior. It's essential to ensure that these assertions accurately reflect the expected state changes in the system under test.
  Verify that the assertions accurately reflect the expected outcomes and that the test covers all relevant scenarios for the feature being tested.
- 10561-10604: The test `test_rebroadcast_open_channel_when_reconnect_mid_handshake` correctly simulates a peer disconnection and reconnection mid-handshake. The test ensures that the `SendOpenChannel` message is rebroadcast upon reconnection (lines 10598-10603). This behavior aligns with the PR's objective to improve the robustness of the channel handshake process. However, it's crucial to verify that the rebroadcast logic is implemented as intended in the actual system code and not just within the test environment.
  Confirm that the rebroadcast logic for the `SendOpenChannel` message upon peer reconnection is correctly implemented in the system code and not solely within the test.
- 10762-10764: The introduction of the test `test_close_in_funding_batch` aims to ensure that if one channel in a batch closes, the entire batch is closed. This test is crucial for validating the robustness of batch processing in channel funding. It's important to ensure that the test setup correctly simulates the batch funding scenario and that the logic for triggering a channel close within the batch is accurately implemented.
  Ensure the test accurately simulates batch funding scenarios and correctly implements the logic for closing a channel within a batch.
- 10788-10820: The logic within `test_close_in_funding_batch` for force-closing a channel and verifying the closure of all channels in the batch (lines 10794-10820) is critical for ensuring the intended behavior of batch processing. The assertions and checks (lines 10797-10803, 10805-10809, and 10811-10818) are essential for validating the state of the system after a force-close operation. It's important to verify that these checks accurately reflect the expected outcomes and that the test covers all relevant scenarios for batch processing in channel funding.
  Verify that the assertions and checks within the test accurately reflect the expected outcomes for batch processing in channel funding and that all relevant scenarios are covered.
- 10820-10820: The final assertion in `test_close_in_funding_batch` that checks for the immediate closure of all channels in the batch upon a single channel's force-close (line 10820) is a key part of validating the intended behavior. However, it's crucial to ensure that this behavior aligns with the system's design and that the test accurately reflects the real-world scenario it intends to simulate.
  Confirm that the immediate closure of all channels in a batch upon a single channel's force-close aligns with the system's design and that the test accurately simulates this scenario.
- Do not remove the channel immediately on peer_disconnect; instead, remove it after some time if the peer doesn't reconnect soon (handled in previous commit).
- Do not mark the peer ok_to_remove if we have some OutboundV1Channels too.
- Rebroadcast SendOpenChannel for the OutboundV1Channel when the peer reconnects.
- Update the relevant tests to account for the behavior change.
- Repurpose test_disconnect_in_funding_batch to test that all channels in the batch close when one of them closes.

- The first test makes sure that the OutboundV1Channel is not immediately removed when peers disconnect, but is removed after N timer ticks.
- The second test makes sure that SendOpenChannel is rebroadcast for the OutboundV1Channel if the peer reconnects within time.
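The ok_to_remove point in the commit message above can be illustrated with a small predicate. This is a sketch under assumptions, not the real peer-state code: `PeerState` and its fields are hypothetical names chosen for the example.

```rust
/// Hypothetical, simplified view of what we track per peer.
struct PeerState {
    is_connected: bool,
    funded_channels: usize,
    unfunded_outbound_channels: usize,
}

impl PeerState {
    /// A disconnected peer may only be dropped when nothing is left that
    /// could still need it, including outbound channels that are waiting
    /// for the peer to reconnect so the handshake can be restarted.
    fn ok_to_remove(&self) -> bool {
        !self.is_connected
            && self.funded_channels == 0
            && self.unfunded_outbound_channels == 0
    }
}
```

Without the last condition, a disconnected peer with only an unfunded outbound channel would be dropped, and there would be nothing left to rebroadcast to on reconnect.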
Updated from pr2725.13 -> pr2725.14 (diff)
Updated from pr2725.14 -> pr2725.15 Range-diff
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- lightning/src/ln/channelmanager.rs
Additional comments: 7
lightning/src/ln/functional_tests.rs (7)
- 3698-3698: The test description for `test_peer_disconnected_before_funding_broadcasted` is clear and sets the context well for what the test aims to achieve. However, ensure that the test implementation fully covers the scenario described, including both the disconnection and the failure to reconnect within the specified time.
- 3727-3739: The logic to simulate peer disconnection before funding is broadcasted and to check the channel closure with the appropriate `ClosureReason` is implemented correctly. However, consider adding a comment explaining why `UNFUNDED_CHANNEL_AGE_LIMIT_TICKS` is used to simulate the passage of time and its significance in the context of this test.
- 10517-10557: The test `test_channel_close_when_not_timely_accepted` correctly simulates a scenario where a peer disconnects mid-handshake and checks the state of channels and peer state after a specified time has passed. This test effectively covers the new behavior introduced in the PR. Ensure that the constants used, like `UNFUNDED_CHANNEL_AGE_LIMIT_TICKS`, are well-documented and their values are justified within the context of this test.
- 10560-10598: The test `test_rebroadcast_open_channel_when_reconnect_mid_handshake` accurately simulates the scenario of peer disconnection and reconnection during the handshake process. It checks that the `SendOpenChannel` message is rebroadcast upon reconnection, aligning with the PR's objectives. This test is well-structured and covers the critical functionality introduced. Ensure that the test includes assertions for the state of both nodes after reconnection to fully validate the rebroadcast logic.
- 10756-10756: The introduction of `test_close_in_funding_batch` aims to test the behavior when one of the channels in a batch closes. This is a good addition to ensure that batch processing of channel closures behaves as expected. However, the test description could be expanded to detail the expected behavior of the batch closure process for clarity.
- 10782-10813: In `test_close_in_funding_batch`, the logic to force-close a channel and check the resulting state, including monitor updates and message events, is implemented correctly. This test effectively validates the behavior when a channel in a funding batch is closed. Ensure that the test also verifies the state of other channels in the batch to confirm that they are affected as expected by the batch closure process.
- 10814-10814: The assertion that all channels in the batch should close immediately after one channel is force-closed is a critical part of `test_close_in_funding_batch`. This ensures that the batch processing logic is working as intended. Consider adding more detailed assertions to verify the closure reasons for each channel in the batch to ensure they align with the expected outcomes.
		_ => panic!("Unexpected message."),
	}

	// We broadcast the commitment transaction as part of the force-close.
Heh, this is kinda dumb, maybe we should fix that, but it's not super critical and certainly unrelated to this PR.
Super interested in understanding the issue here! And probably might give it a try if it's not a super biggie!
v0.0.123 - May 08, 2024 - "BOLT12 Dust Sweeping"

API Updates
===========

* To reduce risk of force-closures and improve HTLC reliability the default dust exposure limit has been increased to `MaxDustHTLCExposure::FeeRateMultiplier(10_000)`. Users with existing channels might want to consider using `ChannelManager::update_channel_config` to apply the new default (lightningdevkit#3045).
* `ChainMonitor::archive_fully_resolved_channel_monitors` is now provided to remove from memory `ChannelMonitor`s that have been fully resolved on-chain and are now not needed. It uses the new `Persist::archive_persisted_channel` to inform the storage layer that such a monitor should be archived (lightningdevkit#2964).
* An `OutputSweeper` is now provided which will automatically sweep `SpendableOutputDescriptor`s, retrying until the sweep confirms (lightningdevkit#2825).
* After initiating an outbound channel, a peer disconnection no longer results in immediate channel closure. Rather, if the peer is reconnected before the channel times out LDK will automatically retry opening it (lightningdevkit#2725).
* `PaymentPurpose` now has separate variants for BOLT12 payments, which include fields from the `invoice_request` as well as the `OfferId` (lightningdevkit#2970).
* `ChannelDetails` now includes a list of in-flight HTLCs (lightningdevkit#2442).
* `Event::PaymentForwarded` now includes `skimmed_fee_msat` (lightningdevkit#2858).
* The `hashbrown` dependency has been upgraded and the use of `ahash` as the no-std hash table hash function has been removed. As a consequence, LDK's `Hash{Map,Set}`s no longer feature several constructors when LDK is built with no-std; see the `util::hash_tables` module instead. On platforms that `getrandom` supports, setting the `possiblyrandom/getrandom` feature flag will ensure hash tables are resistant to HashDoS attacks, though the `possiblyrandom` crate should detect most common platforms (lightningdevkit#2810, lightningdevkit#2891).
* `ChannelMonitor`-originated requests to the `ChannelSigner` can now fail and be retried using `ChannelMonitor::signer_unblocked` (lightningdevkit#2816).
* `SpendableOutputDescriptor::to_psbt_input` now includes the `witness_script` where available as well as new proprietary data which can be used to re-derive some spending keys from the base key (lightningdevkit#2761, lightningdevkit#3004).
* `OutPoint::to_channel_id` has been removed in favor of `ChannelId::v1_from_funding_outpoint` in preparation for v2 channels with a different `ChannelId` derivation scheme (lightningdevkit#2797).
* `PeerManager::get_peer_node_ids` has been replaced with `list_peers` and `peer_by_node_id`, which provide more details (lightningdevkit#2905).
* `Bolt11Invoice::get_payee_pub_key` is now provided (lightningdevkit#2909).
* `Default[Message]Router` now take an `entropy_source` argument (lightningdevkit#2847).
* `ClosureReason::HTLCsTimedOut` has been separated out from `ClosureReason::HolderForceClosed` as it is the most common case (lightningdevkit#2887).
* `ClosureReason::CooperativeClosure` is now split into `{Counterparty,Locally}Initiated` variants (lightningdevkit#2863).
* `Event::ChannelPending::channel_type` is now provided (lightningdevkit#2872).
* `PaymentForwarded::{prev,next}_user_channel_id` are now provided (lightningdevkit#2924).
* Channel init messages have been refactored towards V2 channels (lightningdevkit#2871).
* `BumpTransactionEvent` now contains the channel and counterparty (lightningdevkit#2873).
* `util::scid_utils` is now public, with some trivial utilities to examine short channel ids (lightningdevkit#2694).
* `DirectedChannelInfo::{source,target}` are now public (lightningdevkit#2870).
* Bounds in `lightning-background-processor` were simplified by using `AChannelManager` (lightningdevkit#2963).
* The `Persist` impl for `KVStore` no longer requires `Sized`, allowing for the use of `dyn KVStore` as `Persist` (lightningdevkit#2883, lightningdevkit#2976).
* `From<PaymentPreimage>` is now implemented for `PaymentHash` (lightningdevkit#2918).
* `NodeId::from_slice` is now provided (lightningdevkit#2942).
* `ChannelManager` deserialization may now fail with `DangerousValue` when LDK's persistence API was violated (lightningdevkit#2974).

Bug Fixes
=========

* Excess fees on counterparty commitment transactions are now included in the dust exposure calculation. This lines behavior up with some cases where transaction fees can be burnt, making them effectively dust exposure (lightningdevkit#3045).
* `Future`s used as an `std::...::Future` could grow in size unbounded if it was never woken. For those not using async persistence and using the async `lightning-background-processor`, this could cause a memory leak in the `ChainMonitor` (lightningdevkit#2894).
* Inbound channel requests that fail in `ChannelManager::accept_inbound_channel` would previously have stalled from the peer's perspective as no `error` message was sent (lightningdevkit#2953).
* Blinded path construction has been tuned to select paths more likely to succeed, improving BOLT12 payment reliability (lightningdevkit#2911, lightningdevkit#2912).
* After a reorg, `lightning-transaction-sync` could have failed to follow a transaction that LDK needed information about (lightningdevkit#2946).
* `RecipientOnionFields`' `custom_tlvs` are now propagated to recipients when paying with blinded paths (lightningdevkit#2975).
* `Event::ChannelClosed` is now properly generated and peers are properly notified for all channels that as a part of a batch channel open fail to be funded (lightningdevkit#3029).
* In cases where user event processing is substantially delayed such that we complete multiple round-trips with our peers before a `PaymentSent` event is handled and then restart without persisting the `ChannelManager` after having persisted a `ChannelMonitor[Update]`, on startup we may have `Err`d trying to deserialize the `ChannelManager` (lightningdevkit#3021).
* If a peer has relatively high latency, `PeerManager` may have failed to establish a connection (lightningdevkit#2993).
* `ChannelUpdate` messages broadcasted for our own channel closures are now slightly more robust (lightningdevkit#2731).
* Deserializing malformed BOLT11 invoices may have resulted in an integer overflow panic in debug builds (lightningdevkit#3032).
* In exceedingly rare cases (no cases of this are known), LDK may have created an invalid serialization for a `ChannelManager` (lightningdevkit#2998).
* Message processing latency handling BOLT12 payments has been reduced (lightningdevkit#2881).
* Latency in processing `Event::SpendableOutputs` may be reduced (lightningdevkit#3033).

Node Compatibility
==================

* LDK's blinded paths were inconsistent with other implementations in several ways, which have been addressed (lightningdevkit#2856, lightningdevkit#2936, lightningdevkit#2945).
* LDK's messaging blinded paths now support the latest features which some nodes may begin relying on soon (lightningdevkit#2961).
* LDK's BOLT12 structs have been updated to support some last-minute changes to the spec (lightningdevkit#3017, lightningdevkit#3018).
* CLN v24.02 requires the `gossip_queries` feature for all peers, however LDK by default does not set it for those not using a `P2PGossipSync` (e.g. those using RGS). This change was reverted in CLN v24.02.2 however for now LDK always sets the `gossip_queries` feature. This change is expected to be reverted in a future LDK release (lightningdevkit#2959).

Security
========

0.0.123 fixes a denial-of-service vulnerability which we believe to be reachable from untrusted input when parsing invalid BOLT11 invoices containing non-ASCII characters.

* BOLT11 invoices with non-ASCII characters in the human-readable-part may cause an out-of-bounds read attempt leading to a panic (lightningdevkit#3054). Note that all BOLT11 invoices containing non-ASCII characters are invalid.

In total, this release features 150 files changed, 19307 insertions, 6306 deletions in 360 commits since 0.0.121 from 17 authors, in alphabetical order:

* Arik Sosman
* Duncan Dean
* Elias Rohrer
* Evan Feenstra
* Jeffrey Czyz
* Keyue Bao
* Matt Corallo
* Orbital
* Sergi Delgado Segura
* Valentine Wallace
* Willem Van Lint
* Wilmer Paulino
* benthecarman
* jbesraa
* olegkubrakov
* optout
* shaavan
Resolves #2096