Correctly handle lost MonitorEvents #3984
base: main
Conversation
👋 Thanks for assigning @valentinewallace as a reviewer!
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3984 +/- ##
==========================================
- Coverage 88.76% 88.34% -0.43%
==========================================
Files 176 177 +1
Lines 128521 132093 +3572
Branches 128521 132093 +3572
==========================================
+ Hits 114088 116695 +2607
- Misses 11844 12728 +884
- Partials 2589 2670 +81
I think this is safe to backport, with the one exception of making
Broke part out in #4004, so drafting this until that lands.
Tagging 0.2 since this now resolves a regression (deliberately) introduced in #4004 where we generate tons of duplicate `PaymentSent`/`PaymentFailed` events on restart.
Force-pushed from 037a1d7 to a791a4d (Compare)
Generally seems to make sense, will take a closer look shortly
/// [`Self::CommitmentSecret`] update which clears the payment information from all un-revoked
/// counterparty commitment transactions.
ReleasePaymentComplete {
	htlc: SentHTLCId,
nit: s/htlc/htlc_id
I mean isn't it clear from the type?
Not as clear as it could be at the callsites, IMO, since this field isn't actually an HTLC. Not a big deal though.
Force-pushed from a791a4d to c0ab9e3 (Compare)
Event::PaymentPathSuccessful { payment_hash, .. } => {
	assert_eq!(payment_hash, Some(hash_b));
},
Just to confirm my understanding, this event could still potentially get lost?
Should be rare, but it's possible, yea. Those events are intended primarily for scorer-updating, so we don't guarantee them as tightly. Will make sure it's documented clearly as a part of #3186.
lightning/src/ln/payment_tests.rs (Outdated)
// After reload, the ChannelManager identified the failed payment and queued up the
// PaymentSent and corresponding ChannelMonitorUpdate to mark the payment handled, but
// while processing the pending `MonitorEvent`s (which were not processed before the
// monitor was persisted) we will end up with a duplicate ChannelMonitorUpdate.
expect_payment_sent(&nodes[0], payment_preimage, None, true, false);
// PaymentSent (or not, if `persist_manager_post_event` resulted in us detecting we
// already did that) and corresponding ChannelMonitorUpdate to mark the payment
// handled, but while processing the pending `MonitorEvent`s (which were not processed
// before the monitor was persisted) we will end up with a duplicate
// ChannelMonitorUpdate.
This is all one sentence and I find it a bit incomprehensible... Is there any way to break it up and/or reword?
It's pretty readable if you skip the parentheticals, but they also provide important context, so I'm somewhat inclined to leave it and call it a reader issue - first skip the parentheticals, then re-read with them to get the context.
@@ -16601,8 +16601,7 @@ where
 		}
 
 		if is_channel_closed {
-			for (htlc_source, (htlc, _)) in monitor.get_pending_or_resolved_outbound_htlcs()
-			{
+			for (htlc_source, (htlc, _)) in monitor.get_all_current_outbound_htlcs() {
The previous code would also check HTLCs in holder commitment txs, depending on what was actually confirmed onchain. I think the current behavior is correct based on the comment on fail_unbroadcast_htlcs, but was it incorrect before?
I dunno if either is really "incorrect", but I don't think it matters. For an outbound HTLC, they'll first appear on counterparty transactions, so we always get them when they first go out either way. On the failing/claiming end they'll first be removed in the local commitment, then the remote one, so we'll always get a superset by looking at the remote commitment.
false
};
if !have_action {
	self.handle_post_event_actions(ev_completion_action);
Missing test coverage for this case
Hmm? If I assert that `ev_completion_action` `is_none` or `is_some` here I see a number of tests fail.
pending_events.iter().any(|(_, act)| act.as_ref() == Some(&action))
};
if !have_action {
	self.handle_post_event_actions([action]);
Missing test coverage
It's not actually clear to me how to reach this. In the claim path, we don't generate a `PaymentPathSuccessful` event in the `claim_htlc` method but rather wait until it's irrevocable, leaving us to hit the don't-have-and-didn't-generate-event case rather trivially. In the fail path, we have to wait until all HTLCs are irrevocably failed, so we don't call `fail_htlc` until that point, at which time we ~always generate a `PaymentPathFailed` event (see the end of `fail_htlc`). There are cases where it doesn't generate an event, but only in cases where we have redundant failures. The easy way I see to hit that is doing a second restart after handling the event but not persisting the monitor. Sadly, the result of that is we just re-add the pending payment on startup and end up back where we started. I don't want to remove the code, but it does seem unreachable to me.
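For readers following along, here is a minimal, hedged sketch of the dedup pattern being discussed in this thread, with generic stand-ins for LDK's event and completion-action types; `maybe_handle_action` is an illustrative name, not LDK's actual API.

// Only run the post-event completion action if no already-queued event will run the
// same action when it completes; otherwise we'd handle the action twice.
fn maybe_handle_action<E, A: PartialEq>(
	pending_events: &[(E, Option<A>)],
	action: A,
	handle: impl FnOnce(A),
) {
	let have_action = pending_events.iter().any(|(_, act)| act.as_ref() == Some(&action));
	if !have_action {
		handle(action);
	}
}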
lightning/src/ln/channelmanager.rs (Outdated)
// This should mostly only happen on startup, but it is possible to hit it in
// rare cases where a MonitorUpdate is replayed after restart because a
// ChannelMonitor wasn't persisted after it was applied (even though the
// ChannelManager was).
Added:
assert!(self.background_events_processed_since_startup.load(Ordering::Acquire));
here and all tests passed. So I think we don't have coverage of the "normal" startup case that's being outlined here?
Actually, it's not clear to me that this is reachable - #3984 (comment)
Ah, okay, if it's unreachable, I don't think it should be described as the "normal" case in the comment.
let events = node.node.get_and_clear_pending_events();
if conditions.from_mon_update {
	check_added_monitors(node, 1);
Pre-existing, but generally for our monitor update testing it would be nice if we could check that the particular `ChannelMonitorUpdateStep` is as-expected.
Yea, there's definitely opportunity to improve our monitor-update checking logic...this PR is big enough though.
let during_startup =
	!self.background_events_processed_since_startup.load(Ordering::Acquire);
if during_startup {
	let event = BackgroundEvent::MonitorUpdateRegeneratedOnStartup {
Missing test coverage
I believe this is unreachable for the same reason as in #3984 (comment) - we can't get here from a failure, and the claim pipeline only gets run post-startup.
Force-pushed from c0ab9e3 to ea275d2 (Compare)
Can we squash?
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), and then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload.

In previous commits we ensured that HTLC resolutions which came to the `ChannelManager` via a `MonitorEvent` were replayed on startup if the `MonitorEvent` was lost. However, in cases where the `ChannelManager` was so stale that it didn't have the payment state for an HTLC at all, we only re-add it in cases where `ChannelMonitor::get_pending_or_resolved_outbound_htlcs` includes it. Constantly re-adding a payment state and then failing it would generate lots of noise for users on startup (not to mention the risk of confusing stale payment events for the latest state of a payment when the `PaymentId` has been reused to retry a payment). Thus, `get_pending_or_resolved_outbound_htlcs` does not include state for HTLCs which were resolved on chain with a preimage, or HTLCs which were resolved on chain with a timeout after `ANTI_REORG_DELAY` confirmations.

These criteria match the criteria for generating a `MonitorEvent`, and work great under the assumption that `MonitorEvent`s are reliably delivered. However, if they are not, and our `ChannelManager` is lost or substantially old (or, in a future where we do not persist the `ChannelManager` at all), we will not end up seeing payment resolution events for an HTLC. Instead, we really want to tell our `ChannelMonitor`s when the resolution of an HTLC is complete.

Note that we don't particularly care about non-payment HTLCs, as there is no re-hydration of state to do there - `ChannelManager` load ignores forwarded HTLCs coming back from `get_pending_or_resolved_outbound_htlcs` as there's nothing to do - we always attempt to replay the success/failure and figure out if it mattered based on whether there was still an HTLC to claim/fail.

Here we take the first step towards that notification, adding a new `ChannelMonitorUpdateStep` for the completion notification, and tracking HTLCs which make it to the `ChannelMonitor` in such updates in a new map.
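To make the shape of that first step concrete, here is a rough sketch pieced together from the snippets quoted elsewhere in this review. `SentHTLCId` is a stand-in type, the real enum carries many more variants, and the exact tracking structure is an assumption, so treat this as illustrative rather than LDK's actual definitions.

// Stand-in for LDK's outbound-HTLC identifier type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct SentHTLCId(u64);

#[derive(Clone, Debug, PartialEq, Eq)]
enum ChannelMonitorUpdateStep {
	// ...existing variants elided...
	/// Notes that the user has fully handled the resolution of this outbound HTLC (via a
	/// `PaymentSent`/`PaymentFailed` event), so the monitor can stop asking for it to be
	/// replayed on startup.
	ReleasePaymentComplete { htlc: SentHTLCId },
}

// The `ChannelMonitor` then tracks which HTLC resolutions have been released this way;
// a set keyed by the HTLC id is shown here, but the concrete structure is an assumption.
struct ReleasedResolutions {
	released: std::collections::HashSet<SentHTLCId>,
}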
Here we prepare to generate the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, adding a new `PaymentCompleteUpdate` struct to track the new update before we generate the `ChannelMonitorUpdate`, and passing it through to the right places in `ChannelManager`. The only cases where we want to generate the new update are after a `PaymentSent` or `PaymentFailed` event, when the event was the result of a `MonitorEvent` or the equivalent read during startup.
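A correspondingly hedged sketch of that intermediate struct, based on the fields visible later in this review; the `htlc` field and the concrete key/outpoint types here are assumptions standing in for the real `PublicKey`, `OutPoint`, and `SentHTLCId` types.

// Stand-ins for the real types so the sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PublicKey([u8; 33]);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct OutPoint { txid: [u8; 32], index: u16 }
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SentHTLCId(u64);

/// Records a payment resolution that has been surfaced to the user as an event and now
/// needs a `ReleasePaymentComplete` monitor update against the channel it resolved on.
#[derive(Clone, Debug, PartialEq, Eq)]
struct PaymentCompleteUpdate {
	counterparty_node_id: PublicKey,
	channel_funding_outpoint: OutPoint,
	// Assumed field: which HTLC's `PaymentSent`/`PaymentFailed` handling this completes.
	htlc: SentHTLCId,
}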
Force-pushed from ea275d2 to c188686 (Compare)
Squashed with only rustfmt fixes.
LGTM, just one comment I could use some clarification on
#[derive(Clone, Debug, PartialEq, Eq)]
pub(crate) struct PaymentCompleteUpdate {
	counterparty_node_id: PublicKey,
	channel_funding_outpoint: OutPoint,
It looks like we only need to track this because of legacy code for `MonitorUpdateRegeneratedOnStartup` when downgrading, but IIUC that shouldn't be reachable via `PaymentCompleteUpdate` because we just drop it on downgrades.
Close, but the `funding_txo` from `MonitorUpdateRegeneratedOnStartup` also makes its way into `in_flight_monitor_updates`, where it's used to persist the legacy in-flight-updates field. Sadly, even 0.1 uses that version.
Here we begin generating the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, updating functional tests for the new `ChannelMonitorUpdate`s where required.
Here we, finally, begin actually using the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, skipping re-hydration of pending payments once they have been fully resolved through to a user `Event`.
When a payment was sent and ultimately completed through an on-chain HTLC claim which we discover during startup, we deliberately break the payment tracking logic to keep it around forever, declining to send a `PaymentPathSuccessful` event but ensuring that we don't constantly replay the claim on every startup. However, we now have logic to complete a claim by marking it as completed in a `ChannelMonitor` rather than replaying information about the claim on every startup. Thus, we no longer need to take that conservative stance and can correctly replay claims, generating `PaymentPathSuccessful` events and allowing the state to be removed.
`startup_replay` should always exactly mirror `background_events_processed_since_startup`, so we should just use that rather than having a dedicated argument for it.
Force-pushed from c188686 to 77026c9 (Compare)
Fixed comment about reachability:
$ git diff-tree -U1 c18868602 77026c9d6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index c52b6edf2b..ff1f835cc3 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -8018,6 +8018,5 @@ where
// complete it here as the failure was duplicative - we've already handled it.
- // This should mostly only happen on startup, but it is possible to hit it in
- // rare cases where a MonitorUpdate is replayed after restart because a
- // ChannelMonitor wasn't persisted after it was applied (even though the
- // ChannelManager was).
+ // This can happen in rare cases where a MonitorUpdate is replayed after
+ // restart because a ChannelMonitor wasn't persisted after it was applied (even
+ // though the ChannelManager was).
// For such cases, we also check that there's no existing pending event to
Beta build is failing on CI.
This PR fixes lost `MonitorEvent`s, first in two commits which handle effectively replaying them on startup (first the preimage case, then the timeout case), and then... doing so via a new `ChannelMonitorUpdate`.