Stop using a constant for monitor `update_id`s after closure #3355
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##              main    #3355    +/-   ##
=========================================
  Coverage    89.57%   89.57%
=========================================
  Files          125      129       +4
  Lines       101756   105481    +3725
  Branches    101756   105481    +3725
=========================================
+ Hits         91151    94488    +3337
- Misses        7884     8236     +352
- Partials      2721     2757      +36

☔ View full report in Codecov by Sentry.
Review checkpoint: reviewed till commit c7b12ff
let context = &chan.context();
let logger = WithChannelContext::from(&self.logger, context, None);
log_trace!(logger, "Removing channel {} now that signer is unblocked", context.channel_id());
update_maps_on_chan_removal!(self, peer_state, context);
Doubt: what was the impact of these dangling references on channel removal? Would it lead to ever-increasing memory if there is a lot of churn on channels?
Not completely unbounded - once the `ChannelMonitor` gets pruned (and the node restarted) they drop off. I think that's a fair tradeoff given each `ChannelMonitor` weighs a lot more in-memory (and on disk) than this new struct.
let mut failed_htlcs = Vec::new();
let empty_peer_state = || {
    PeerState {
        channel_by_id: new_hash_map(),
use hashmap_with_capacity?
We don't have a good guesstimate at capacity at this point, we only know the total channel count, not the per-peer channel count. We generally won't have more than one or two channels per peer anyway, so reallocation should be rare.
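For illustration, the tradeoff being weighed here, written against plain `std` maps rather than the crate's own hash-map helpers (a rough sketch, not the PR's actual code):

```rust
use std::collections::HashMap;

type ChannelId = [u8; 32];

// Starting empty: at this point we only know the node-wide channel count, not
// how many channels belong to this particular peer, and most peers have one or
// two channels, so reallocation should be rare anyway.
fn empty_peer_channel_map() -> HashMap<ChannelId, ()> {
    HashMap::new()
}

// Pre-sizing only pays off with a decent per-peer estimate, which we lack.
fn presized_peer_channel_map(per_peer_estimate: usize) -> HashMap<ChannelId, ()> {
    HashMap::with_capacity(per_peer_estimate)
}
```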
Still need to look at the last couple commits.
lightning/src/ln/channelmanager.rs
Outdated
@@ -2753,6 +2753,12 @@ macro_rules! handle_error {
} };
}

/// When a channel is removed, two things need to happen:
/// (a) This must be called in the same `per_peer_state` lock as the channel-closing action,
/// (b) [`ChannelManager::finish_close_channel`] needs to be called unlocked.
Could you specify which lock? Same below.
No locks, basically...
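A minimal sketch of the (a)/(b) pattern these docs describe, using simplified stand-in types (`PeerState`, `finish_close_channel`, and `remove_channel` below are placeholders, not the real LDK definitions):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type ChannelId = [u8; 32];

struct PeerState {
    channel_by_id: HashMap<ChannelId, ()>,
    // ...other per-peer maps that must stay consistent with `channel_by_id`
}

struct Manager {
    per_peer_state: Mutex<PeerState>,
}

impl Manager {
    fn remove_channel(&self, channel_id: ChannelId) {
        {
            // (a) Remove the channel and fix up every related map while still
            //     holding the per-peer lock, so no one ever observes a
            //     half-removed channel (where update_maps_on_chan_removal! runs).
            let mut peer_state = self.per_peer_state.lock().unwrap();
            peer_state.channel_by_id.remove(&channel_id);
        }
        // (b) Finish the closure (events, monitor updates) with no locks held.
        self.finish_close_channel(channel_id);
    }

    fn finish_close_channel(&self, _channel_id: ChannelId) {
        // Placeholder: would generate close events / monitor updates here.
    }
}
```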
Rebased and addressed feedback.
Force-pushed from a5f0571 to 52ef7f8
let context = &chan.context();
let logger = WithChannelContext::from(&self.logger, context, None);
log_trace!(logger, "Removing channel {} now that the signer is unblocked", context.channel_id());
update_maps_on_chan_removal!(self, peer_state, context);
It looks like we don't have test coverage for this. Maybe not too crucial, but if we want to add it, it looks like it'd be as easy as checking that `ChannelManager::outpoint_to_peer` (or another manager map) is empty at the end of `async_signer_tests::test_closing_signed`.
Indeed, probably best to address in an async signing PR. I'll tag this comment there.
Lot to take in here
// override the `update_id`, taking care to handle old monitors where the
// `latest_update_id` is already `u64::MAX`.
let latest_update_id = entry.into_mut();
*latest_update_id = latest_update_id.saturating_add(1);
We don't seem to have coverage of the `u64::MAX` case (same on deser), though that might be tricky to set up in a test :/
Right, that's an upgrade test, so we'd need to go take a dependency on older LDK to build a `u64::MAX` monitor without doing it by hand. We could do it all by hand (building a `ChannelManager` state we think is representative), but I'm somewhat hesitant as I assume it'll end up being wrong in some subtle way....
lightning/src/ln/channelmanager.rs
Outdated
// We shouldn't consider peer entries for any peers that either have no funded
// channels or are disconnected and had some funded channel in the past (and we're
// just keeping the entry around for closed_channel_monitor_update_ids).
peer.total_channel_count() > 0 || (!peer.is_connected && peer.closed_channel_monitor_update_ids.len() > 0)
Missing coverage for this change
That's because this change does nothing lol. `peers_without_funded_channels` will already ignore peers which have no channels (which is what this condition is checking for...). I removed it.
Looks like this wasn't removed?
🤦♂️ forgot to push.
counterparty_node_id,
funding_txo: *funding_txo,
channel_id,
update: monitor_update,
Should we set the `monitor_update.counterparty_node_id` here as well for posterity?
Doesn't matter. It's only used to set the `ChannelMonitor.counterparty_node_id`, which is already set in this branch.
let short_to_chan_info = self.short_to_chan_info.read().unwrap();
short_to_chan_info.get(&prev_hop.short_channel_id).map(|(cp_id, _)| *cp_id)
nit: drop the local variable for a single, chained expression
Chained expressions will lead the eventual rustfmt to be super duper vertical, cause it'll insist on `read` and `unwrap` each getting their own line.
YMMV, but personally I'd find it more readable at a glance than when having `short_to_chan_info` repeated three times.
Yea, it's kinda bad either way, rustfmt just kinda backs us into a corner here. I personally find this way at least somewhat readable because we are taking a lock, and naming the lock the same as the mutex seems... fine?
Right, I don't mind that. More so the density of the code.
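For reference, the two styles under discussion, written against a stand-in `RwLock<HashMap<..>>` rather than the real `short_to_chan_info` type (an illustrative sketch, not the PR's code):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type NodeId = [u8; 33];
type ChannelId = [u8; 32];
type ShortToChanInfo = RwLock<HashMap<u64, (NodeId, ChannelId)>>;

// Style used in the PR: name the guard after the lock, then do the lookup.
fn counterparty_via_local(map: &ShortToChanInfo, scid: u64) -> Option<NodeId> {
    let short_to_chan_info = map.read().unwrap();
    short_to_chan_info.get(&scid).map(|(cp_id, _)| *cp_id)
}

// Chained alternative: fewer names, but rustfmt will put `read`, `unwrap`,
// `get`, and `map` on separate lines, making it quite vertical.
fn counterparty_via_chain(map: &ShortToChanInfo, scid: u64) -> Option<NodeId> {
    map.read().unwrap().get(&scid).map(|(cp_id, _)| *cp_id)
}
```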
@@ -1299,6 +1299,13 @@ pub(super) struct PeerState<SP: Deref> where SP::Target: SignerProvider {
/// will remove a preimage that needs to be durably in an upstream channel first), we put an
/// entry here to note that the channel with the key's ID is blocked on a set of actions.
actions_blocking_raa_monitor_updates: BTreeMap<ChannelId, Vec<RAAMonitorUpdateBlockingAction>>,
/// The latest [`ChannelMonitor::get_latest_update_id`] value for all closed channels as they
/// exist on-disk/in our [`chain::Watch`]. This *ignores* all pending updates not yet applied
s/ignores/excludes ?
Don't they mean the same thing here? Not sure what the difference is you're suggesting needs to be captured.
Either is fine, I suppose. "ignores" seems more like an action whereas "excludes" seems more to describe a state.
Right, we're not taking any active actions, we're just ignoring as a side effect of not doing anything, so that feels like a state?
`update_maps_on_chan_removal` is used to perform `ChannelManager` state updates when a channel is being removed, prior to dropping the `peer_state` lock. In a future commit we'll use it to update fields in the `per_peer_state`, but in order to do so we'll need to have access to that state in the macro. Here we get set up for this by passing the per-peer state to `update_maps_on_chan_removal`, which is sadly a fairly large patch.
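A toy illustration of why the macro needs the locked per-peer state passed in; the field and macro body below are hypothetical stand-ins, not the actual `channelmanager.rs` code:

```rust
use std::collections::HashMap;

type ChannelId = [u8; 32];

struct PeerState {
    // Stand-in for whatever per-peer maps the macro now has to keep in sync.
    channels_needing_cleanup: HashMap<ChannelId, ()>,
}

macro_rules! update_maps_on_chan_removal {
    ($self:expr, $peer_state:expr, $channel_id:expr) => {
        // With `$peer_state` in scope, the macro can mutate per-peer fields
        // directly rather than only the `ChannelManager`-wide maps on `$self`.
        $peer_state.channels_needing_cleanup.remove(&$channel_id);
    };
}

fn demo(peer_state: &mut PeerState, channel_id: ChannelId) {
    let _manager = ();
    update_maps_on_chan_removal!(_manager, peer_state, channel_id);
}
```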
When a channel is closed, we have to call `update_maps_on_chan_removal` in the same per-peer-state lock as the removal of the `ChannelPhase` object. We forgot to do so in `ChannelManager::signer_unblocked` leaving dangling references to the channel. We also take this opportunity to include more context in the channel-closure log in `ChannelManager::signer_unblocked` and add documentation to `update_maps_on_chan_removal` and `finish_close_channel` to hopefully avoid this issue in the future.
In 453ed11 we started tracking the counterparty's `node_id` in `HTLCPreviousHopData`, however we were still trying to look it up using `prev_short_channel_id` in `claim_funds_from_hop`. Because we now usually have the counterparty's `node_id` directly accessible, we should skip the `prev_short_channel_id` lookup. This will also be more important in the next commit where we need to look up state for our counterparty to generate `ChannelMonitorUpdate`s whether we have a live channel or not.
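Roughly, the lookup-order change described here, using simplified stand-ins for `HTLCPreviousHopData` and the `short_to_chan_info` map (not the real LDK types):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type NodeId = [u8; 33];
type ChannelId = [u8; 32];

struct PrevHopData {
    counterparty_node_id: Option<NodeId>,
    short_channel_id: u64,
}

fn counterparty_for_claim(
    prev_hop: &PrevHopData, short_to_chan_info: &RwLock<HashMap<u64, (NodeId, ChannelId)>>,
) -> Option<NodeId> {
    // Prefer the node_id recorded when the HTLC was forwarded (0.0.124+); only
    // fall back to the SCID map for older HTLCs, where the channel may or may
    // not still be live.
    prev_hop.counterparty_node_id.or_else(|| {
        let map = short_to_chan_info.read().unwrap();
        map.get(&prev_hop.short_channel_id).map(|(cp_id, _)| *cp_id)
    })
}
```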
Instead of first building a map from peers to a list of channels then pulling out of that to build the `per_peer_state`, we build `per_peer_state` immediately and store channels in it immediately. This avoids an unnecessary map indirection but also gives us access to the new fields in `per_peer_state` when reading `Channel`s which we'll need in a coming commit.
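A sketch of that read-path restructuring with placeholder types (the real code works over the deserialized `Channel`s and the full `PeerState`): channels are inserted straight into the per-peer map as they are read, instead of first collecting a peer-to-channel-list map.

```rust
use std::collections::HashMap;

type NodeId = [u8; 33];
type ChannelId = [u8; 32];

#[derive(Default)]
struct PeerState {
    channel_by_id: HashMap<ChannelId, Channel>,
    // new per-peer fields needed while deserializing channels would live here too
}

struct Channel {
    counterparty_node_id: NodeId,
    channel_id: ChannelId,
}

fn read_channels(channels: Vec<Channel>) -> HashMap<NodeId, PeerState> {
    let mut per_peer_state: HashMap<NodeId, PeerState> = HashMap::new();
    for chan in channels {
        // Insert directly into the peer's entry as each channel is read, so the
        // per-peer state is available immediately, with no intermediate map.
        per_peer_state
            .entry(chan.counterparty_node_id)
            .or_default()
            .channel_by_id
            .insert(chan.channel_id, chan);
    }
    per_peer_state
}
```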
Force-pushed from 52ef7f8 to 55976a8
    return None;
}

if prev_hop.counterparty_node_id.is_none() {
As discussed offline, it's worth looking into whether the checks below should also apply to HTLC receives as well as HTLC forwards.
Mmm, you're right we do need to consider it, but only in unclaimed payments.
If we're doing a re-claim-on-startup there's only two cases and I think we're safe in both:
- pre-#3322 (Stop relying on ChannelMonitor persistence after manager read), and for old claims, we call directly into the `ChannelMonitor` with `provide_payment_preimage_unsafe_legacy`,
- post-#3322 we use `impl From<&MPPClaimHTLCSource> for HTLCClaimSource`, which always sets `counterparty_node_id`.
Force-pushed from 55976a8 to 938d2a2
I believe I've addressed all the feedback except for missing test coverage.
for (payment_hash, payment) in claimable_payments.iter() {
    for htlc in payment.htlcs.iter() {
        if htlc.prev_hop.counterparty_node_id.is_none() {
            log_error!(args.logger,
Wouldn't there be value in doing more of the checks that we're doing above for forwards, like checking the monitor claimable balances and the `short_to_chan_info` map and failing read if we know we're going to panic? Wondering if there's a way to chain these htlcs to the `monitor.get_all_current_outbound_htlcs()` above to DRY the checks.
Mmm, I was thinking they don't apply because we don't need to check if we need to replay a claim (we know it's pending and may claim it when the user wants to), but indeed we should match the "if the channel is still open, just error log and continue anyway" logic, so I did that. I don't think there's much room to DRY them given it's just one of the many checks in the forwarding block.
Looks good mod outstanding feedback and Jeff taking a look
fn update_persisted_channel(
    &self, funding_txo: OutPoint, update: Option<&ChannelMonitorUpdate>,
    monitor: &ChannelMonitor<ChannelSigner>,
) -> chain::ChannelMonitorUpdateStatus {
    const LEGACY_CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;
Could we avoid defining this in two places?
// monitor updates still in flight. In that case, we shouldn't
// immediately free, but instead let that monitor update complete
// in the background.
Is this comment still relevant?
Yea, the behavior hasn't changed, we just no longer debug-assert that the new monitor update is already in-flight.
Force-pushed from 19e90b2 to 7a2b272
LGTM. Feel free to squash, IMO.
Currently we store in-flight `ChannelMonitorUpdate`s in the per-peer structure in `ChannelManager`. This is nice and simple as we're generally updating it when we're updating other per-peer data, so we already have the relevant lock(s) and map entries. Sadly, when we're claiming an HTLC against a closed channel, we didn't have the `counterparty_node_id` available until it was added in 0.0.124 (and now we only have it for HTLCs which were forwarded in 0.0.124). This means we can't look up the per-peer structure when claiming old HTLCs, making it difficult to track the new `ChannelMonitorUpdate` as in-flight. While we could transition the in-flight `ChannelMonitorUpdate` tracking to a new global map indexed by `OutPoint`, doing so would result in a major lock which would be highly contended across channels with different peers. Instead, as we move towards tracking in-flight `ChannelMonitorUpdate`s for closed channels we'll keep our existing storage, leaving only the `counterparty_node_id` issue to contend with. Here we simply accept the issue, requiring that `counterparty_node_id` be available when claiming HTLCs against a closed channel. On startup, we explicitly check for any forwarded HTLCs which came from a closed channel where the forward happened prior to 0.0.124, failing to deserialize, or logging a warning if the channel is still open (implying things may work out, but panics may occur if the channel closes prior to HTLC resolution). While this is a somewhat disappointing resolution, LDK nodes which forward HTLCs are generally fairly well-upgraded, so it is not anticipated to be an issue in practice.
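In rough pseudocode (placeholder signature and messages, not the actual read-time code), the startup check described above behaves like:

```rust
type NodeId = [u8; 33];

// What to do with a previously-forwarded HTLC whose inbound edge predates
// counterparty_node_id tracking (illustrative only).
fn check_forwarded_htlc(
    counterparty_node_id: Option<NodeId>, channel_still_open: bool,
) -> Result<(), &'static str> {
    if counterparty_node_id.is_some() {
        return Ok(());
    }
    if channel_still_open {
        // Things may work out if the HTLC resolves before the channel closes,
        // so only warn; a later claim against a closed channel could panic.
        eprintln!("forwarded HTLC is missing its counterparty node id");
        return Ok(());
    }
    // With the channel already closed we cannot build the ChannelMonitorUpdate
    // needed to claim upstream, so refuse to deserialize the ChannelManager.
    Err("forwarded HTLC from a pre-0.0.124 closed channel")
}
```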
In the next commit we'll drop the magic `u64::MAX` `ChannelMonitorUpdate::update_id` value used when we don't know the `ChannelMonitor`'s `latest_update_id` (i.e. when the channel is closed). In order to do so, we will store further information about `ChannelMonitor`s in the per-peer structure, keyed by the counterparty's node ID, which will be used when applying `ChannelMonitorUpdate`s to closed channels. By taking advantage of the change in the previous commit, that information is now reliably available when we generate the `ChannelMonitorUpdate` (when claiming HTLCs), but in order to ensure it is available when applying the `ChannelMonitorUpdate` we need to use `BackgroundEvent::MonitorUpdateRegeneratedOnStartup` instead of `BackgroundEvent::ClosedMonitorUpdateRegeneratedOnStartup` where possible. Here we do this, leaving `ClosedMonitorUpdateRegeneratedOnStartup` only used to ensure very old channels (created in 0.0.118 or earlier) which are not in the `ChannelManager` are force-closed on startup.
Because `ChannelManager` doesn't have a corresponding `Channel` after the channels are closed, we'd always used an `update_id` of `u64::MAX` for any `ChannelMonitorUpdate`s we need to build after the channel is closed. This completely breaks the abstraction of `update_id`s and leaks into persistence logic - because we might have more than one `ChannelMonitorUpdate` with the same (`u64::MAX`) value, suddenly instead of being able to safely use `update_id` as IDs, the `MonitorUpdatingPersister` has to have special logic to handle this. Worse, because we don't have a unique ID with which to refer to post-close `ChannelMonitorUpdate`s we cannot track when they complete async persistence. This means we cannot properly support async persist for forwarded payments where the inbound edge has hit the chain prior to the preimage coming to us. Here we rectify this by using consistent `update_id`s even after a channel has closed. In order to do so we have to keep some state for all channels for which the `ChannelMonitor` has not been archived (after which point we can be confident we will not need to update them). While this violates our long-standing policy of having no state at all in `ChannelManager`s for closed channels, it's only a `(ChannelId, u64)` pair per channel, so shouldn't be problematic for any of our users (as they already store a whole honkin `ChannelMonitor` for these channels anyway). While limited changes are made to the connection-count-limiting logic, reviewers should carefully analyze the interactions the new map created here has with that logic.
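A minimal sketch of the per-peer bookkeeping this describes: a map from closed channels to the latest monitor `update_id`, bumped (with saturation, to tolerate legacy monitors already at `u64::MAX`) whenever a post-close `ChannelMonitorUpdate` is built. Struct and method names here are illustrative, not the exact LDK fields.

```rust
use std::collections::BTreeMap;

type ChannelId = [u8; 32];

#[derive(Default)]
struct ClosedChannelUpdateIds {
    latest_update_ids: BTreeMap<ChannelId, u64>,
}

impl ClosedChannelUpdateIds {
    /// Record the monitor's update_id when the channel is removed.
    fn channel_closed(&mut self, channel_id: ChannelId, latest_monitor_update_id: u64) {
        self.latest_update_ids.insert(channel_id, latest_monitor_update_id);
    }

    /// Reserve the next update_id for a post-close ChannelMonitorUpdate.
    fn next_update_id(&mut self, channel_id: &ChannelId) -> Option<u64> {
        self.latest_update_ids.get_mut(channel_id).map(|id| {
            // Saturating add handles legacy monitors already at u64::MAX.
            *id = id.saturating_add(1);
            *id
        })
    }
}
```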
If a peer creates a channel with us which never reaches the funding stage (or never gets any commitment updates after creation), we'll avoid inserting the `update_id` into `closed_channel_monitor_update_ids` at runtime to avoid keeping a `PeerState` entry around for no reason. However, on startup we still create a `ChannelMonitorUpdate` with a `ChannelForceClosed` update step to ensure the `ChannelMonitor` is locked and shut down. This is pretty redundant, and results in a bunch of on-startup `ChannelMonitorUpdate`s for any old but non-archived `ChannelMonitor`s. Instead, here, we check if a `ChannelMonitor` already saw a `ChannelForceClosed` update step before we generate the on-startup `ChannelMonitorUpdate`. This also allows us to skip the `closed_channel_monitor_update_ids` insertion as we can be confident we'll never have a `ChannelMonitorUpdate` for this channel at all.
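The resulting startup decision, sketched with a stand-in for however the real `ChannelMonitor` records that it already processed a `ChannelForceClosed` update step (names are hypothetical):

```rust
// Placeholder view of a ChannelMonitor's relevant state.
struct MonitorView {
    saw_force_close: bool,
}

fn needs_startup_force_close_update(monitor: &MonitorView, known_to_manager: bool) -> bool {
    // Only generate the on-startup ChannelMonitorUpdate if the ChannelManager no
    // longer tracks the channel *and* the monitor hasn't already been shut down
    // by a ChannelForceClosed update; otherwise the update would be redundant and
    // would also force us to keep a closed_channel_monitor_update_ids entry around.
    !known_to_manager && !monitor.saw_force_close
}
```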
Force-pushed from 7a2b272 to 0bf2975
Squashed without further changes.
Force-pushed from 82771bb to 4156259
Sadly, this requires some very annoying upgrade semantics - users cannot upgrade directly from 0.0.123 and earlier to 0.1 as long as they have pending forwarded HTLCs which have not yet been resolved. Instead, all HTLCs must be resolved (and any new HTLCs must have been forwarded using LDK 0.0.124) prior to the upgrade. I'm not particularly happy about this, but in practice I think it's okay - those using LDK with forwarding are generally pretty active and upgrade quickly, or at least have the engineering chops to figure out the upgrade pattern (and could backport the inclusion of `counterparty_node_id` in HTLC forwarding).