Peer Storage (Part 3): Identifying Lost Channel States #3897
Conversation
👋 Thanks for assigning @tnull as a reviewer!
Force-pushed from a35566a to 4c9f3c3.
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##              main    #3897      +/-   ##
============================================
- Coverage    88.91%   88.69%    -0.23%
============================================
  Files          173      173
  Lines       123393   123735      +342
  Branches    123393   123735      +342
============================================
+ Hits        109717   109741       +24
- Misses       11216    11624      +408
+ Partials      2460     2370       -90
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
@wpaulino what do you think we should reasonably cut here to reduce the size of a ChannelMonitor, without making the emergency-case ChannelMonitors so different from the regular ones that we'd induce more code changes across channelmonitor.rs? Obviously we should avoid counterparty_claimable_outpoints, but how much code is gonna break in doing so?
Not too familiar with the goals here, but if the idea is for the emergency-case ChannelMonitor to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even counterparty_claimable_outpoints (without the sources though), since the counterparty could broadcast a revoked commitment.
Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints, so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints, but I think that's the goal.
@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up, but that's okay.
Yes, eventually we'll need to figure out what can be stripped from ChannelMonitors. In CLN, we use a separate struct that holds only the minimal data needed to reconstruct a ChannelMonitor, just enough to grab funds or penalise the peer.
🔔 1st Reminder Hey @tnull! This PR has been waiting for your review.
Took a first look, but will hold off on going into more detail until we've decided which way we should go with the ChannelMonitor stub.
lightning/src/ln/channelmanager.rs (Outdated)
},

Err(e) => {
panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but really no reason to refuse to at least keep other potential channels operational.
Yes, that makes sense, I think we should only panic if we have determined that we have lost some channel state.
Few more comments; let's move forward without blocking on the ChannelMonitor serialization stuff.
lightning/src/chain/chainmonitor.rs (Outdated)
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing monitors.iter().skip(random_usize % monitors.len()).next()).
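As a self-contained sketch of that idiom (the map types, the pick_random_monitor name, and the random_usize input are placeholders for illustration, not LDK API):

use std::collections::HashMap;

// Pick a pseudo-random entry by skipping a random number of iterator
// positions, as suggested above. Returns None for an empty map.
fn pick_random_monitor<K, V>(monitors: &HashMap<K, V>, random_usize: usize) -> Option<(&K, &V)> {
    if monitors.is_empty() {
        return None;
    }
    monitors.iter().skip(random_usize % monitors.len()).next()
}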
Did a ~first pass.
This needs a rebase now, in particular now that #3922 landed.
lightning/src/chain/chainmonitor.rs (Outdated)
@@ -810,10 +813,53 @@ where
}

fn send_peer_storage(&self, their_node_id: PublicKey) {
// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

static MAX_PEER_STORAGE_SIZE: usize = 65000;
This should be a const rather than a static, I think? Also, it would probably make sense to add this at the module level, with some docs.
Also isn't the max size 64 KiB, not 65K?
Oh, my bad, thanks for pointing this out. It should be 65531.
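For reference, a sketch of the module-level constant being suggested; the doc-comment wording is illustrative, only the 65531 value comes from the discussion above:

/// Maximum number of bytes of serialized ChannelMonitor data to pack into a
/// single peer storage blob (65531, i.e. just under the 64 KiB message limit).
const MAX_PEER_STORAGE_SIZE: usize = 65531;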
lightning/src/chain/chainmonitor.rs (Outdated)
let mut curr_size = 0;

// Randomising Keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Can we make this a bit cleaner by using the proposed iterator skipping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?
I will address this in the next fixup.
Seems this is still unaddressed?
Gentle ping here.
Created a BTreeSet to keep track of the monitors and ran a while loop until we either ran out of peer storage capacity or all monitors were stored.
lightning/src/chain/chainmonitor.rs (Outdated)
monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
panic!("Can not write monitor for {}", mon.monitor.channel_id())
Really, please avoid these explicit panics in any of this code.
This would panic only if there is some issue with write_util, which suggests that we are unable to serialise the ChannelMonitor. Should we just log the error here instead?
Mhh, while I'd still prefer to never panic in any of the peer storage code, indeed this would likely only get hit if we run out of memory or similar. If we keep it, let's just expect on the write_internal call, which avoids all this matching.
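For illustration, a stand-alone version of the pattern being suggested; write_blob is a hypothetical helper, and the point is just that a single expect replaces a match whose Err arm only panics:

use std::io::{self, Write};

// Hypothetical serialization helper standing in for the monitor write call.
fn write_blob(buf: &mut Vec<u8>, data: &[u8]) -> io::Result<()> {
    buf.write_all(data)
}

fn main() {
    let mut buf = Vec::new();
    // Instead of matching on the Result and panicking in the Err arm,
    // `expect` keeps the (hopefully unreachable) failure path to one line.
    write_blob(&mut buf, b"serialized monitor bytes").expect("failed to serialize monitor");
    assert!(!buf.is_empty());
}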
Force-pushed from 676afbc to 1111f05.
lightning/src/chain/chainmonitor.rs (Outdated)
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..core::mem::size_of::<usize>()].try_into().unwrap());
nit: Let's keep the length in a separate const, to avoid making this line overly long/noisy.
Fixed, thanks!
lightning/src/chain/chainmonitor.rs (Outdated)
match write_util_internal(&chan_mon, true, &mut ser_chan) {
Ok(_) => {
// Adding size of peer_storage_monitor.
curr_size += ser_chan.0.serialized_length()
If we move this check below constructing PeerStorageMonitorHolder, can we just use its serialized_length implementation instead of tallying up the individual fields here?
My bad, thanks for pointing this out. Fixed.
lightning/src/ln/channelmanager.rs (Outdated)
},
None => {
// TODO: Figure out if this channel is so old that we have forgotten about it.
panic!("Lost a channel {}", &mon_holder.channel_id);
Why panic here? This should be the common case, no?
We need this when we’ve completely forgotten about a channel but it’s still active. The only false positives occur if we’ve closed a channel and then forgotten about it. Any ideas on how we could detect those?
We need this when we’ve completely forgotten about a channel but it’s still active.
I don't think this is accurate? Wouldn't we end up here simply in this scenario:
- We have a channel, store an encrypted backup with a peer
- The peer goes offline
- We close the channel, drop it from the per_peer_state
- The peer comes back online
- We retrieve and decrypt a peer backup with a channel state that we no longer know anything about.
- We panic and crash the node
This can happen very easily / is the common case, no? Also note this same flow could intentionally get abused by a malicious peer, too. The counterparty would simply need to guess that we include a certain channel in the stored blob, remember it, wait until the channel is closed, and then simply replay the old backup to crash our node.
I still maintain we should never panic in any of this PeerStorage related code, maybe mod the above case where we're positive that our local state is outdated compared to the backed-up version.
That said, we'll indeed need to discern between the two cases (unexpected data loss / expected data loss). One way to do this would be to first check whether we see a funding transaction spend, although note that this would be an async flow that we can't just directly initiate here. Or we need to maintain a list of all closed channels and remember it forever. The latter might indeed be the simplest and safest way, if we're fine with keeping the additional data. In any case, we need to find a solution here before moving on, as panicking here is unacceptable.
Yes, I agree, this is a false positive where the node crashes, but it shouldn’t. The happy path here is:
- We have a channel with a peer.
- Something happens and we lose the channel (but it is unclosed); the node cannot identify it either.
- We retrieve the ChannelMonitor from peer storage.
- We realise that this is an unknown, old channel, and the node crashes because it no longer has any record of an active channel.
There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?
Yes, I agree, this is a false positive where the node crashes
We cannot tolerate a false positive if we're going to panic.
There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?
There is no way to do this in LDK today, and I don't think we'll ever get one. We don't have some global way to query if an arbitrary UTXO is unspent (well, we do in gossip but it's not always available), and once a channel is closed and the funds swept we may delete the ChannelMonitor because there's no reason to keep it around. Keeping around information about every past channel we've ever had just to detect a lost channel here doesn't seem worth it to me.
It does mean we won't detect if we lose all of our state, but I think that's okay - a user will notice that their entire balance is missing :)
Yes, it should not be there. I put up the TODO so that we can have a discussion around this.
Changing the panic to log_debug.
lightning/src/ln/our_peer_storage.rs (Outdated)

impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
(0, channel_id, required),
(1, counterparty_node_id, required),
For new objects, you should be able to make all fields even-numbered as they are required from the get-go, basically.
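A sketch of what that would look like for the struct quoted above; the extra fields follow the holder used elsewhere in this PR, and the exact numbering is only illustrative:

impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
    (0, channel_id, required),
    (2, counterparty_node_id, required),
    (4, min_seen_secret, required),
    (6, monitor_bytes, required),
});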
Fixed.
lightning/src/ln/our_peer_storage.rs (Outdated)
/// wrapped inside [`PeerStorage`].
///
/// [`PeerStorage`]: crate::ln::msgs::PeerStorage
pub(crate) struct PeerStorageMonitorHolderList {
I think we should be able to avoid this new type wrapper if we add an impl_writeable_for_vec entry for PeerStorageMonitorHolder in lightning/src/util/ser.rs.
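Roughly what such an entry might look like; the macro name is taken from the comment above, so treat this as an assumption about the exact ser.rs form rather than final code:

// In lightning/src/util/ser.rs: serialize Vec<PeerStorageMonitorHolder>
// directly, making the PeerStorageMonitorHolderList wrapper unnecessary.
impl_writeable_for_vec!(crate::ln::our_peer_storage::PeerStorageMonitorHolder);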
Seems this is still unaddressed?
Yes, sorry, I will address this in the next fixup.
Fixed
🔔 1st Reminder Hey @TheBlueMatt! This PR has been waiting for your review.
Looks like you forgot to push after addressing @tnull's last comments? Also needs a rebase.
Force-pushed from 6dede0e to c9baefb.
Oh sorry, my bad, I have pushed the fixups and rebased.
I think if we just remove the is_stub check and the spurious panic we can land this (feel free to squash fixups, IMO). Sadly we will need to add a cfg flag before 0.2 (unless we can nail down the serialization format we want), but it doesn't have to hold up this PR or progress.
self.lockdown_from_offchain.write(writer)?;
self.holder_tx_signed.write(writer)?;
if !is_stub {
For now let's not do anything differently here; I don't think skipping the onchain_tx_handler is safe, but we're not quite sure what the right answer is yet. Because we won't (yet) finalize the serialization format, we should also cfg-tag out the actual send logic in send_peer_storage, sadly.
Should I not use is_stub inside the function, or should I remove the parameter from the function definition?
Also, should I make a new cfg tag for peer-storage or use an existing one?
Should I not use is_stub, inside the function or should I remove the parameter from the function definition.
I think it's fine to add it with a TODO: figure out which fields should go here and which do not.
Also, should I make a new cfg tag for peer-storage or use an existing?
I believe a new one (you'll have to add it in ci/ci-tests.sh as well)
Cool, I will cfg-tag the sending logic. In the next PR we can discuss when to send peer storage along with what to omit from ChannelMonitors.
Force-pushed from f95f238 to 4304853.
'PeerStorageMonitorHolder' is used to wrap a single ChannelMonitor; here we are adding some fields separately so that we do not need to read the whole ChannelMonitor to identify whether we have lost some states. `PeerStorageMonitorHolderList` is used to keep the list of all the channels which would be sent over the wire inside Peer Storage.
Fixed formatting for write() in ChannelMonitorImpl. This would make the next commit cleaner by ensuring it only contains direct code shifts, without unrelated formatting changes.
Force-pushed from 4304853 to e76f538.
Force-pushed from e76f538 to 75a8fd6.
I think this is fine now, after the notes here are fixed and rustfmt is fixed.
Force-pushed from 75a8fd6 to aa08357.
Create a utility function to prevent code duplication while writing ChannelMonitors. Serialise them inside ChainMonitor::send_peer_storage and send them over. Cfg-tag the sending logic because we are unsure of what to omit from ChannelMonitors stored inside peer-storage.
Deserialise the ChannelMonitors and compare the data to determine if we have lost some states.
Node should now determine lost states using retrieved peer storage.
Force-pushed from aa08357 to 662fff1.
🔔 1st Reminder Hey @tnull! This PR has been waiting for your review.
🔔 2nd Reminder Hey @tnull! This PR has been waiting for your review.
Cargo.toml (Outdated)
@@ -68,4 +68,5 @@ check-cfg = [
"cfg(splicing)",
"cfg(async_payments)",
"cfg(simple_close)",
"cfg(peer_storage)",
nit: Indentation is off.
Sorry, need to fix my indentation configuration. Fixed.
lightning/src/chain/chainmonitor.rs (Outdated)
@@ -47,6 +48,8 @@ use crate::types::features::{InitFeatures, NodeFeatures};
use crate::util::errors::APIError;
use crate::util::logger::{Logger, WithContext};
use crate::util::persist::MonitorName;
#[allow(unused_imports)]
nit: Please rather cfg-gate these imports, too.
Fixed.
lightning/src/chain/chainmonitor.rs (Outdated)
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

#[cfg(peer_storage)]
Why not cfg-gate the whole message then? This way it might get really confusing what is and isn't part of the cfg flag.
Ok, for now, cfg-tagging fn send_peer_storage and all the call sites of this function.
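A generic illustration of gating both the function and a call site behind the flag; the function and caller names here are made up for the example, only the peer_storage cfg comes from the thread:

#[cfg(peer_storage)]
fn send_peer_storage() {
    // ... build and queue the PeerStorage message ...
}

fn on_timer_tick() {
    // The call site compiles away entirely when built without `--cfg peer_storage`.
    #[cfg(peer_storage)]
    send_peer_storage();
}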
lightning/src/chain/chainmonitor.rs (Outdated)
{
const MAX_PEER_STORAGE_SIZE: usize = 65531;
const USIZE_LEN: usize = core::mem::size_of::<usize>();
let random_usize = usize::from_le_bytes(random_bytes[0..USIZE_LEN].try_into().unwrap());
nit: Let's just avoid the unnecessary unwrap here:
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 1681d0187..ba7613f8e 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -816,12 +816,13 @@ where
#[allow(unused_mut)]
let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new();
let random_bytes = self.entropy_source.get_secure_random_bytes();
-
#[cfg(peer_storage)]
{
const MAX_PEER_STORAGE_SIZE: usize = 65531;
const USIZE_LEN: usize = core::mem::size_of::<usize>();
- let random_usize = usize::from_le_bytes(random_bytes[0..USIZE_LEN].try_into().unwrap());
+ let mut usize_bytes = [0u8; USIZE_LEN];
+ usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]);
+ let random_usize = usize::from_le_bytes(usize_bytes);
let mut curr_size = 0;
let monitors = self.monitors.read().unwrap();
// Randomising Keys in the HashMap to fetch monitors without repetition.
Thanks for this cleanup. Fixed.
let mut ser_chan = VecWriter(Vec::new());
let min_seen_secret = mon.monitor.get_min_seen_secret();
let counterparty_node_id = mon.monitor.get_counterparty_node_id();
let chan_mon = mon.monitor.inner.lock().unwrap();
Can we put this lock and the write call into a separate scope so we drop the lock ASAP again?
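A small self-contained example of the scoping being asked for, using a plain std Mutex in place of the real ChannelMonitorImpl lock:

use std::sync::Mutex;

fn main() {
    let inner = Mutex::new(vec![1u8, 2, 3]);
    let mut serialized = Vec::new();
    {
        // Hold the lock only for the duration of the write; the guard is
        // dropped at the end of this scope.
        let guard = inner.lock().unwrap();
        serialized.extend_from_slice(&guard);
    }
    // The mutex is already unlocked here, before any further work.
    assert_eq!(serialized, vec![1, 2, 3]);
}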
Yes, thanks for pointing this out. Fixed.
@@ -17117,38 +17165,62 @@ mod tests {

#[test]
#[rustfmt::skip]
#[cfg(peer_storage)]
#[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\
Rather than using should_panic here (where we can't discern when it's actually happening), can we rather use std::panic::catch_unwind to catch the specific panic on the specific call and then use assert to check it's the correct error?
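A minimal stand-alone example of the suggested pattern; the panic message is shortened here, while the real test would assert on the full "Lost channel state..." string:

use std::panic;

fn main() {
    // Catch the panic from the specific call instead of marking the whole
    // test #[should_panic], so we know exactly which statement panicked.
    let result = panic::catch_unwind(|| {
        panic!("Lost channel state for channel 42");
    });
    let payload = result.unwrap_err();
    // panic! with a string literal yields a &str payload; formatted panics yield String.
    let msg = payload
        .downcast_ref::<&str>()
        .copied()
        .or_else(|| payload.downcast_ref::<String>().map(String::as_str))
        .unwrap_or("");
    assert!(msg.contains("Lost channel state"));
}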
The issue is, if I remove the should_panic, the test panics during teardown (in the drop), not in the main test logic.
if let MessageSendEvent::SendChannelReestablish { ref node_id, ref msg } = msg {
nodes[0].node.handle_channel_reestablish(nodes[1].node.get_our_node_id(), msg);
assert_eq!(*node_id, nodes[0].node.get_our_node_id());
} else if let MessageSendEvent::SendPeerStorageRetrieval { ref node_id, ref msg } = msg {
// Should Panic here!
Let's check/assert it here then, see above.
lightning/src/ln/channelmanager.rs (Outdated)
nodes[0].node.peer_disconnected(nodes[1].node.get_our_node_id());
nodes[1].node.peer_disconnected(nodes[0].node.get_our_node_id());

// Reload Node!
// nodes[0].chain_source.clear_watched_txn_and_outputs();
Seems this is still commented out? Do we need clear_watched_txn_and_outputs at all then?
No, since we are not covering missing channelmonitors yet.
Right, then it might be preferable to just leave a TODO and drop the unused clear_watched_txn_and_outputs from this PR for now.
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
The loop condition has a potential issue where it might not terminate if the same random index is repeatedly selected. Since stored_chanmon_idx is a BTreeSet, inserting the same value multiple times doesn't increase its size. If random_usize % monitors.len() keeps generating the same index, the condition *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() will remain true indefinitely.
Consider modifying the approach to ensure termination, such as:
- Using a counter to limit iterations
- Tracking already-processed indices in a separate set
- Using a deterministic sequence instead of random selection
This would prevent potential infinite loops while still achieving the goal of selecting monitors for peer storage.
Suggested change:

-while curr_size < MAX_PEER_STORAGE_SIZE
-    && *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
-{
-    let idx = random_usize % monitors.len();
-    stored_chanmon_idx.insert(idx + 1);
+while curr_size < MAX_PEER_STORAGE_SIZE
+    && *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+    && stored_chanmon_idx.len() < monitors.len()
+{
+    let idx = random_usize % monitors.len();
+    let inserted = stored_chanmon_idx.insert(idx + 1);
+    if inserted {
+        curr_size += 1;
+    } else {
+        // If we couldn't insert, try a different index next time
+        random_usize = random_usize.wrapping_add(1);
+    }
 }
Spotted by Diamond
@tnull, this is the same concern that I had in my mind as well. Wasn't the previous approach more efficient and secure?
How about something along these lines?:
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 4a40ba872..dde117ca9 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -39,6 +39,7 @@ use crate::chain::{ChannelMonitorUpdateStatus, Filter, WatchedOutput};
use crate::events::{self, Event, EventHandler, ReplayEvent};
use crate::ln::channel_state::ChannelDetails;
use crate::ln::msgs::{self, BaseMessageHandler, Init, MessageSendEvent, SendOnlyMessageHandler};
+#[cfg(peer_storage)]
use crate::ln::our_peer_storage::{DecryptedOurPeerStorage, PeerStorageMonitorHolder};
use crate::ln::types::ChannelId;
use crate::prelude::*;
@@ -53,6 +54,8 @@ use crate::util::persist::MonitorName;
use crate::util::ser::{VecWriter, Writeable};
use crate::util::wakers::{Future, Notifier};
use bitcoin::secp256k1::PublicKey;
+#[cfg(peer_storage)]
+use core::iter::Cycle;
use core::ops::Deref;
use core::sync::atomic::{AtomicUsize, Ordering};
@@ -808,6 +811,7 @@ where
/// This function collects the counterparty node IDs from all monitors into a `HashSet`,
/// ensuring unique IDs are returned.
+ #[cfg(peer_storage)]
fn all_counterparty_node_ids(&self) -> HashSet<PublicKey> {
let mon = self.monitors.read().unwrap();
mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
@@ -815,51 +819,71 @@ where
#[cfg(peer_storage)]
fn send_peer_storage(&self, their_node_id: PublicKey) {
- #[allow(unused_mut)]
let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new();
let random_bytes = self.entropy_source.get_secure_random_bytes();
const MAX_PEER_STORAGE_SIZE: usize = 65531;
const USIZE_LEN: usize = core::mem::size_of::<usize>();
- let mut usize_bytes = [0u8; USIZE_LEN];
- usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]);
- let random_usize = usize::from_le_bytes(usize_bytes);
+ let mut random_bytes_cycle_iter = random_bytes.iter().cycle();
+
+ let mut current_size = 0;
+ let monitors_lock = self.monitors.read().unwrap();
+ let mut channel_ids = monitors_lock.keys().copied().collect();
+
+ fn next_random_id(
+ channel_ids: &mut Vec<ChannelId>,
+ random_bytes_cycle_iter: &mut Cycle<core::slice::Iter<u8>>,
+ ) -> Option<ChannelId> {
+ if channel_ids.is_empty() {
+ return None;
+ }
- let mut curr_size = 0;
- let monitors = self.monitors.read().unwrap();
- let mut stored_chanmon_idx = alloc::collections::BTreeSet::<usize>::new();
- // Used as a fallback reference if the set is empty
- let zero = 0;
+ let random_idx = {
+ let mut usize_bytes = [0u8; USIZE_LEN];
+ usize_bytes.iter_mut().for_each(|b| {
+ *b = *random_bytes_cycle_iter.next().expect("A cycle never ends")
+ });
+ // Take one more to introduce a slight misalignment.
+ random_bytes_cycle_iter.next().expect("A cycle never ends");
+ usize::from_le_bytes(usize_bytes) % channel_ids.len()
+ };
+
+ Some(channel_ids.swap_remove(random_idx))
+ }
- while curr_size < MAX_PEER_STORAGE_SIZE
- && *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+ while let Some(channel_id) = next_random_id(&mut channel_ids, &mut random_bytes_cycle_iter)
{
- let idx = random_usize % monitors.len();
- stored_chanmon_idx.insert(idx + 1);
- let (cid, mon) = monitors.iter().skip(idx).next().unwrap();
+ let monitor_holder = if let Some(monitor_holder) = monitors_lock.get(&channel_id) {
+ monitor_holder
+ } else {
+ debug_assert!(
+ false,
+ "Tried to access non-existing monitor, this should never happen"
+ );
+ break;
+ };
- let mut ser_chan = VecWriter(Vec::new());
- let min_seen_secret = mon.monitor.get_min_seen_secret();
- let counterparty_node_id = mon.monitor.get_counterparty_node_id();
+ let mut serialized_channel = VecWriter(Vec::new());
+ let min_seen_secret = monitor_holder.monitor.get_min_seen_secret();
+ let counterparty_node_id = monitor_holder.monitor.get_counterparty_node_id();
{
- let chan_mon = mon.monitor.inner.lock().unwrap();
+ let inner_lock = monitor_holder.monitor.inner.lock().unwrap();
- write_chanmon_internal(&chan_mon, true, &mut ser_chan)
+ write_chanmon_internal(&inner_lock, true, &mut serialized_channel)
.expect("can not write Channel Monitor for peer storage message");
}
let peer_storage_monitor = PeerStorageMonitorHolder {
- channel_id: *cid,
+ channel_id,
min_seen_secret,
counterparty_node_id,
- monitor_bytes: ser_chan.0,
+ monitor_bytes: serialized_channel.0,
};
- // Adding size of peer_storage_monitor.
- curr_size += peer_storage_monitor.serialized_length();
-
- if curr_size > MAX_PEER_STORAGE_SIZE {
- break;
+ current_size += peer_storage_monitor.serialized_length();
+ if current_size > MAX_PEER_STORAGE_SIZE {
+ continue;
}
+
monitors_list.push(peer_storage_monitor);
}
There might still be some room for improvement, but IMO something like this would be a lot more readable.
+ if current_size > MAX_PEER_STORAGE_SIZE {
+ continue;
}
Why not break here?
@@ -809,11 +813,57 @@ where
mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
}

#[cfg(peer_storage)]
I think you'll also need to introduce the cfg-gate in a few other places to get rid of the warning, e.g., above for all_counterparty_node_ids and on some imports.
In this PR, we begin serializing the ChannelMonitors and sending them over to determine whether any states were lost upon retrieval.
The next PR will be the final one, where we use FundRecoverer to initiate a force close and potentially go on-chain using a penalty transaction.
Sorry for the delay!