
Peer Storage (Part 3): Identifying Lost Channel States #3897


Open: adi2011 wants to merge 8 commits into main from peer-storage/serialise-deserialise

Conversation

adi2011 (Contributor) commented Jun 28, 2025

In this PR, we begin serializing the ChannelMonitors and sending them over to determine whether any states were lost upon retrieval.

The next PR will be the final one, where we use FundRecoverer to initiate a force close and potentially go on-chain using a penalty transaction.

Sorry for the delay!
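For context, the per-channel record this PR sends inside peer storage looks roughly like the sketch below; the field set mirrors what the review threads discuss, while the exact types, derives, and visibility may differ.

pub(crate) struct PeerStorageMonitorHolder {
	// Identifies the channel this backup entry belongs to.
	channel_id: ChannelId,
	// The peer on the other side of the channel.
	counterparty_node_id: PublicKey,
	// Lowest per-commitment secret index we had seen when the backup was made; comparing it
	// against our live ChannelMonitor tells us whether we have lost states.
	min_seen_secret: u64,
	// The serialized (eventually stripped-down) ChannelMonitor itself.
	monitor_bytes: Vec<u8>,
}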

ldk-reviews-bot commented Jun 28, 2025

👋 Thanks for assigning @tnull as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull requested review from tnull and removed request for joostjager June 28, 2025 11:17
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 2 times, most recently from a35566a to 4c9f3c3 on June 29, 2025 05:04
codecov bot commented Jun 29, 2025

Codecov Report

❌ Patch coverage is 59.57447% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.69%. Comparing base (ff279d6) to head (6328b33).
⚠️ Report is 50 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/chain/channelmonitor.rs 50.35% 12 Missing and 57 partials ⚠️
lightning/src/ln/channelmanager.rs 85.71% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3897      +/-   ##
==========================================
- Coverage   88.91%   88.69%   -0.23%     
==========================================
  Files         173      173              
  Lines      123393   123735     +342     
  Branches   123393   123735     +342     
==========================================
+ Hits       109717   109741      +24     
- Misses      11216    11624     +408     
+ Partials     2460     2370      -90     
Flag Coverage Δ
fuzzing ?
tests 88.69% <59.57%> (-0.06%) ⬇️


///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
Collaborator

@wpaulino what do you think we should reasonably cut here to reduce the size of a ChannelMonitor without making the emergency-case ChannelMonitors all that different from the regular ones to induce more code changes across channelmonitor.rs? Obviously we should avoid counterparty_claimable_outpoints, but how much code is gonna break in doing so?

Contributor

Not too familiar with the goals here, but if the idea is for the emergency-case ChannelMonitor to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even counterparty_claimable_outpoints (without the sources though) since the counterparty could broadcast a revoked commitment.

Collaborator

Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up but that's okay.

Contributor Author

Yes, eventually we'll need to figure out what can be stripped from ChannelMonitors. In CLN, we use a separate struct that holds only the minimal data needed to reconstruct enough of a ChannelMonitor to grab funds or penalise the peer.
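Purely for illustration (this is not CLN's actual struct and not a proposed LDK API), such a minimal recovery record might carry only something like:

struct MinimalRecoveryStub {
	// The funding UTXO to watch on-chain.
	funding_outpoint: OutPoint,
	// Who to reconnect to and, if needed, force-close on.
	counterparty_node_id: PublicKey,
	// Enough revocation material to build a penalty if a revoked commitment confirms.
	counterparty_revocation_basepoint: PublicKey,
	min_seen_secret: u64,
	// Lets us re-derive our channel signer for sweeping.
	channel_keys_id: [u8; 32],
}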

@ldk-reviews-bot

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

tnull (Contributor) left a comment

Took a first look, but I'll hold off on going into more detail until we've decided which way we should go with the ChannelMonitor stub.

},

Err(e) => {
panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
Contributor

I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but that's really no reason not to at least keep other potential channels operational.

Contributor Author

Yes, that makes sense. I think we should only panic if we have determined that we have lost some channel state.
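A minimal sketch of the non-panicking handling, assuming `cursor` wraps the decrypted blob and `check_for_lost_states` is a hypothetical helper (not this PR's exact code):

match <PeerStorageMonitorHolderList as Readable>::read(&mut cursor) {
	Ok(monitor_list) => {
		// Compare each retrieved monitor against our local state.
		self.check_for_lost_states(monitor_list);
	},
	Err(e) => {
		// Corrupted or unreadable backup: log it and keep other channels operational.
		log_debug!(self.logger, "Failed to deserialize PeerStorageMonitorHolderList: {:?}", e);
	},
}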

TheBlueMatt (Collaborator) left a comment

Few more comments, let's move forward without blocking on the ChannelMonitor serialization stuff.


let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
Collaborator

This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing monitors.iter().skip(random_usize % monitors.len()).next())
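For reference, the suggested selection looks roughly like this (a sketch, assuming `random_usize` is derived from the secure random bytes above):

let monitors = self.monitors.read().unwrap();
if !monitors.is_empty() {
	let idx = random_usize % monitors.len();
	// skip(idx).next() picks the idx-th entry of the map in iteration order.
	let (channel_id, holder) = monitors.iter().skip(idx).next().expect("idx < monitors.len()");
	// ... serialize `holder.monitor` into the blob, tracking the total size so far ...
}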

tnull (Contributor) left a comment

Did a ~first pass.

This needs a rebase now, in particular now that #3922 landed.

@@ -810,10 +813,53 @@ where
}

fn send_peer_storage(&self, their_node_id: PublicKey) {
// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

static MAX_PEER_STORAGE_SIZE: usize = 65000;
Contributor

This should be a const rather than static, I think? Also, it would probably make sense to add this at the module level, with some docs.

Collaborator

Also isn't the max size 64 KiB, not 65K?

Contributor Author

Oh, my bad, thanks for pointing this out. It should be 65531.
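That would look something like the following at the module level (the doc comment wording here is ours, not the PR's):

/// The maximum number of bytes of serialized channel data we pack into a single peer storage
/// message, leaving room for message overhead below the 65535-byte wire limit.
const MAX_PEER_STORAGE_SIZE: usize = 65531;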

let mut curr_size = 0;

// Randomising Keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Contributor

Can we make this a bit cleaner by using the proposed iterator-skipping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?

Contributor Author

I will address this in the next fixup.

Contributor

Seems this is still unaddressed?

Contributor

Gentle ping here.

Contributor Author

Created a BTreeSet to keep track of the monitors and ran a while loop until we either ran out of peer storage capacity or all monitors were stored.

monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
panic!("Can not write monitor for {}", mon.monitor.channel_id())
Contributor

Really, please avoid these explicit panics in any of this code.

Contributor Author

This would panic only if there is some issue with write_util, which suggests that we are unable to serialise the ChannelMonitor. Should we just log the error here instead?

Contributor

Mhh, while I'd still prefer to never panic in any of the peer storage code, indeed this would likely only get hit if we run out of memory or similar. If we keep it, let's just expect on the write_internal call, which avoids all this matching.
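That is, roughly the following, using the write_util helper shown above (a sketch; the final helper name may differ):

// Serialization should only fail if we run out of memory or similar, so surface it via
// expect rather than matching and panicking manually.
write_util(&chan_mon, true, &mut ser_chan)
	.expect("can not write ChannelMonitor for peer storage message");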

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 676afbc to 1111f05 on July 16, 2025 20:41
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..core::mem::size_of::<usize>()].try_into().unwrap());
Contributor

nit: Let's keep the length in a separate const, to avoid making this line overly long/noisy.

Contributor Author

Fixed, thanks!


match write_util_internal(&chan_mon, true, &mut ser_chan) {
Ok(_) => {
// Adding size of peer_storage_monitor.
curr_size += ser_chan.0.serialized_length()
Contributor

If we move this check below constructing PeerStorageMonitorHolder, can we just use its serialized_length implementation instead of tallying up the individual fields here?

Contributor Author

My bad, thanks for pointing this out. Fixed.

},
None => {
// TODO: Figure out if this channel is so old that we have forgotten about it.
panic!("Lost a channel {}", &mon_holder.channel_id);
Contributor

Why panic here? This should be the common case, no?

Contributor Author

We need this when we’ve completely forgotten about a channel but it’s still active. The only false positives occur if we’ve closed a channel and then forgotten about it. Any ideas on how we could detect those?

Contributor

We need this when we’ve completely forgotten about a channel but it’s still active.

I don't think this is accurate? Wouldn't we end up here simply in this scenario:

  1. We have a channel, store an encrypted backup with a peer
  2. The peer goes offline
  3. We close the channel, drop it from the per_peer_state
  4. The peer comes back online
  5. We retrieve and decrypt a peer backup with a channel state that we no longer know anything about.
  6. We panic and crash the node

This can happen very easily / is the common case, no? Also note this same flow could intentionally get abused by a malicious peer, too. The counterparty would simply need to guess that we include a certain channel in the stored blob, remember it, wait until the channel is closed, and then simply replay the old backup to crash our node.

I still maintain we should never panic in any of this PeerStorage related code, maybe mod the above case where we're positive that our local state is outdated compared to the backed-up version.

That said, we'll indeed need to discern between the two cases (unexpected data loss / expected data loss). One variant to do this would be to first check whether we see a funding transaction spend, although note that this would be an async flow that we can't just directly initiate here. Or we'd need to maintain a list of all closed channels and remember it forever. The latter might indeed be the simplest and safest way, if we're fine with keeping the additional data. In any case, we need to find a solution here before moving on, as panicking here is unacceptable.

Contributor Author

Yes, I agree, this is a false positive where the node crashes, but it shouldn’t. The happy path here is:

  1. We have a channel with a peer.
  2. Something happens and we lose the channel (but it is unclosed); the node cannot identify it either.
  3. We retrieve the ChannelMonitor from peer storage.
  4. We realise that this is an unknown, old channel, and the node crashes because it no longer has any record of an active channel.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

Collaborator

Yes, I agree, this is a false positive where the node crashes

We cannot tolerate a false positive if we're going to panic.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

There is no way to do this in LDK today, and I don't think we'll ever get one. We don't have some global way to query if an arbitrary UTXO is unspent (well, we do in gossip but its not always available), and once a channel is closed and the funds swept we may delete the ChannelMonitor because there's no reason to keep it around. Keeping around information about every past channel we've ever had just to detect a lost channel here doesn't seem worth it to me.

It does mean we won't detect if we lose all of our state, but I think that's okay - a user will notice that their entire balance is missing :)

Contributor Author

Yes, it should not be there. I put up the TODO so that we can have a discussion around this.
Changing the panic to log_debug.


impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
(0, channel_id, required),
(1, counterparty_node_id, required),
Contributor

For new objects, you should be able to make all fields even-numbered as they are required from the getgo, basically.

Contributor Author

Fixed.
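With all fields even-numbered, the macro invocation would look roughly like this (the last two fields and their TLV formats are assumed from the rest of this PR):

impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
	(0, channel_id, required),
	(2, counterparty_node_id, required),
	(4, min_seen_secret, required),
	(6, monitor_bytes, required_vec),
});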

/// wrapped inside [`PeerStorage`].
///
/// [`PeerStorage`]: crate::ln::msgs::PeerStorage
pub(crate) struct PeerStorageMonitorHolderList {
Contributor

I think we should be able to avoid this new type wrapper if we add a impl_writeable_for_vec entry for PeerStorageMonitorHolder in lightning/src/util/ser.rs.

Contributor

Seems this is still unaddressed?

Contributor Author

Yes, sorry, I will address this in the next fixup.

Contributor Author

Fixed
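That is, roughly a single entry in lightning/src/util/ser.rs (placement among the existing entries assumed; the read side may need a matching entry or a combined impl_for_vec-style macro):

impl_writeable_for_vec!(crate::ln::our_peer_storage::PeerStorageMonitorHolder);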

@adi2011 adi2011 requested a review from TheBlueMatt July 17, 2025 15:12
@ldk-reviews-bot

🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

TheBlueMatt (Collaborator) left a comment

Looks like you forgot to push after addressing @tnull's last comments? Also needs a rebase.

@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 6dede0e to c9baefb on July 21, 2025 13:12
@adi2011 adi2011 requested a review from TheBlueMatt July 21, 2025 13:12
adi2011 (Contributor Author) commented Jul 21, 2025

Oh sorry, my bad, I have pushed the fixups and rebased.

TheBlueMatt (Collaborator) left a comment

I think if we just remove the is_stub check and the spurious panic we can land this (feel free to squash fixups, IMO). Sadly we will need to add a cfg flag before 0.2 (unless we can nail down the serialization format we want), but it doesn't have to hold up this PR or progress.


self.lockdown_from_offchain.write(writer)?;
self.holder_tx_signed.write(writer)?;
if !is_stub {
Collaborator

For now lets not do anything differently here, I don't think skipping the onchain_tx_handler is safe, but we're not quite sure what the right answer is yet. Because we won't (yet) finalize the serialization format, we should also cfg-tag out the actual send logic in send_peer_storage, sadly.

Contributor Author

Should I not use is_stub inside the function, or should I remove the parameter from the function definition?

Contributor Author

Also, should I make a new cfg tag for peer-storage or use an existing one?

Collaborator

Should I not use is_stub inside the function, or should I remove the parameter from the function definition?

I think it's fine to add it with a TODO: figure out which fields should go here and which do not.

Also, should I make a new cfg tag for peer-storage or use an existing one?

I believe a new one (you'll have to add it in ci/ci-tests.sh as well)

Contributor Author

Cool, I will cfg-tag the sending logic. In the next PR we can discuss when to send peer storage along with what to omit from ChannelMonitors.


@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch 4 times, most recently from f95f238 to 4304853 on July 24, 2025 11:16
adi2011 added 2 commits July 24, 2025 16:49
'PeerStorageMonitorHolder' is used to wrap a single ChannelMonitor; here we are adding some
fields separately so that we do not need to read the whole ChannelMonitor to identify whether
we have lost some states.

`PeerStorageMonitorHolderList` is used to keep the list of all the channels which would
be sent over the wire inside Peer Storage.

Fixed formatting for write() in ChannelMonitorImpl. This makes the next commit cleaner by
ensuring it only contains direct code shifts, without unrelated formatting changes.
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 4304853 to e76f538 on July 24, 2025 11:20
@adi2011 adi2011 requested review from tnull and TheBlueMatt July 24, 2025 11:20
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from e76f538 to 75a8fd6 on July 24, 2025 11:24
TheBlueMatt (Collaborator) left a comment

I think this is fine now, after the notes here are fixed and rustfmt is fixed.



@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 75a8fd6 to aa08357 on July 25, 2025 05:03
adi2011 added 3 commits July 25, 2025 11:31
Create a utility function to prevent code duplication while writing ChannelMonitors.
Serialise them inside ChainMonitor::send_peer_storage and send them over.

Cfg-tag the sending logic because we are unsure of what to omit from ChannelMonitors stored inside
peer-storage.
Deserialise the ChannelMonitors and compare the data to determine if we have
lost some states.
Node should now determine lost states using retrieved peer storage.
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from aa08357 to 662fff1 on July 25, 2025 06:02
@ldk-reviews-bot

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@ldk-reviews-bot

🔔 2nd Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Cargo.toml Outdated
@@ -68,4 +68,5 @@ check-cfg = [
"cfg(splicing)",
"cfg(async_payments)",
"cfg(simple_close)",
"cfg(peer_storage)",
Contributor

nit: Indentation is off.

Contributor Author

Sorry, need to fix my indentation configuration. Fixed.

@@ -47,6 +48,8 @@ use crate::types::features::{InitFeatures, NodeFeatures};
use crate::util::errors::APIError;
use crate::util::logger::{Logger, WithContext};
use crate::util::persist::MonitorName;
#[allow(unused_imports)]
Contributor

nit: Please rather cfg-gate these imports, too.

Contributor Author

Fixed.

let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

#[cfg(peer_storage)]
Contributor

Why not cfg-gate the whole message then? Otherwise it might get really confusing what is and isn't part of the cfg flag.

Contributor Author

Ok, for now, cfg-tagging fn send_peer_storage and all the call sites of this function.
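The gating shape then looks roughly like the following (body elided; call sites in ChannelManager/ChainMonitor get the same attribute):

#[cfg(peer_storage)]
fn send_peer_storage(&self, their_node_id: PublicKey) {
	// ... serialize the selected ChannelMonitors and enqueue the PeerStorage message ...
}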

{
const MAX_PEER_STORAGE_SIZE: usize = 65531;
const USIZE_LEN: usize = core::mem::size_of::<usize>();
let random_usize = usize::from_le_bytes(random_bytes[0..USIZE_LEN].try_into().unwrap());
Contributor

nit: Let's just avoid the unnecessary unwrap here:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 1681d0187..ba7613f8e 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -816,12 +816,13 @@ where
                #[allow(unused_mut)]
                let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new();
                let random_bytes = self.entropy_source.get_secure_random_bytes();
-
                #[cfg(peer_storage)]
                {
                        const MAX_PEER_STORAGE_SIZE: usize = 65531;
                        const USIZE_LEN: usize = core::mem::size_of::<usize>();
-                       let random_usize = usize::from_le_bytes(random_bytes[0..USIZE_LEN].try_into().unwrap());
+                       let mut usize_bytes = [0u8; USIZE_LEN];
+                       usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]);
+                       let random_usize = usize::from_le_bytes(usize_bytes);
                        let mut curr_size = 0;
                        let monitors = self.monitors.read().unwrap();
                        // Randomising Keys in the HashMap to fetch monitors without repetition.

Contributor Author

Thanks for this cleanup. Fixed.


let mut ser_chan = VecWriter(Vec::new());
let min_seen_secret = mon.monitor.get_min_seen_secret();
let counterparty_node_id = mon.monitor.get_counterparty_node_id();
let chan_mon = mon.monitor.inner.lock().unwrap();
Contributor

Can we put this lock and the write call into a separate scope so we drop the lock ASAP again?

Contributor Author

Yes, thanks for pointing this out. Fixed.

@@ -17117,38 +17165,62 @@ mod tests {

#[test]
#[rustfmt::skip]
#[cfg(peer_storage)]
#[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\
Contributor

Rather than using should_panic here (where we can't discern when it's actually happening), can we rather use std::panic::catch_unwind to catch the specific panic on the specific call and then use an assert to check it's the correct error?

Contributor Author

The issue is that if I remove the should_panic, the test panics during teardown (in the drop), not in the main test logic.
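For reference, the catch_unwind variant would look roughly like this; the handler name handle_peer_storage_retrieval and its exact signature are assumed here, and the teardown panic mentioned above would still need to be handled separately:

let res = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
	nodes[0].node.handle_peer_storage_retrieval(nodes[1].node.get_our_node_id(), msg);
}));
let err = res.expect_err("expected a panic on lost channel state");
let err_msg = err
	.downcast_ref::<String>()
	.map(String::as_str)
	.or_else(|| err.downcast_ref::<&str>().copied())
	.unwrap_or("");
assert!(err_msg.contains("Lost channel state for channel"));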

if let MessageSendEvent::SendChannelReestablish { ref node_id, ref msg } = msg {
nodes[0].node.handle_channel_reestablish(nodes[1].node.get_our_node_id(), msg);
assert_eq!(*node_id, nodes[0].node.get_our_node_id());
} else if let MessageSendEvent::SendPeerStorageRetrieval { ref node_id, ref msg } = msg {
// Should Panic here!
Contributor

Let's check / assert it here then, see above.

nodes[0].node.peer_disconnected(nodes[1].node.get_our_node_id());
nodes[1].node.peer_disconnected(nodes[0].node.get_our_node_id());

// Reload Node!
// nodes[0].chain_source.clear_watched_txn_and_outputs();
Contributor

Seems this is still left commented out? Do we need clear_watched_txn_and_outputs at all then?

Contributor Author

No, since we are not covering missing ChannelMonitors yet.

Contributor

Right, then it might be preferable to just leave a TODO and drop the unused clear_watched_txn_and_outputs from this PR for now.

Comment on lines +834 to +838
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
Contributor

The loop condition has a potential issue where it might not terminate if the same random index is repeatedly selected. Since stored_chanmon_idx is a BTreeSet, inserting the same value multiple times doesn't increase its size. If random_usize % monitors.len() keeps generating the same index, the condition *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() will remain true indefinitely.

Consider modifying the approach to ensure termination, such as:

  1. Using a counter to limit iterations
  2. Tracking already-processed indices in a separate set
  3. Using a deterministic sequence instead of random selection

This would prevent potential infinite loops while still achieving the goal of selecting monitors for peer storage.

Suggested change
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
{
let idx = random_usize % monitors.len();
stored_chanmon_idx.insert(idx + 1);
while curr_size < MAX_PEER_STORAGE_SIZE
&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
&& stored_chanmon_idx.len() < monitors.len()
{
let idx = random_usize % monitors.len();
let inserted = stored_chanmon_idx.insert(idx + 1);
if inserted {
curr_size += 1;
} else {
// If we couldn't insert, try a different index next time
random_usize = random_usize.wrapping_add(1);
}

Spotted by Diamond


Contributor Author

@tnull, this is the same concern I had in mind as well. Wasn't the previous approach more efficient and secure?

Contributor

How about something along these lines?:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 4a40ba872..dde117ca9 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -39,6 +39,7 @@ use crate::chain::{ChannelMonitorUpdateStatus, Filter, WatchedOutput};
 use crate::events::{self, Event, EventHandler, ReplayEvent};
 use crate::ln::channel_state::ChannelDetails;
 use crate::ln::msgs::{self, BaseMessageHandler, Init, MessageSendEvent, SendOnlyMessageHandler};
+#[cfg(peer_storage)]
 use crate::ln::our_peer_storage::{DecryptedOurPeerStorage, PeerStorageMonitorHolder};
 use crate::ln::types::ChannelId;
 use crate::prelude::*;
@@ -53,6 +54,8 @@ use crate::util::persist::MonitorName;
 use crate::util::ser::{VecWriter, Writeable};
 use crate::util::wakers::{Future, Notifier};
 use bitcoin::secp256k1::PublicKey;
+#[cfg(peer_storage)]
+use core::iter::Cycle;
 use core::ops::Deref;
 use core::sync::atomic::{AtomicUsize, Ordering};
 
@@ -808,6 +811,7 @@ where
 
 	/// This function collects the counterparty node IDs from all monitors into a `HashSet`,
 	/// ensuring unique IDs are returned.
+	#[cfg(peer_storage)]
 	fn all_counterparty_node_ids(&self) -> HashSet<PublicKey> {
 		let mon = self.monitors.read().unwrap();
 		mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
@@ -815,51 +819,71 @@ where
 
 	#[cfg(peer_storage)]
 	fn send_peer_storage(&self, their_node_id: PublicKey) {
-		#[allow(unused_mut)]
 		let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new();
 		let random_bytes = self.entropy_source.get_secure_random_bytes();
 
 		const MAX_PEER_STORAGE_SIZE: usize = 65531;
 		const USIZE_LEN: usize = core::mem::size_of::<usize>();
-		let mut usize_bytes = [0u8; USIZE_LEN];
-		usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]);
-		let random_usize = usize::from_le_bytes(usize_bytes);
+		let mut random_bytes_cycle_iter = random_bytes.iter().cycle();
+
+		let mut current_size = 0;
+		let monitors_lock = self.monitors.read().unwrap();
+		let mut channel_ids = monitors_lock.keys().copied().collect();
+
+		fn next_random_id(
+			channel_ids: &mut Vec<ChannelId>,
+			random_bytes_cycle_iter: &mut Cycle<core::slice::Iter<u8>>,
+		) -> Option<ChannelId> {
+			if channel_ids.is_empty() {
+				return None;
+			}
 
-		let mut curr_size = 0;
-		let monitors = self.monitors.read().unwrap();
-		let mut stored_chanmon_idx = alloc::collections::BTreeSet::<usize>::new();
-		// Used as a fallback reference if the set is empty
-		let zero = 0;
+			let random_idx = {
+				let mut usize_bytes = [0u8; USIZE_LEN];
+				usize_bytes.iter_mut().for_each(|b| {
+					*b = *random_bytes_cycle_iter.next().expect("A cycle never ends")
+				});
+				// Take one more to introduce a slight misalignment.
+				random_bytes_cycle_iter.next().expect("A cycle never ends");
+				usize::from_le_bytes(usize_bytes) % channel_ids.len()
+			};
+
+			Some(channel_ids.swap_remove(random_idx))
+		}
 
-		while curr_size < MAX_PEER_STORAGE_SIZE
-			&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+		while let Some(channel_id) = next_random_id(&mut channel_ids, &mut random_bytes_cycle_iter)
 		{
-			let idx = random_usize % monitors.len();
-			stored_chanmon_idx.insert(idx + 1);
-			let (cid, mon) = monitors.iter().skip(idx).next().unwrap();
+			let monitor_holder = if let Some(monitor_holder) = monitors_lock.get(&channel_id) {
+				monitor_holder
+			} else {
+				debug_assert!(
+					false,
+					"Tried to access non-existing monitor, this should never happen"
+				);
+				break;
+			};
 
-			let mut ser_chan = VecWriter(Vec::new());
-			let min_seen_secret = mon.monitor.get_min_seen_secret();
-			let counterparty_node_id = mon.monitor.get_counterparty_node_id();
+			let mut serialized_channel = VecWriter(Vec::new());
+			let min_seen_secret = monitor_holder.monitor.get_min_seen_secret();
+			let counterparty_node_id = monitor_holder.monitor.get_counterparty_node_id();
 			{
-				let chan_mon = mon.monitor.inner.lock().unwrap();
+				let inner_lock = monitor_holder.monitor.inner.lock().unwrap();
 
-				write_chanmon_internal(&chan_mon, true, &mut ser_chan)
+				write_chanmon_internal(&inner_lock, true, &mut serialized_channel)
 					.expect("can not write Channel Monitor for peer storage message");
 			}
 			let peer_storage_monitor = PeerStorageMonitorHolder {
-				channel_id: *cid,
+				channel_id,
 				min_seen_secret,
 				counterparty_node_id,
-				monitor_bytes: ser_chan.0,
+				monitor_bytes: serialized_channel.0,
 			};
 
-			// Adding size of peer_storage_monitor.
-			curr_size += peer_storage_monitor.serialized_length();
-
-			if curr_size > MAX_PEER_STORAGE_SIZE {
-				break;
+			current_size += peer_storage_monitor.serialized_length();
+			if current_size > MAX_PEER_STORAGE_SIZE {
+				continue;
 			}
+
 			monitors_list.push(peer_storage_monitor);
 		}

There might still be some room for improvement, but IMO something like this would be a lot more readable.

Contributor Author

+			if current_size > MAX_PEER_STORAGE_SIZE {
+				continue;
 			}

Why not break here?

@adi2011 adi2011 requested a review from tnull July 29, 2025 16:17

@@ -809,11 +813,57 @@ where
mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
}

#[cfg(peer_storage)]
Contributor

I think you'll also need to introduce the cfg-gate in a few other places to get rid of the warning, e.g., above for all_counterparty_node_ids and on some imports.
