Monitor Updating Persister (MUP) Design #2545
-
Should that be the SUM or the DIFFERENCE?
-
TXID is a hexdump, IDX is a number, and
-
Yes. As this is the open-source LDK, the backing store used may be different, and its characteristics (like speed-of-list-directory) may vary greatly, so it is better to trigger this async. I suggest we also add a cleanup routine at init, also async, which does a best-effort cleanup of stale updates. The node may crash at any time.
-
At the Lightning BOLT level, it is always safe to lose the latest state --- which LDK should translate to the latest CMU. Lightning BOLT has a "hand-over-hand": when moving from one state to the next, there is a short period when both states are valid (== not punishable even if the older state is published onchain). The exact order of events is:
Thus, as long as "save new state" completes successfully, and on a restart, is indeed in the persistence layer, it does not matter. If the "save new state" never completes (e.g. due to node crashing at that point) then it is still safe at the Lightning BOLT layer. Users of MUP need not worry about this separately. The exact requirement is just: Either the write completes as a whole (whether CM or CMU), or none of it is written. That is, there should be no partial writes. This kind of atomicity is already implemented in the
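For readers who want the concrete pattern, here is a minimal sketch of one common way to get that all-or-nothing behavior on a filesystem backend (write a temp file, fsync it, then atomically rename). The function and layout are illustrative, not LDK's actual implementation.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Write `data` to `path` such that readers see either the old contents or the
// new contents, never a partial write: write a temp file, fsync it, then
// atomically rename it over the destination.
fn write_atomically(path: &Path, data: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(data)?;
        f.sync_all()?; // make the new contents durable before the swap
    }
    fs::rename(&tmp, path)?; // atomic replace on POSIX filesystems
    // Optionally fsync the parent directory so the rename itself is durable (Unix-style).
    if let Some(dir) = path.parent() {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```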
-
Some additional edge cases to note, apart from
Sometime back, I had suggested that we add a flag to delete interface, "can_be_eventual_delete". This has multiple advantages:
Separately, we need to evaluate if it is possible to publish an event at this layer, i.e. something like CmuCleanup in range i..j
-
Probably, have to do something, yea. Monitor update persistence is the critical path for all our usual lightning actions, so super high tail latency would suck. The issue with async as noted elsewhere is we end up needing a "spawn function" bound, which is kinda awkward (though we do have it elsewhere in LDK already). We currently do deletes one at a time, which is gonna result in a ton of excess fsync traffic. I wonder if we just add a batch delete call and don't worry about async. We could also consider a flag to remove atomicity from deletes, which some backends could use to spawn deletes async or skip fsync. The LDK-provided backends could skip the async part and just remove fsync in the filesystem implementation (which is basically async anyway with the kernel flushing eventually) but those with higher throughput nodes could do their own backend that spawns.
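To make the batch-delete idea concrete, here is a hypothetical sketch; the trait, method name, and `lazy` flag are made up for illustration and are not LDK API.

```rust
// Hypothetical sketch only -- not LDK API. Illustrates "batch delete plus
// optional durability": a backend may group deletes into one fsync, or skip
// fsync entirely when `lazy` is true.
use std::io;

pub trait KVStoreBatchDelete {
    /// Remove several keys from one namespace.
    ///
    /// If `lazy` is true, the backend may defer durability (skip fsync, or
    /// push the work to a background thread); a key that "comes back" after a
    /// crash must be tolerable to the caller.
    fn delete_batch(&self, namespace: &str, keys: &[String], lazy: bool) -> io::Result<()>;
}

// A filesystem-flavored example: unlink everything, then fsync the directory
// once (or not at all in lazy mode) instead of once per key (Unix-style).
struct FsStore { root: std::path::PathBuf }

impl KVStoreBatchDelete for FsStore {
    fn delete_batch(&self, namespace: &str, keys: &[String], lazy: bool) -> io::Result<()> {
        let dir = self.root.join(namespace);
        for key in keys {
            match std::fs::remove_file(dir.join(key)) {
                Ok(()) => {}
                Err(e) if e.kind() == io::ErrorKind::NotFound => {} // already gone
                Err(e) => return Err(e),
            }
        }
        if !lazy {
            // One directory fsync for the whole batch.
            std::fs::File::open(&dir)?.sync_all()?;
        }
        Ok(())
    }
}
```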
-
Because it's so critical to get this right, and bugs can lead to user funds loss, this kinda worries me admittedly. You and @G8XSU both noted on Discord that many storage backends return "not found" for "permission denied", which makes this a potentially brittle assumption.
This is kinda true, but in general the ChannelManager's persistence isn't guaranteed to complete in a timely manner, so we could have persisted a ChannelMonitorUpdate, taken irreversible action, and then crash, at which point ensuring that CMU is applied on startup becomes critical.
-
I pulled some data, and I'm overstating the typical case, but the max is a different story. We have one that's 60MB; another that's 25MB; lots that are 10-15MB. But most are 5-10MB, that's true. I spend more time looking at the bigger ones, and also we record writes, not actually the file sizes, so I see the larger writes more often (we seem to get bigger CMs with busier channels). I'll update.
Yes, I was writing on the CM line but wrote CMU because the acronyms are easy (for me) to mix up.
-
Note that if we don't want to list all CMUs on initialization of the
If we persisted above mentioned state, we could also get rid of depending on
-
LDK Monitor Updating Persister (MUP) Design
Motivation
Currently (as of 0.0.116), the "batteries included" persistence traits and implementation in LDK are, at least in practice, limited to reading at init, and writing whole objects when anything important in memory changes. This model is simple and reliable, but inefficient: typically a very small part of the object's state actually changes, and so this model mostly writes data that are unchanged.
Simplicity at the cost of efficiency is a good trade for most nodes, like wallets: with a handful of channels and typical updates on the order of perhaps 100s per day*, the inefficiency is not noticeable on modern (even mobile) hardware. But it is bad for routing nodes, which tend to have many high-traffic channels. These nodes must write objects of around 5-10MB (typically) many times per second, still with most bytes unchanged on each write. The inefficiency in this scenario is much more noticeable, driving up costs (like synchronous replication), and even provoking failures in storage that hurt reliability.
Anecdotally (as data is still scarce), an LDK routing node with relatively little traffic must write 400-600% of its overall channel state corpus per minute. For some channels that are very active, the percentages can be an order of magnitude higher. For such routing nodes, a different balance is probably desirable, accepting more complexity in exchange for improved efficiency.
Design overview
We propose a `MonitorUpdatingPersister` (MUP). This is primarily an implementation of `Persist`, which is the interface LDK uses to drive channel state updates to storage. MUP uses a key-value storage model, requiring a `KVStore` implementation for storage.

MUP is a work-in-progress at #2359, which at the time of this writing is of a different design, but provided useful R&D input to this one.
`Persist` background

`Persist` drives channel state storage by prompting the node to write either:

- `ChannelMonitor` (CM), which contains the entire state of a channel. TLV-serialized CMs are typically on the order of 10s of MB.
- `ChannelMonitorUpdate` (CMU), which contains only a differential change to channel state. A serialized CMU is a tiny fraction of the size of a CM, as small as 1KB.

Both CMs and CMUs contain an `update_id`. For CMs, this reflects the latest update applied to it; for CMUs, this reflects the update the CMU contains. An in-memory CM struct can apply an update to itself, increasing its `update_id`.

An invariant of this design is that `update_id` is a monotonically increasing `u64`, except for the `CLOSED_CHANNEL_UPDATE_ID` (equivalent to `u64::MAX`), which may be in multiple updates to the channel.
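For orientation, a deliberately simplified sketch of the two write paths `Persist` exposes; these are not the real LDK signatures (the actual trait is generic over a signer type and returns a persistence status), just the shape relevant to this design.

```rust
// Illustrative only -- simplified, NOT the exact LDK `Persist` trait.
// It shows the two write paths described above: whole-CM writes and
// differential CMU writes, both tied to the channel's funding outpoint.
struct FundingOutpoint { txid: [u8; 32], index: u16 }

trait PersistLike {
    /// Persist the full ChannelMonitor (CM): large but self-contained.
    fn persist_new_channel(&self, channel: &FundingOutpoint, serialized_cm: &[u8]);

    /// Persist a ChannelMonitorUpdate (CMU): a small diff identified by `update_id`,
    /// meaningful only relative to an already-stored CM.
    fn update_persisted_channel(&self, channel: &FundingOutpoint, update_id: u64, serialized_cmu: &[u8]);
}
```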
How MUP implements `Persist` (writes)

The MUP keeps the following internal state:

- `last_monitor_update_id`, a map where each value is the `u64` `update_id` at which a `ChannelMonitor` was last persisted for the channel.
- A `u64` `maximum_pending_updates`, which describes the target maximum of stored CMUs that haven't been applied to the last stored CM.

The MUP implements the trait as follows:
- To persist a new channel: writes the entire CM to the `KVStore`. MUP uses a namespace like `monitors`, and a `[TXID]_[IDX]` key format. When the write is complete, `last_monitor_update_id` is updated accordingly.
- To persist an update: looks up the channel's `last_monitor_update_id`.
  - If the CMU's `update_id` != `CLOSED_CHANNEL_UPDATE_ID`, and the difference between that `update_id` and `last_monitor_update_id` is less than or equal to `maximum_pending_updates`: writes the CMU to a namespace like `monitor_updates`, using a key that extends the CM key to `[TXID]_[IDX]_[UPDATE_ID]`.
  - Otherwise:
    - Keeps the previous `last_monitor_update_id` in-scope as `old_update_id`.
    - Writes the entire CM, as in `persist_new_channel` (also updates `last_monitor_update_id`).
    - Deletes the CMUs in the range `old_update_id..=CM.update_id`. This implies that if the `update_id` didn't actually change, such as when we get here via multiple updates at `CLOSED_CHANNEL_UPDATE_ID`, the cleanup is a no-op. TBD: Is this worth doing as an async, fire-and-forget task, to avoid blocking on this I/O?

In this way, failed deletes notwithstanding, the boundary of required storage for a given channel is `maximum_pending_updates`
+ 1 (the CM) items. Additionally, update keys are derived without listing them.
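A compressed sketch of the write path just described, under this post's assumed key formats and namespaces. The `Store` trait, method names, error handling, and the bounded delete range are illustrative simplifications, not the proposed implementation.

```rust
// Compressed, illustrative sketch of the MUP write path.
use std::collections::HashMap;

const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

trait Store {
    fn write(&self, namespace: &str, key: &str, value: &[u8]);
    fn remove(&self, namespace: &str, key: &str);
}

struct Mup<S> {
    store: S,
    maximum_pending_updates: u64,
    last_monitor_update_id: HashMap<String, u64>, // keyed by "TXID_IDX"
}

impl<S: Store> Mup<S> {
    fn persist_new_channel(&mut self, cm_key: &str, cm_bytes: &[u8], cm_update_id: u64) {
        self.store.write("monitors", cm_key, cm_bytes);
        self.last_monitor_update_id.insert(cm_key.to_string(), cm_update_id);
    }

    fn update_persisted_channel(
        &mut self, cm_key: &str, cm_bytes: &[u8], cm_update_id: u64,
        cmu_bytes: &[u8], cmu_update_id: u64,
    ) {
        let last = *self.last_monitor_update_id.get(cm_key).unwrap_or(&0);
        if cmu_update_id != CLOSED_CHANNEL_UPDATE_ID
            && cmu_update_id.saturating_sub(last) <= self.maximum_pending_updates
        {
            // Cheap path: store only the small diff, keyed by TXID_IDX_UPDATEID.
            let key = format!("{}_{}", cm_key, cmu_update_id);
            self.store.write("monitor_updates", &key, cmu_bytes);
        } else {
            // Consolidate: write the whole CM, then delete the CMUs written since
            // the last full persist. Only ids up to last + maximum_pending_updates
            // can exist, so this sketch bounds the delete range by that (the design
            // text expresses the range as old_update_id..=CM.update_id instead).
            let old_update_id = last;
            self.persist_new_channel(cm_key, cm_bytes, cm_update_id);
            for id in old_update_id..=old_update_id.saturating_add(self.maximum_pending_updates) {
                self.store.remove("monitor_updates", &format!("{}_{}", cm_key, id));
            }
        }
    }
}
```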
How MUP reconstructs state (reads)

The MUP primarily needs to read data when starting an LDK program (we'll call this time "init"). It must reconstruct the state of all channels in-memory from what was persisted in storage. Users should therefore call MUP's `read_channel_monitors_with_updates` at init, as a substitute for `read_channel_monitors` as included in LDK 0.0.116 and other recent versions (note there is not an LDK trait that describes this functionality; it is, at most, a convention).

The `read_channel_monitors_with_updates` function performs the following actions:

- Reads every stored CM, preparing an empty `result` collection.
- For each CM, if its `update_id` == `CLOSED_CHANNEL_UPDATE_ID`, add the CM to `result` and continue, as we never write CMUs once this sequence height is reached.
- Otherwise, let `u` equal the `update_id` of the CM, and loop:
  - Increment `u` by one (`u += 1`).
  - Derive the key `[TXID]_[IDX]_[UPDATE_ID]` using the CM data and `u`. TBD: should we use a different delimiter between `[TXID]_[IDX]` and `[UPDATE_ID]`?
  - Read the CMU at that key and apply it to the CM; when no next CMU is found, add the updated CM to `result`.
- Returns `result`.

In this design, we rely upon the invariant of the monotonically increasing `update_id`
to simply read CMUs until we fail to find a next CMU to signal that we've reached the final update in the sequence (note: VERY IMPORTANT to only terminate the updating loop on a missing CMU, not just on any failure to read). This removes the need to list CMUs.
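A sketch of that per-channel replay loop at init, assuming a read call that reports a missing key as `NotFound`; all types and names here are illustrative.

```rust
// Illustrative read loop for one channel at init: start from the stored CM's
// update_id and keep reading/applying CMUs until a key is missing. Only a
// NotFound result may end the loop; any other error must abort startup.
use std::io;

trait ReadStore {
    fn read(&self, namespace: &str, key: &str) -> io::Result<Vec<u8>>;
}

struct ChannelMonitorStub { update_id: u64 /* ... real CM state ... */ }

impl ChannelMonitorStub {
    fn apply_update(&mut self, _cmu_bytes: &[u8]) { self.update_id += 1; }
}

fn replay_updates<S: ReadStore>(
    store: &S, cm_key: &str, cm: &mut ChannelMonitorStub,
) -> io::Result<()> {
    let mut u = cm.update_id;
    loop {
        u += 1;
        let key = format!("{}_{}", cm_key, u);
        match store.read("monitor_updates", &key) {
            Ok(cmu_bytes) => cm.apply_update(&cmu_bytes),
            // Missing key == end of the sequence. This is the ONLY error that
            // may terminate the loop.
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            // Anything else (permissions, I/O): propagate and fail startup.
            Err(e) => return Err(e),
        }
    }
}
```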
Tolerance of stale CMUs and cleaning them up

MUP's cleanup of stale CMUs doesn't list all CMUs, and doesn't guarantee deletes succeed (e.g., the program could abort mid-cleanup). Therefore, it's possible to have stale CMUs on disk, where "stale" means having an `update_id` that is lower than the CM stored for that channel.

Stale CMUs will not be a problem for MUP itself. During init, these CMUs are never read. During normal cleanups, these CMUs are ignored (that is, they do not become additional work).

However, these can be a source of extra disk usage and a burden to carry around. So, MUP offers a `clean_stale_updates` function, which must be invoked specifically by the node, perhaps via an RPC or a `cron`-style task. This function lists the stored CMUs, and for each one, if the `update_id` of the CMU is lower than the `update_id`
The primary alternative considered was to use
KVStore
listing of CMUs at both the init time and during cleanup. This allows external storage metadata to be the "source of truth" about what is stored. But this comes at a high cost (lots of listing I/O) for little benefit, since theupdate_id
is very predictable. About the only benefit this provided is forcing examination of all stored updates at init and cleanup times, which did two things:clean_stale_updates
) to make cleaning them up out-of-band easy.KVStore
and its backends get error semantics for failed reads correct. Since we listed updates, we could treat a subsequentNotFound
error as a stop-the-world error, and never as a signal of the end of a sequence. We could also detect and error on a gap in updates. However, such a gap should cause the CM to conflict with theChannelManager
, and therefore prompt LDK to panic and abort, summoning the user for any possible recovery. TBD: Is this true, or is this corner even sharper than that?In all, the tradeoffs seemed to favor this design.
Tradeoffs and risk for the user
Using MUP will present all the same risks as the incumbent, simple, CM-only persister, with some additional ones.
Of note, one of the worst things one could do with MUP is restore old CMs from backup. This raises the prospect of punishment transactions, since stored CMUs would probably not be read and applied (though, reconciliation with the `ChannelManager`
is required before LDK will let you use this data with peers). However, doing such a restore with the CM-only persister would be equally disastrous, so no new risk is introduced.A novel risk with MUP is a failure to write CMUs without proper error handling (including immediate panic-and-abort). One may be able to persist CMs and not CMUs if they use different datastores for them (e.g., a DB for CMUs and object storage for CMs). It is even more critical that node operators using MUP carefully consider how to detect and handle storage failures.
Also, as mentioned, users must get the
`NotFound`
error semantics correct in their `KVStore`
implementation.

Expected interaction with different storage backends
MUP should be compatible with most backends that one would choose to use with `KVStore`. The greatest area of concern is the storage of CMUs, which could get to be many (the number of channels, times `maximum_pending_updates`, leaked stale updates notwithstanding).

Note that `KVStore` has a concept of "namespaces," which enumerate the various things LDK cares to store (CMs, CMUs, a channel manager, a network graph, etc.). All CMUs stored by MUP will be in a single namespace specific to CMUs.

Therefore, it's possible this namespace could need to accommodate millions or even billions of entries, depending on the scale of the node and how `maximum_pending_updates`
is set.KVStore
backends may want to consider special handling for the CMU namespace:[TXID]_[IDX]
) is probably wise.[TXID]_[IDX]
, which would collide.Backwards compatibility
While one can upgrade from CM-only persistence to MUP with no changes, downgrading is more complex.
Because MUP's CMs could have pending updates in storage, it's important that we don't make it too easy to read CMs from MUP into the CM-only persistence model, which assumes that stored CMs are always up-to-date. To disarm this "footgun," we prepend some sentinel bytes to MUP-written CMs that break deserialization for any implementation that doesn't know about the sentinel bytes.
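As an illustration of the sentinel mechanism (with a made-up marker value; the real bytes would be whatever MUP standardizes on):

```rust
// Illustrative only: the sentinel value below is made up. The idea is that a
// reader that does not know about the sentinel fails to deserialize the CM,
// instead of silently treating a possibly-behind CM as fully up to date.
const MUP_SENTINEL: &[u8] = b"LDK-MUP-CM\0"; // hypothetical marker

fn wrap_cm_for_storage(serialized_cm: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(MUP_SENTINEL.len() + serialized_cm.len());
    out.extend_from_slice(MUP_SENTINEL);
    out.extend_from_slice(serialized_cm);
    out
}

fn unwrap_stored_cm(stored: &[u8]) -> Result<&[u8], &'static str> {
    match stored.strip_prefix(MUP_SENTINEL) {
        Some(cm) => Ok(cm),
        // A plain (non-MUP) reader would hit this as a deserialization failure.
        None => Err("missing MUP sentinel; refusing to read"),
    }
}
```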
This way, one can downgrade through some deliberate steps, in no particular order:
Addenda
*: The author is speculating. I don't run an LDK wallet and don't know how many updates they really get per channel per day, but I am pretty sure it's a lot less than the 100k-per-hour we're forecasting for aggressively used routing channels.
Request for comment
Please comment on the "TBD" (to be determined) items.
Otherwise, looking for critique and/or consensus.