Monitor Updating Persister (MUP) Design #2545
-
Should that be the SUM or the DIFFERENCE?
-
TXID is a hexdump, IDX is a number, and
-
Yes. As this is the open-source LDK, the backing store used may be different, and its characteristics (like speed-of-list-directory) may vary greatly, so it is better to trigger this async. I suggest we also add a cleanup routine at init, also async, which does a best-effort cleanup of stale updates. The node may crash at any time.
-
At the Lightning BOLT level, it is always safe to lose the latest state --- which LDK should translate to the latest CMU. Lightning BOLT has a "hand-over-hand": when moving from one state to the next, there is a short period when both states are valid (== not punishable even if the older state is published onchain). The exact order of events is:
Thus, as long as "save new state" completes successfully, and on a restart, is indeed in the persistence layer, it does not matter. If the "save new state" never completes (e.g. due to node crashing at that point) then it is still safe at the Lightning BOLT layer. Users of MUP need not worry about this separately. The exact requirement is just: Either the write completes as a whole (whether CM or CMU), or none of it is written. That is, there should be no partial writes. This kind of atomicity is already implemented in the
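For readers who want the concrete pattern, here is a minimal sketch of one common way to get that all-or-nothing behavior on a filesystem backend (write a temp file, fsync it, then atomically rename). The function and layout are illustrative, not LDK's actual implementation.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Write `data` to `path` such that readers see either the old contents or the
// new contents, never a partial write: write a temp file, fsync it, then
// atomically rename it over the destination.
fn write_atomically(path: &Path, data: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(data)?;
        f.sync_all()?; // make the new contents durable before the swap
    }
    fs::rename(&tmp, path)?; // atomic replace on POSIX filesystems
    // Optionally fsync the parent directory so the rename itself is durable (Unix-style).
    if let Some(dir) = path.parent() {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```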
-
Some additional edge cases to note, apart from
Sometime back, I had suggested that we add a flag to delete interface, "can_be_eventual_delete". This has multiple advantages:
Separately, we need to evaluate if it is possible to publish an event at this layer, i.e. something like CmuCleanup in range i..j
-
Probably, have to do something, yea. Monitor update persistence is the critical path for all our usual lightning actions, so super high tail latency would suck. The issue with async as noted elsewhere is we end up needing a "spawn function" bound, which is kinda awkward (though we do have it elsewhere in LDK already). We currently do deletes one at a time, which is gonna result in a ton of excess fsync traffic. I wonder if we just add a batch delete call and don't worry about async. We could also consider a flag to remove atomicity from deletes, which some backends could use to spawn deletes async or skip fsync. The LDK-provided backends could skip the async part and just remove fsync in the filesystem implementation (which is basically async anyway with the kernel flushing eventually) but those with higher throughput nodes could do their own backend that spawns.
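To make the batch-delete idea concrete, here is a hypothetical sketch; the trait, method name, and `lazy` flag are made up for illustration and are not LDK API.

```rust
// Hypothetical sketch only -- not LDK API. Illustrates "batch delete plus
// optional durability": a backend may group deletes into one fsync, or skip
// fsync entirely when `lazy` is true.
use std::io;

pub trait KVStoreBatchDelete {
    /// Remove several keys from one namespace.
    ///
    /// If `lazy` is true, the backend may defer durability (skip fsync, or
    /// push the work to a background thread); a key that "comes back" after a
    /// crash must be tolerable to the caller.
    fn delete_batch(&self, namespace: &str, keys: &[String], lazy: bool) -> io::Result<()>;
}

// A filesystem-flavored example: unlink everything, then fsync the directory
// once (or not at all in lazy mode) instead of once per key (Unix-style).
struct FsStore { root: std::path::PathBuf }

impl KVStoreBatchDelete for FsStore {
    fn delete_batch(&self, namespace: &str, keys: &[String], lazy: bool) -> io::Result<()> {
        let dir = self.root.join(namespace);
        for key in keys {
            match std::fs::remove_file(dir.join(key)) {
                Ok(()) => {}
                Err(e) if e.kind() == io::ErrorKind::NotFound => {} // already gone
                Err(e) => return Err(e),
            }
        }
        if !lazy {
            // One directory fsync for the whole batch.
            std::fs::File::open(&dir)?.sync_all()?;
        }
        Ok(())
    }
}
```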
-
Because it's so critical to get this right, and bugs can lead to user funds loss, this kinda worries me admittedly. You and @G8XSU both noted on Discord that many storage backends return "not found" for "permission denied", which makes this a potentially brittle assumption.
This is kinda true, but in general the ChannelManager's persistence isn't guaranteed to complete in a timely manner, so we could have persisted a ChannelMonitorUpdate, taken irreversible action, and then crash, at which point ensuring that CMU is applied on startup becomes critical.
-
I pulled some data, and I'm overstating the typical case, but the max is a different story. We have one that's 60MB; another that's 25MB; lots that are 10-15MB. But most are 5-10MB, that's true. I spend more time looking at the bigger ones, and also we record writes, not actually the file sizes, so I see the larger writes more often (we seem to get bigger CMs with busier channels). I'll update.
Yes, I was writing on the CM line but wrote CMU because the acronyms are easy (for me) to mix up.
-
Note that if we don't want to list all CMUs on initialization of the
If we persisted above mentioned state, we could also get rid of depending on
-
LDK Monitor Updating Persister (MUP) Design
Motivation
Currently (as of 0.0.116), the "batteries included" persistence traits and implementation in LDK are, at least in practice, limited to reading at init, and writing whole objects when anything important in memory changes. This model is simple and reliable, but inefficient: typically a very small part of the object's state actually changes, and so this model mostly writes data that are unchanged.
Simplicity at the cost of efficiency is a good trade for most nodes, like wallets: with a handful of channels and typical updates on the order of perhaps 100s per day*, the inefficiency is not noticeable on modern (even mobile) hardware. But it is bad for routing nodes, which tend to have many high-traffic channels. These nodes must write objects of around 5-10MB (typically) many times per second, still with most bytes unchanged on each write. The inefficiency in this scenario is much more noticeable, driving up costs (like synchronous replication), and even provoking failures in storage that hurt reliability.
Anecdotally (as data is still scarce), an LDK routing node with relatively little traffic must write 400-600% of its overall channel state corpus per minute. For some channels that are very active, the percentages can be an order of magnitude higher. For such routing nodes, a different balance is probably desirable, accepting more complexity in exchange for improved efficiency.
Design overview
We propose a `MonitorUpdatingPersister` (MUP). This is primarily an implementation of `Persist`, which is the interface LDK uses to drive channel state updates to storage. MUP uses a key-value storage model, requiring a `KVStore` implementation for storage.

MUP is a work-in-progress at #2359, which at the time of this writing is of a different design, but provided useful R&D input to this one.
`Persist` background

`Persist` drives channel state storage by prompting the node to write either:

- `ChannelMonitor` (CM), which contains the entire state of a channel. TLV-serialized CMs are typically on the order of 10s of MB.
- `ChannelMonitorUpdate` (CMU), which contains only a differential change to channel state. A serialized CMU is a tiny fraction of the size of a CM, as small as 1KB.

Both CMs and CMUs contain an `update_id`. For CMs, this reflects the latest update applied to it; for CMUs, this reflects the update the CMU contains. An in-memory CM struct can apply an update to itself, increasing its `update_id`.

An invariant of this design is that `update_id` is a monotonically increasing `u64`, except for the `CLOSED_CHANNEL_UPDATE_ID` (equivalent to `u64::MAX`), which may be in multiple updates to the channel.
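For orientation, a deliberately simplified sketch of the two write paths `Persist` exposes; these are not the real LDK signatures (the actual trait is generic over a signer type and returns a persistence status), just the shape relevant to this design.

```rust
// Illustrative only -- simplified, NOT the exact LDK `Persist` trait.
// It shows the two write paths described above: whole-CM writes and
// differential CMU writes, both tied to the channel's funding outpoint.
struct FundingOutpoint { txid: [u8; 32], index: u16 }

trait PersistLike {
    /// Persist the full ChannelMonitor (CM): large but self-contained.
    fn persist_new_channel(&self, channel: &FundingOutpoint, serialized_cm: &[u8]);

    /// Persist a ChannelMonitorUpdate (CMU): a small diff identified by `update_id`,
    /// meaningful only relative to an already-stored CM.
    fn update_persisted_channel(&self, channel: &FundingOutpoint, update_id: u64, serialized_cmu: &[u8]);
}
```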
How MUP implements `Persist` (writes)

The MUP keeps the following internal state:

- `last_monitor_update_id`, a map where each value is the `u64` `update_id` at which a `ChannelMonitor` was last persisted for the channel.
- A `u64` `maximum_pending_updates`, which describes the target maximum of stored CMUs that haven't been applied to the last stored CM.

The MUP implements the trait as follows:
- To persist a new channel: writes the entire CM to the `KVStore`. MUP uses a namespace like `monitors`, and a `[TXID]_[IDX]` key format. When the write is complete, `last_monitor_update_id` is updated accordingly.
- To persist an update: looks up the channel's `last_monitor_update_id`.
  - If the CMU's `update_id` != `CLOSED_CHANNEL_UPDATE_ID`, and the difference between that `update_id` and `last_monitor_update_id` is less than or equal to `maximum_pending_updates`: writes the CMU to a namespace like `monitor_updates`, using a key that extends the CM key to `[TXID]_[IDX]_[UPDATE_ID]`.
  - Otherwise:
    - Keeps the previous `last_monitor_update_id` in-scope as `old_update_id`.
    - Writes the entire CM, as in `persist_new_channel` (also updates `last_monitor_update_id`).
    - Deletes the CMUs in the range `old_update_id..=CM.update_id`. This implies that if the `update_id` didn't actually change, such as when we get here via multiple updates at `CLOSED_CHANNEL_UPDATE_ID`, the cleanup is a no-op. TBD: Is this worth doing as an async, fire-and-forget task, to avoid blocking on this I/O?

In this way, failed deletes notwithstanding, the boundary of required storage for a given channel is `maximum_pending_updates`
+ 1 (the CM) items. Additionally, update keys are derived without listing them.
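A compressed sketch of the write path just described, under this post's assumed key formats and namespaces. The `Store` trait, method names, error handling, and the bounded delete range are illustrative simplifications, not the proposed implementation.

```rust
// Compressed, illustrative sketch of the MUP write path.
use std::collections::HashMap;

const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

trait Store {
    fn write(&self, namespace: &str, key: &str, value: &[u8]);
    fn remove(&self, namespace: &str, key: &str);
}

struct Mup<S> {
    store: S,
    maximum_pending_updates: u64,
    last_monitor_update_id: HashMap<String, u64>, // keyed by "TXID_IDX"
}

impl<S: Store> Mup<S> {
    fn persist_new_channel(&mut self, cm_key: &str, cm_bytes: &[u8], cm_update_id: u64) {
        self.store.write("monitors", cm_key, cm_bytes);
        self.last_monitor_update_id.insert(cm_key.to_string(), cm_update_id);
    }

    fn update_persisted_channel(
        &mut self, cm_key: &str, cm_bytes: &[u8], cm_update_id: u64,
        cmu_bytes: &[u8], cmu_update_id: u64,
    ) {
        let last = *self.last_monitor_update_id.get(cm_key).unwrap_or(&0);
        if cmu_update_id != CLOSED_CHANNEL_UPDATE_ID
            && cmu_update_id.saturating_sub(last) <= self.maximum_pending_updates
        {
            // Cheap path: store only the small diff, keyed by TXID_IDX_UPDATEID.
            let key = format!("{}_{}", cm_key, cmu_update_id);
            self.store.write("monitor_updates", &key, cmu_bytes);
        } else {
            // Consolidate: write the whole CM, then delete the CMUs written since
            // the last full persist. Only ids up to last + maximum_pending_updates
            // can exist, so this sketch bounds the delete range by that (the design
            // text expresses the range as old_update_id..=CM.update_id instead).
            let old_update_id = last;
            self.persist_new_channel(cm_key, cm_bytes, cm_update_id);
            for id in old_update_id..=old_update_id.saturating_add(self.maximum_pending_updates) {
                self.store.remove("monitor_updates", &format!("{}_{}", cm_key, id));
            }
        }
    }
}
```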
How MUP reconstructs state (reads)

The MUP primarily needs to read data when starting an LDK program (we'll call this time "init"). It must reconstruct the state of all channels in-memory from what was persisted in storage. Users should therefore call MUP's `read_channel_monitors_with_updates` at init, as a substitute for `read_channel_monitors` as included in LDK 0.0.116 and other recent versions (note there is not an LDK trait that describes this functionality; it is, at most, a convention).

The `read_channel_monitors_with_updates` function performs the following actions:

- Reads every stored CM, preparing an empty `result` collection.
- For each CM, if its `update_id` == `CLOSED_CHANNEL_UPDATE_ID`, add the CM to `result` and continue, as we never write CMUs once this sequence height is reached.
- Otherwise, let `u` equal the `update_id` of the CM, and loop:
  - Increment `u` by one (`u += 1`).
  - Derive the key `[TXID]_[IDX]_[UPDATE_ID]` using the CM data and `u`. TBD: should we use a different delimiter between `[TXID]_[IDX]` and `[UPDATE_ID]`?
  - Read the CMU at that key and apply it to the CM; when no next CMU is found, add the updated CM to `result`.
- Returns `result`.

In this design, we rely upon the invariant of the monotonically increasing `update_id`
to simply read CMUs until we fail to find a next CMU to signal that we've reached the final update in the sequence (note: VERY IMPORTANT to only terminate the updating loop on a missing CMU, not just on any failure to read). This removes the need to list CMUs.
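A sketch of that per-channel replay loop at init, assuming a read call that reports a missing key as `NotFound`; all types and names here are illustrative.

```rust
// Illustrative read loop for one channel at init: start from the stored CM's
// update_id and keep reading/applying CMUs until a key is missing. Only a
// NotFound result may end the loop; any other error must abort startup.
use std::io;

trait ReadStore {
    fn read(&self, namespace: &str, key: &str) -> io::Result<Vec<u8>>;
}

struct ChannelMonitorStub { update_id: u64 /* ... real CM state ... */ }

impl ChannelMonitorStub {
    fn apply_update(&mut self, _cmu_bytes: &[u8]) { self.update_id += 1; }
}

fn replay_updates<S: ReadStore>(
    store: &S, cm_key: &str, cm: &mut ChannelMonitorStub,
) -> io::Result<()> {
    let mut u = cm.update_id;
    loop {
        u += 1;
        let key = format!("{}_{}", cm_key, u);
        match store.read("monitor_updates", &key) {
            Ok(cmu_bytes) => cm.apply_update(&cmu_bytes),
            // Missing key == end of the sequence. This is the ONLY error that
            // may terminate the loop.
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            // Anything else (permissions, I/O): propagate and fail startup.
            Err(e) => return Err(e),
        }
    }
}
```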
Tolerance of stale CMUs and cleaning them up

MUP's cleanup of stale CMUs doesn't list all CMUs, and doesn't guarantee deletes succeed (e.g., the program could abort mid-cleanup). Therefore, it's possible to have stale CMUs on disk, where "stale" means having an `update_id` that is lower than the CM stored for that channel.

Stale CMUs will not be a problem for MUP itself. During init, these CMUs are never read. During normal cleanups, these CMUs are ignored (that is, they do not become additional work).

However, these can be a source of extra disk usage and a burden to carry around. So, MUP offers a `clean_stale_updates` function, which must be invoked specifically by the node, perhaps via an RPC or a `cron`-style task. This function lists the stored CMUs, and for each one, if the `update_id` of the CMU is lower than the `update_id`
The primary alternative considered was to use
KVStore
listing of CMUs at both the init time and during cleanup. This allows external storage metadata to be the "source of truth" about what is stored. But this comes at a high cost (lots of listing I/O) for little benefit, since theupdate_id
is very predictable. About the only benefit this provided is forcing examination of all stored updates at init and cleanup times, which did two things:clean_stale_updates
) to make cleaning them up out-of-band easy.KVStore
and its backends get error semantics for failed reads correct. Since we listed updates, we could treat a subsequentNotFound
error as a stop-the-world error, and never as a signal of the end of a sequence. We could also detect and error on a gap in updates. However, such a gap should cause the CM to conflict with theChannelManager
, and therefore prompt LDK to panic and abort, summoning the user for any possible recovery. TBD: Is this true, or is this corner even sharper than that?In all, the tradeoffs seemed to favor this design.
Tradeoffs and risk for the user
Using MUP will present all the same risks as the incumbent, simple, CM-only persister, with some additional ones.
Of note, one of the worst things one could do with MUP is restore old CMs from backup. This raises the prospect of punishment transactions, since stored CMUs would probably not be read and applied (though, reconciliation with the `ChannelManager`
is required before LDK will let you use this data with peers). However, doing such a restore with the CM-only persister would be equally disastrous, so no new risk is introduced.A novel risk with MUP is a failure to write CMUs without proper error handling (including immediate panic-and-abort). One may be able to persist CMs and not CMUs if they use different datastores for them (e.g., a DB for CMUs and object storage for CMs). It is even more critical that node operators using MUP carefully consider how to detect and handle storage failures.
Also, as mentioned, users must get the
`NotFound`
error semantics correct in their `KVStore`
implementation.

Expected interaction with different storage backends
MUP should be compatible with most backends that one would choose to use with `KVStore`. The greatest area of concern is the storage of CMUs, which could get to be many (the number of channels, times `maximum_pending_updates`, leaked stale updates notwithstanding).

Note that `KVStore` has a concept of "namespaces," which enumerate the various things LDK cares to store (CMs, CMUs, a channel manager, a network graph, etc.). All CMUs stored by MUP will be in a single namespace specific to CMUs.

Therefore, it's possible this namespace could need to accommodate millions or even billions of entries, depending on the scale of the node and how `maximum_pending_updates`
is set.KVStore
backends may want to consider special handling for the CMU namespace:[TXID]_[IDX]
) is probably wise.[TXID]_[IDX]
, which would collide.Backwards compatibility
While one can upgrade from CM-only persistence to MUP with no changes, downgrading is more complex.
Because MUP's CMs could have pending updates in storage, it's important that we don't make it too easy to read CMs from MUP into the CM-only persistence model, which assumes that stored CMs are always up-to-date. To disarm this "footgun," we prepend some sentinel bytes to MUP-written CMs that break deserialization for any implementation that doesn't know about the sentinel bytes.
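As an illustration of the sentinel mechanism (with a made-up marker value; the real bytes would be whatever MUP standardizes on):

```rust
// Illustrative only: the sentinel value below is made up. The idea is that a
// reader that does not know about the sentinel fails to deserialize the CM,
// instead of silently treating a possibly-behind CM as fully up to date.
const MUP_SENTINEL: &[u8] = b"LDK-MUP-CM\0"; // hypothetical marker

fn wrap_cm_for_storage(serialized_cm: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(MUP_SENTINEL.len() + serialized_cm.len());
    out.extend_from_slice(MUP_SENTINEL);
    out.extend_from_slice(serialized_cm);
    out
}

fn unwrap_stored_cm(stored: &[u8]) -> Result<&[u8], &'static str> {
    match stored.strip_prefix(MUP_SENTINEL) {
        Some(cm) => Ok(cm),
        // A plain (non-MUP) reader would hit this as a deserialization failure.
        None => Err("missing MUP sentinel; refusing to read"),
    }
}
```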
This way, one can downgrade through some deliberate steps, in no particular order:
Addenda
*: The author is speculating. I don't run an LDK wallet and don't know how many updates they really get per channel per day, but I am pretty sure it's a lot less than the 100k-per-hour we're forecasting for aggressively used routing channels.
Request for comment
Please comment on the "TBD" (to be determined) items.
Otherwise, looking for critique and/or consensus.