feat(swing-store): budget-limited deletion of snapshot and transcripts

Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or transcript spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling.

This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms.

Without this, the kernel's attempt to slowly delete a terminated vat would succeed in slowly draining the kvStore, but would trigger a gigantic SQL transaction at the end, as it deleted every transcript item in the vat's history. The worst-case example I found would be the mainnet chain's v43-walletFactory, which (as of apr-2024) has 8.2M transcript items in 40k spans. A fast machine takes two seconds just to count all the items, and deletion took 22 *minutes*, with a `swingstore.wal` file that peaked at 27 GiB. This would cause an enormous chain stall at some surprising point in time weeks or months after the vat was first terminated.

In addition, both the transcript spans and the snapshot records are shadowed into IAVL (via `export-data`) for integrity, and deleting 40k+40k=80k IAVL records in a single block might cause some significant churn too.

The kernel should call `transcriptStore.stopUsingTranscript()` and `snapStore.stopUsingLastSnapshot()` as soon as the vat is terminated, to make exports smaller right away (by omitting all transcript/snapshot artifacts for the given vat, even before those DB rows or their export-data records have been deleted).

New swing-store documentation was added.

refs #8928

Co-authored-by: Richard Gibson <[email protected]>
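
The budget arithmetic in this message can be checked with a small sketch. The two helper functions are hypothetical (they are not swing-store APIs), and the estimate assumes each transcript span holds roughly `snapInterval` items, which is only approximately true.

```javascript
// Hypothetical helpers illustrating the budget arithmetic; not swing-store APIs.

// Approximate transcriptItems rows removed by one budget-limited call,
// assuming each span holds about `snapInterval` items.
function rowsPerCall(budget, snapInterval = 200) {
  return budget * snapInterval;
}

// Number of calls needed to delete `spanCount` spans at `budget` spans per call.
function callsNeeded(spanCount, budget) {
  return Math.ceil(spanCount / budget);
}

console.log(rowsPerCall(5)); // 1000
console.log(callsNeeded(40_000, 5)); // 8000
```

If the kernel made one such call per block (an assumption about scheduling, not something this message specifies), draining the 40k spans cited for v43-walletFactory would take about 8000 blocks.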
Agoric · Aug 12, 2024 · f6787e8 · f6787e8
1 parent 6547c83
commit f6787e8

Showing 10 changed files with 1,006 additions and 33 deletions.
diff --git a/packages/SwingSet/docs/configuration.md b/packages/SwingSet/docs/configuration.md
@@ -150,7 +150,7 @@ The `snapshotInitial` property is a special snapshot interval that applies only
 to the vat's very first snapshot.  We treat it as a special case because a few
 of the very first cranks of a vat (which involve initialization) can be quite
 expensive, and we'd like to be able to promptly capture the benefit having paid
-that expense so that future replays don't need to repeat the work.  Defaults to 2.
+that expense so that future replays don't need to repeat the work.  Defaults to 3.
 
 The code that realizes a vat or device can be specified in one of five
 ways:

diff --git a/packages/swing-store/docs/bundlestore.md b/packages/swing-store/docs/bundlestore.md
@@ -0,0 +1,30 @@
+# BundleStore
+
+The `kernelStorage.bundleStore` sub-store manages code bundles. These can be used to hold vat-worker supervisor code (e.g. the [`@endo/lockdown`](https://github.com/endojs/endo/tree/master/packages/lockdown) bundle, or the [`@agoric/swingset-xsnap-supervisor` package](../../swingset-xsnap-supervisor), which incorporates liveslots), or the initial vat code bundles (both kernel-defined bundles like vat-comms or vat-timer, and application-defined bundles like vat-zoe or the ZCF code). It can also hold bundles that will be loaded later by userspace vat code, such as contract bundles.
+
+Each bundle held by the bundleStore is identified by a secure BundleID, which contains a format version integer and a hash, with a format like `b0-123abc456def...` or `b1-789ghi012...`. This contains enough information to securely define the behavior of the code inside the bundle, and to identify the tools needed to load/evaluate it.
+
+The bundleStore provides a simple add/get/remove API to the kernel. The kernel adds its own bundles during initialization, and provides the host application with an API to load additional ones later. The kernel code that creates new vats will read bundles from the bundleStore when necessary, as vats are created. Userspace can get access to "BundleCap" objects that represent bundles, to keep the large bundle blobs out of RAM as much as possible.
+
+## Data Model
+
+Bundles are actually JavaScript objects: records of at least `{ moduleFormat }`, plus some format-specific fields like `endoZipBase64` and `endoZipBase64Sha512`. They are created by the [`@endo/bundle-source`](https://github.com/endojs/endo/tree/master/packages/bundle-source) package. Many are consumed by [`@endo/import-bundle`](https://github.com/endojs/endo/tree/master/packages/import-bundle), but the `b0-` format bundles can be loaded with some simple string manipulation and a call to `eval()` (which is how supervisor bundles are injected into new vat workers, before `@endo/import-bundle` is available).
+
+The bundleStore database treats each bundle as a BundleID and a blob of contents. The SQLite `bundles` table is just `(bundleID TEXT, bundle BLOB)`. The bundleStore knows about each `moduleFormat` and how to extract the meaningful data and compress it into a blob, and how to produce the Bundle object during retrieval.
+
+The bundleStore also knows about the BundleID computation rules. The `addBundle()` API will verify that the contents match the ID; however, it currently relies upon the caller to verify e.g. that the bundle does not contain any unexpected properties. The `importSwingStore()` API performs more extensive validation, to prevent corruption during the export+import process.
+
+The kernel is expected to keep track of which bundles are needed and when (with reference counts), and to not delete a bundle unless it is really unneeded. Currently, this means all bundles are retained forever.
+
+Unlike the `snapStore`, there is no notion of pruning bundles: either the bundle is present (with all its data), or there is no record of the BundleID at all.
+
+## Export Model
+
+Each bundle gets a single export-data entry, whose name is `bundle.${bundleID}`, and whose value is just `${bundleID}`. Each bundle also gets a single export artifact, whose name is `bundle.${bundleID}`, and whose contents are the compressed BLOB from the database (from which a Bundle record can be reconstructed).
+
+## Slow Deletion
+
+Since bundles are not owned by vats, there is nothing to delete when a vat is terminated. So unlike `transcriptStore` and `snapStore`, there is no concept of "slow deletion", and no APIs to support it.
+
+When a bundle is deleted by `bundleStore.deleteBundle()`, its export-data item is deleted immediately, and subsequent exports will omit the corresponding artifact.
+
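
The BundleID shape and the export naming described in `bundlestore.md` can be sketched as follows. These helpers are illustrative only and are not bundleStore APIs; the parsing rule (a `b`, a decimal version, a dash, then the hash text) is an assumption inferred from the `b0-...` and `b1-...` examples in the text.

```javascript
// Illustrative helpers only; these are not bundleStore APIs.

// Split a BundleID such as "b1-abc123" into its format version and hash.
// Assumption: the ID is "b", a decimal version, "-", then the hash text.
function parseBundleID(bundleID) {
  const match = /^b(\d+)-(.+)$/.exec(bundleID);
  if (!match) {
    throw new Error(`not a BundleID: ${bundleID}`);
  }
  return { version: Number(match[1]), hash: match[2] };
}

// Names used by the export model described above: one export-data entry and
// one artifact per bundle, both named `bundle.${bundleID}`.
function bundleExportNames(bundleID) {
  const name = `bundle.${bundleID}`;
  return {
    exportData: [name, bundleID], // [key, value]
    artifactName: name,
  };
}

console.log(parseBundleID('b1-abc123')); // { version: 1, hash: 'abc123' }
console.log(bundleExportNames('b1-abc123').exportData);
// [ 'bundle.b1-abc123', 'b1-abc123' ]
```

The sketch does not verify that the hash matches the bundle contents; the text attributes that check to `addBundle()`.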
diff --git a/packages/swing-store/docs/kvstore.md b/packages/swing-store/docs/kvstore.md
@@ -0,0 +1,35 @@
+# KVStore
+
+The `kernelStorage.kvStore` sub-store manages a table of arbitrary key-value (string-to-string) pairs. It provides the usual get/set/has/delete APIs, plus a `getNextKey` call to support lexicographic iteration.
+
+There are three separate sections of the namespace. The normal one is the "consensus" section.  Each value written here will be given an export-data row, and incorporated into the "crankhash" (described below).
+
+The second is "local", and includes any key which is prefixed with `local.`. These keys are *not* given export-data rows, nor are they included in the crankhash.
+
+The third is "host", and includes any key which is prefixed with `host.`. This is not available to `kernelStorage.kvStore` at all: it is only accessed by methods on `hostStorage.kvStore` (the `kernelStorage` methods will throw an error if given a key like `host.foo`, and the `hostStorage` methods will throw *unless* given a key like `host.foo`). These are also excluded from export-data and the crankhash. Host keys are reserved for the host application, and are generally used to keep track of things like which block has been executed, to manage consistency between a separate host database (eg IAVL) and the swingstore. The host can record "I told the kernel to execute the contents of block 56" into `hostStorage.kvStore`, and then do `hostStorage.commit()`, and then it can record "I processed the rest of block 56" into its own DB, and then commit its own DB. If, upon startup, it observes a discrepancy between the `hostStorage.kvStore` record and its own DB, it knows it got interrupted between these two commit points, which can trigger recovery code.
+
+Any key which doesn't start with `local.` or `host.` is part of the "consensus" section.
+
+## CrankHash and ActivityHash
+
+Swingset kernels are frequently run in a consensus mode, where multiple instances of the kernel (on different machines) are expected to execute the same deliveries in lock-step. In this mode, every kernel is expected to do exactly the same computation, and any divergence indicates a failure (or attempt at malice). We want to detect such variations quickly, so the diverging/failing member can "fall out of consensus" promptly.
+
+The swingstore hashes all changes to the "consensus" portion of the kvStore into the "crank hash". This hash covers every change since the beginning of the current crank, and the kernel logs the result at the end of each crank, at which point the crankhash is reset.
+
+Each crank also updates a value called the "activity hash", by hashing the previous activityhash and the latest crankhash together. This records a chain of changes, and is logged at the end of each crank too.
+
+The host application can record the activityhash into its own consensus-tracking database (eg IAVL) at the end of each kernel run, to ensure that any internal divergence of swingset behavior is escalated to a proper consensus failure. Without this, one instance of the kernel might "think differently" than the others, but still "act" the same (in terms of IO or externally-visible messages) without triggering a failure, which would be a lurking problem.
+
+Logging both the crankhash and the activityhash improves our ability to diagnose consensus failures. By comparing logs between a "good" machine and a "bad" (diverging) one, we can quickly determine which crank caused the problem, and usually compare slogfile delivery/syscall records to narrow the divergence down to a specific syscall.
+
+kvStore changes are also recorded by the export-data, but these are too voluminous to be logged, and do not capture multiple changes to the same key. And not all host applications use exports, so there might not be anything watching export data.
+
+## Data Model
+
+The kvStore holds a simple string-to-string key/value store. The SQLite schema for the `kvStore` table is simply `(key TEXT, value TEXT)`.
+
+## Export Model
+
+To ensure that every key/value pair is correctly validatable, *all* in-consensus kvStore rows get their own export-data item. The name is just `kv.${key}`, and the value is just the value. `kvStore.delete(key)` will delete the export-data item. There are no artifacts.
+
+These make up the vast majority of the export-data items, both by count and by "churn" (the number of export-data items changed in a single crank). In the future, we would prefer to keep the kvStore in some sort of Merkle-tree data structure, and emit only a handful of export-data rows that contain hashes (perhaps just a single root hash). In this approach, the actual data would be exported in one or more artifacts. However, our SQLite backend does not provide the same kind of automatic Merkleization as IAVL, and only holds a single version of data at a time, making this impractical.
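
The key sections and the crankhash/activityhash chaining described in `kvstore.md` can be sketched with a toy tracker. This is not the swing-store implementation: the use of SHA-256, the line-based encoding of changes, and the names `sectionOf` and `makeHashTracker` are all assumptions made for illustration.

```javascript
const { createHash } = require('node:crypto');

// Classify a key into one of the three namespace sections described above.
function sectionOf(key) {
  if (key.startsWith('local.')) return 'local';
  if (key.startsWith('host.')) return 'host';
  return 'consensus';
}

// Toy hash tracker. Assumption: SHA-256 over "op/key/value" lines; the real
// swing-store encoding of changes may differ.
function makeHashTracker() {
  let crankHasher = createHash('sha256');
  let activityHash = ''; // hex string; empty before the first crank

  function noteChange(op, key, value = '') {
    if (sectionOf(key) !== 'consensus') return; // local/host keys are excluded
    crankHasher.update(`${op}\n${key}\n${value}\n`);
  }

  // Finish the crank: compute the crankhash, fold it into the activityhash,
  // and reset the crank hasher for the next crank.
  function endCrank() {
    const crankHash = crankHasher.digest('hex');
    crankHasher = createHash('sha256');
    activityHash = createHash('sha256')
      .update(`${activityHash}\n${crankHash}\n`)
      .digest('hex');
    return { crankHash, activityHash };
  }

  return { noteChange, endCrank };
}

const trackerA = makeHashTracker();
const trackerB = makeHashTracker();
trackerA.noteChange('set', 'vat.v1.foo', '1');
trackerA.noteChange('set', 'local.cache', 'x'); // ignored: not in consensus
trackerB.noteChange('set', 'vat.v1.foo', '1');
console.log(trackerA.endCrank().activityHash === trackerB.endCrank().activityHash); // true
```

Because each activityhash depends on the previous one, two kernels that diverge in any crank will disagree on every later activityhash, even if their later crankhashes match again.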
diff --git a/packages/swing-store/docs/snapstore.md b/packages/swing-store/docs/snapstore.md
@@ -0,0 +1,49 @@
+# SnapStore
+
+The `kernelStorage.snapStore` sub-store tracks vat heap snapshots. These blobs capture the state of an XS JavaScript engine, between deliveries, to speed up replay-based persistence. The kernel can start a vat worker from a recent heap snapshot, and then it only needs to replay a handful of transcript items (deliveries), instead of replaying every delivery since the beginning of the incarnation.
+
+The XS / [`xsnap`](../../xsnap) engine defines the heap snapshot format. It consists of a large table of "slots", which are linked together to form JavaScript objects, strings, Maps, functions, etc. The snapshot also includes "chunks" for large data fields (like strings and BigInts), a stack, and some other supporting tables. The snapStore doesn't care about any of the internal details: it just gets a big blob of bytes.
+
+## Data Model
+
+Each snapshot is compressed and stored in the SQLite row as a BLOB. The snapStore has a single table named `snapshots`, with a schema of `(vatID TEXT, snapPos INTEGER, inUse INTEGER, hash TEXT, uncompressedSize INTEGER, compressedSize INTEGER, compressedSnapshot BLOB)`.
+
+The kernel has a scheduler which decides when to take a heap snapshot for each vat. There is a tradeoff between the immediate cost of creating the snapshot, versus the expected future savings of having a shorter transcript to replay. More frequent snapshots save time later, at the cost of time spent now.
+
+The kernel currently uses a [very simple scheduler](../../SwingSet/src/kernel/vat-warehouse.js), which takes a snapshot every `snapshotInterval` deliveries (e.g. 200), plus an extra one a few deliveries (`snapshotInitial`) into the new incarnation, to avoid replaying expensive contract startup code. The [SwingSet configuration documentation](../../SwingSet/docs/configuration.md) has the details.
+
+However, the swingstore is unaware of the kernel's scheduling policy. Every once in a while, the kernel tells the snapStore about a new snapshot, and the snapStore updates its data.
+
+Like the [transcriptStore](./transcriptstore.md), the snapStore retains a hash of older records, even after it prunes the snapshot data itself. There is at most one `inUse = 1` record for each vatID, and it will always have the highest `snapPos` value. When a particular vatID's active snapshot is replaced, the SQLite table row is updated to clear the `inUse` flag (i.e. set it to NULL). By default, the `compressedSnapshot` field is also set to NULL, removing the large data blob, but there is an option (`keepSnapshots: true`) to retain the full contents of all snapshots, even the ones that are no longer in use.
+
+## Export Model
+
+Each snapshot, both current and historic, gets an export-data entry. The name is `snapshot.${vatID}.${position}`, where `position` is the latest delivery (i.e. the highest delivery number) that was included in the heap state captured by the snapshot. The value is a JSON-serialized record of `{ vatID, snapPos, hash, inUse }`.
+
+If there is a "current" snapshot, there will be one additional export-data record, whose name is `snapshot.${vatID}.current`, and whose value is `snapshot.${vatID}.${position}`. This value is the same as the name of the latest export-data record, and is meant as a convenient pointer to find that latest snapshot.
+
+The export *artifacts* will generally only include the current snapshot for each vat. Only the `debug` mode will include historical snapshots (and only if the swingstore was retaining them in the first place).
+
+## Slow Deletion
+
+As soon as a vat is terminated, the kernel will call `snapStore.stopUsingLastSnapshot()`. The DB is updated to clear the `inUse` flag of the latest snapshot, leaving no rows with `inUse = 1`. This immediately makes the vat non-loadable by the kernel. The snapshot data itself is deleted (unless `keepSnapshots: true`).
+
+This also modifies the latest `snapshot.${vatID}.${snapPos}` export-data record, to change `inUse` to 0, and removes the `snapshot.${vatID}.current` record. The modification and deletion are added to the export-data callback queue, so the host-app can learn about them after the next commit. Any subsequent `getExportData()` calls will observe the changes.
+
+As a result, all non-`debug` swing-store exports after this point will omit any artifacts for the snapshot blob, but they will still include export-data records (hashes) for all snapshots. (Deleting all the export-data records is too much work to do in a single step, so it is spread out over time).
+
+Later, as the kernel performs cleanup work for this vatID, each cleanup call will delete DB rows (at most `budget` rows per call). Each row deleted will also remove one export-data record, which feeds the callback queue, as well as affecting the full `getExportData()` results.
+
+Eventually, the snapStore runs out of rows to delete, and `deleteVatSnapshots(vatID, budget)` returns `{ done: true }`, so the kernel can finally rest.
+
+### SnapStore Vat Lifetime
+
+The SnapStore doesn't provide an explicit API to call when a vat is first created. The kernel just calls `saveSnapshot()` for both the first and all subsequent snapshots. Each `saveSnapshot()` marks the previous snapshot as unused, so there is at most one `inUse = 1` snapshot at any time. There will be zero in-use snapshots just after each incarnation starts, until enough deliveries have been made to trigger the first snapshot.
+
+When terminating a vat, the kernel should first call `snapStore.stopUsingLastSnapshot(vatID)`, the same call it would make at the end of an incarnation, to indicate that we're no longer using the last snapshot. This results in zero in-use snapshots.
+
+Then, the kernel must either call `snapStore.deleteVatSnapshots(vatID, undefined)` to delete everything at once, or make a series of calls (spread out over time/blocks) to `snapStore.deleteVatSnapshots(vatID, budget)`. Each will return `{ done, cleanups }`, which can be used to manage the rate-limiting and know when the process is finished.
+
+The `stopUsingLastSnapshot()` call is a performance improvement, but is not mandatory. If omitted, exports will continue to include the vat's snapshot artifacts until the first call to `deleteVatSnapshots()`, after which they will go away. Snapshots are deleted in descending `snapPos` order, so the first call will delete the only `inUse = 1` snapshot, after which exports will omit all artifacts for the vatID. `stopUsingLastSnapshot()` is idempotent, and extra calls will leave the DB unchanged.
+
+The kernel must keep calling `deleteVatSnapshots(vatID, budget)` until the `{ done }` return value is `true`. It is safe to call it again after that point; the function will keep returning `{ done: true }`. But note, each such call costs one DB transaction, so it may be cheaper for the kernel to somehow remember that we've reached the end.
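
The vat-termination sequence described in `snapstore.md` can be sketched against an in-memory stand-in. `makeMockSnapStore` and `drainVatSnapshots` are hypothetical names; the mock only mimics the call shapes named in the text and is not the SQLite-backed snapStore. In particular, the mock reports `done` only when a budgeted call finds nothing left to delete, which is one possible reading of the text and should be treated as an assumption.

```javascript
// In-memory stand-in for the snapStore, for illustration only.
function makeMockSnapStore(snapshots) {
  // snapshots: array of { vatID, snapPos, inUse }; copied so callers' objects
  // are not mutated.
  let rows = snapshots.map(row => ({ ...row }));

  function stopUsingLastSnapshot(vatID) {
    for (const row of rows) {
      if (row.vatID === vatID) row.inUse = null; // idempotent
    }
  }

  function deleteVatSnapshots(vatID, budget = undefined) {
    // Delete in descending snapPos order, at most `budget` rows per call.
    const mine = rows
      .filter(row => row.vatID === vatID)
      .sort((x, y) => y.snapPos - x.snapPos);
    const doomed = budget === undefined ? mine : mine.slice(0, budget);
    rows = rows.filter(row => !doomed.includes(row));
    // Assumption: a budgeted call is "done" only once it finds no rows.
    const done = budget === undefined || doomed.length === 0;
    return { done, cleanups: doomed.length };
  }

  return { stopUsingLastSnapshot, deleteVatSnapshots };
}

// Hypothetical driver. A real kernel would spread these calls over many
// blocks instead of looping in one place.
function drainVatSnapshots(snapStore, vatID, budget) {
  snapStore.stopUsingLastSnapshot(vatID);
  let calls = 0;
  let total = 0;
  for (;;) {
    const { done, cleanups } = snapStore.deleteVatSnapshots(vatID, budget);
    calls += 1;
    total += cleanups;
    if (done) return { calls, total };
  }
}

const demoStore = makeMockSnapStore(
  Array.from({ length: 12 }, (_, i) => ({
    vatID: 'v43',
    snapPos: (i + 1) * 200,
    inUse: i === 11 ? 1 : null,
  })),
);
console.log(drainVatSnapshots(demoStore, 'v43', 5)); // { calls: 4, total: 12 }
```

With 12 rows and a budget of 5, the mock deletes 5, 5, and 2 rows, and the fourth call finds nothing and reports `done: true`. A caller that remembers this result avoids paying for further no-op transactions.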