Skip to content

Commit

Permalink
fix(swingset): clean up promise c-list entries during vat deletion
Browse files Browse the repository at this point in the history
Previously, when a vat was terminated, and we delete the promise
c-list entries from its old state, the cleanup code was failing to
decrement the kpid's refcount properly. This resulted in a leak: those
promises could never be retired.

This commit updates the vat cleanup code to add a new phase, named
`promises`. This executes after `exports` and `imports`, but before
`kv`, and is responsible for both deleting the c-list entries and also
decrementing the refcounts of the corresponding promises. We do this
slowly, like we do exports and imports, because we don't know how many
there might be, and because those promise records might hold
references to other objects (in the resolution data), which could
trigger additional work. However, this work is unlikely to be
significant: the run-queue is usually empty, so these outstanding
promises are probably unresolved, and thus cannot beholding resolution
data.

All promises *decided* by the dead vat are rejected by the kernel
immediately during vat termination, because those rejections are
visible to userspace in other vats. In contrast, freeing the promise
records is *not* visible to userspace, just like how freeing imports
or exports are not visible to userspace, so this cleanup is safe to do
at a leisurely pace, rate-limited by `runPolicy.allowCleanup`.

The docs are updated to reflect the new `runPolicy` API:

* `budget.promises` is new, and respected by slow cleanup
* `work.promises` is reported to `runPolicy.didCleanup()`

The 'test.failing' marker was removed from the previously updated
tests.

I don't intend to add any remediation code: it requires a full
refcount audit to find such promises, and the mainnet kernel has only
ever terminated one vat so far, so I believe there cannot be very many
leaked promises, if any. Once this fix is applied, no new leaks will
occur.

fixes #10261
  • Loading branch information
warner committed Oct 12, 2024
1 parent 62f9fdb commit b0c4f74
Show file tree
Hide file tree
Showing 6 changed files with 31 additions and 9 deletions.
11 changes: 7 additions & 4 deletions packages/SwingSet/docs/run-policy.md
Original file line number Diff line number Diff line change
Expand Up @@ -68,16 +68,19 @@ Some vats may grow very large (i.e. large c-lists with lots of imported/exported

To protect the system against these bursts, the run policy can be configured to terminate vats slowly. Instead of doing all the cleanup work immediately, the policy allows the kernel to do a little bit of work each time `controller.run()` is called (e.g. once per block, for kernels hosted inside a blockchain). Internally, before servicing the run-queue, the kernel checks to see if any vats are in the "terminated but not fully deleted" state, and executes a "vat-cleanup crank", to delete some state. Depending upon what the run-policy allows, it may do multiple vat-cleanup cranks in a single `controller.run()`, or just one, or none at all. And depending upon the budget provided to each one, it may only take one vat-cleanup crank to finish the job, or millions. If the policy limits the number of cranks in a single block, and limits the budget of the crank, then the cleanup process will be spread over multiple blocks.

For each terminated vat, cleanup proceeds through five phases:
For each terminated vat, cleanup proceeds through six phases:

* `exports`: delete c-list entries for objects/promises *exported* by the vat
* `imports`: same but for objects/promises *imported* by the vat
* `exports`: delete c-list entries for objects *exported* by the vat
* `imports`: same but for objects *imported* by the vat
* `promises`: same but promises referenced by the vat
* `kv`: delete all other kv entries for this vat, mostly vatstore records
* `snapshots`: delete xsnap heap snapshots, typically one per 200 deliveries (`snapshotInterval`)
* `transcripts`: delete transcript spans, and their associated transcript items

The first two phases, `exports` and `imports`, cause the most activity in other vats. Deleting `exports` can cause objects to be retired, which will deliver `dispatch.retireImports` GC messages to other vats which have imported those objects and used them as keys in a WeakMapStore. Deleting `imports` can cause refcounts to drop to zero, delivering `dispatch.dropImports` into vats which were exporting those objects. Both of these will add `gcKref` "dirt" to the other vat, eventually triggering a BringOutYourDead, which will cause more DB activity. These are generally the phases we must rate-limit to avoid overwhelming the system.

The `promises` phase may cause kernel promise table entries to be deleted. This is unlikely to trigger activity in other vats, because only settled promises can have references to other objects, and the notification of settlement is enqueued to all subscribers as soon as the promise is fulfilled or rejected. However, if some of these notifications were still in the run-queue when the vat got deleted, the GC work might be accelerated slightly.

The other phases cause DB activity (deleting rows), but do not interact with other vats, so it is easier to accomodate more cleanup steps. The budget can be tuned to allow more kv/snapshots/transcripts than exports/imports in a single cleanup run.

There are two RunPolicy methods which control this. The first is `runPolicy.allowCleanup()`. This will be invoked many times during `controller.run()`, each time the kernel tries to decide what to do next (once per step). The return value will enable (or not) a fixed amount of cleanup work. The second is `runPolicy.didCleanup({ cleanups })`, which is called later, to inform the policy of how much cleanup work was actually done. The policy can count the cleanups and switch `allowCleanup()` to return `false` when it reaches a threshold. (We need the pre-check `allowCleanup` method because the simple act of looking for cleanup work is itself a cost that we might not be willing to pay).
Expand All @@ -92,7 +95,7 @@ The limit can be set to `Infinity` to allow unlimited deletion of that particula

Each budget record must include a `{ default }` property, which is used as the default for any phase that is not explicitly mentioned in the budget. This also provides forwards-compatibility for any phases that might be added in the future. So `budget = { default: 5 }` would provide a conservative budget for cleanup, `budget = { default: 5, kv: 50 }` would enable faster deletion of the non-c-list kvstore entries, and `budget = { default: Infinity }` allows unlimited cleanup for all phases.

Note that the cleanup crank ends when no more work is left to be done (which finally allows the vat to be forgotten entirely), or when an individual phase's budget is exceeded. That means multiple phases might see deletion work in a single crank, if the earlier phase finishes its work without exhausting its budget. For example, if the budget is `{ default: 5 }`, but the vat had 4 exports, 4 imports, 4 other kv entries, 4 snapshots, and 4 transcript spans, then all that work would be done in a single crank, because no individual phase would exhaust its budget. The only case that is even worth mentioning is when the end of the `exports` phase overlaps with the start of the `imports` phase, where we might do four more cleanups than usual.
Note that the cleanup crank ends when no more work is left to be done (which finally allows the vat to be forgotten entirely), or when an individual phase's budget is exceeded. That means multiple phases might see deletion work in a single crank, if the earlier phase finishes its work without exhausting its budget. For example, if the budget is `{ default: 5 }`, but the vat had 4 exports, 4 imports, 4 promise entries, 4 other kv entries, 4 snapshots, and 4 transcript spans, then all that work would be done in a single crank, because no individual phase would exhaust its budget. The only case that is even worth mentioning is when the end of the `exports` phase overlaps with the start of the `imports` phase, where we might do four more cleanups than usual.

A `true` return value from `allowCleanup()` is equivalent to `{ default: Infinity }`, which allows unlimited cleanup work. This also happens if `allowCleanup()` is missing entirely, which maintains the old behavior for host applications that haven't been updated to make new policy objects. Note that cleanup is higher priority than any delivery, and is second only to acceptance queue routing.

Expand Down
2 changes: 2 additions & 0 deletions packages/SwingSet/src/kernel/kernel.js
Original file line number Diff line number Diff line change
Expand Up @@ -701,6 +701,7 @@ export default function buildKernel(
total:
work.exports +
work.imports +
work.promises +
work.kv +
work.snapshots +
work.transcripts,
Expand Down Expand Up @@ -1820,6 +1821,7 @@ export default function buildKernel(
{
exports: M.number(),
imports: M.number(),
promises: M.number(),
kv: M.number(),
snapshots: M.number(),
transcripts: M.number(),
Expand Down
20 changes: 18 additions & 2 deletions packages/SwingSet/src/kernel/state/kernelKeeper.js
Original file line number Diff line number Diff line change
Expand Up @@ -967,6 +967,7 @@ export default function makeKernelKeeper(
const work = {
exports: 0,
imports: 0,
promises: 0,
kv: 0,
snapshots: 0,
transcripts: 0,
Expand All @@ -991,6 +992,7 @@ export default function makeKernelKeeper(
const clistPrefix = `${vatID}.c.`;
const exportPrefix = `${clistPrefix}o+`;
const importPrefix = `${clistPrefix}o-`;
const promisePrefix = `${clistPrefix}p`;

// Note: ASCII order is "+,-./", and we rely upon this to split the
// keyspace into the various o+NN/o-NN/etc spaces. If we were using a
Expand Down Expand Up @@ -1043,8 +1045,22 @@ export default function makeKernelKeeper(
}
}

// the caller used enumeratePromisesByDecider() before calling us,
// so they already know the orphaned promises to reject
// The caller used enumeratePromisesByDecider() before calling us,
// so they have already rejected the orphan promises, but those
// kpids are still present in the dead vat's c-list. Clean those
// up now.
remaining = budget.promises ?? budget.default;
for (const k of enumeratePrefixedKeys(kvStore, promisePrefix)) {
const kref = kvStore.get(k) || Fail`getNextKey ensures get`;
const vref = stripPrefix(clistPrefix, k);
vatKeeper.deleteCListEntry(kref, vref);
// that will also delete both db keys
work.promises += 1;
remaining -= 1;
if (remaining <= 0) {
return { done: false, work };
}
}

// now loop back through everything and delete it all
remaining = budget.kv ?? budget.default;
Expand Down
1 change: 1 addition & 0 deletions packages/SwingSet/src/types-external.js
Original file line number Diff line number Diff line change
Expand Up @@ -227,6 +227,7 @@ export {};
*
* @typedef { { exports: number,
* imports: number,
* promises: number,
* kv: number,
* snapshots: number,
* transcripts: number,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -314,10 +314,10 @@ async function doSlowTerminate(t, mode) {
t.false(JSON.parse(kvStore.get('vats.terminated')).includes(vatID));
}

test.serial.failing('slow terminate (kill)', async t => {
test.serial('slow terminate (kill)', async t => {
await doSlowTerminate(t, 'kill');
});

test.serial.failing('slow terminate (die happy)', async t => {
test.serial('slow terminate (die happy)', async t => {
await doSlowTerminate(t, 'dieHappy');
});
Original file line number Diff line number Diff line change
Expand Up @@ -449,7 +449,7 @@ test.serial('invalid criticalVatKey causes vat creation to fail', async t => {
});
});

test.serial.failing('dead vat state removed', async t => {
test.serial('dead vat state removed', async t => {
const configPath = new URL('swingset-die-cleanly.json', import.meta.url)
.pathname;
const config = await loadSwingsetConfigFile(configPath);
Expand Down

0 comments on commit b0c4f74

Please sign in to comment.