feat(swingset): allow slow termination/deletion of vats
This introduces new `runPolicy()` controls which enable "slow
termination" of vats. When configured, terminated vats are immediately
dead (all promises are rejected, all new messages go splat, they never
run again), however the vat's state is deleted slowly, one piece at a
time. This makes it safe to terminate large vats, with a long history,
lots of c-list imports/exports, or large vatstore tables, without fear
of causing an overload (by e.g. dropping 100k references all in a
single crank).

See docs/run-policy.md for details and configuration instructions.

refs #8928
warner committed Jun 10, 2024
1 parent e2f75f4 commit 13ea1dc
Showing 15 changed files with 778 additions and 32 deletions.
138 changes: 133 additions & 5 deletions packages/SwingSet/docs/run-policy.md
Expand Up @@ -35,7 +35,12 @@ The kernel will invoke the following methods on the policy object (so all must e
* `policy.crankFailed()`
* `policy.emptyCrank()`

All methods should return `true` if the kernel should keep running, or `false` if it should stop.
All those methods should return `true` if the kernel should keep running, or `false` if it should stop.

The following methods are optional (for backwards compatibility with policy objects created for older kernels):

* `policy.allowCleanup()` : may return budget, see "Terminated-Vat Cleanup" below
* `policy.didCleanup({ cleanups })` (if missing, kernel pretends it returned `true` to keep running)

The `computrons` argument may be `undefined` (e.g. if the crank was delivered to a non-`xs worker`-based vat, such as the comms vat). The policy should probably treat this as equivalent to some "typical" number of computrons.
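
As a minimal sketch of that advice, a policy could substitute a fallback whenever `computrons` is absent. The name `TYPICAL_COMPUTRONS`, its value, and the helper `makeTolerantComputronPolicy` are hypothetical and chosen only for illustration. `harden()` is omitted so the snippet runs outside a SES environment.

```js
// Hypothetical fallback for cranks that report no computrons (e.g. non-xsnap
// vats). The value is an assumption for this sketch, not a kernel constant.
const TYPICAL_COMPUTRONS = 100_000n;

function makeTolerantComputronPolicy(limit) {
  let total = 0n;
  // The destructuring default applies when `computrons` is missing or undefined.
  const charge = (details = {}) => {
    const { computrons = TYPICAL_COMPUTRONS } = details;
    total += computrons;
    return total < limit; // true: keep running, false: stop
  };
  return {
    vatCreated: charge,
    crankComplete: charge,
    crankFailed: charge,
    emptyCrank: () => true,
  };
}
```

A real policy would probably charge different amounts for vat creation and failed cranks. They share one helper here only to keep the sketch short.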

Expand All @@ -53,6 +58,27 @@ More arguments may be added in the future, such as:

The run policy should be provided as the first argument to `controller.run()`. If omitted, the kernel defaults to `forever`, a policy that runs until the queue is empty.

## Terminated-Vat Cleanup

Some vats may grow very large (i.e. large c-lists with lots of imported/exported objects, or lots of vatstore entries). If/when these are terminated, the burst of cleanup work might overwhelm the kernel, especially when processing all the dropped imports (which trigger GC messages to other vats).

To protect the system against these bursts, the run policy can be configured to terminate vats slowly. Instead of doing all the cleanup work immediately, the policy allows the kernel to do a little bit of work each time `controller.run()` is called (e.g. once per block, for kernels hosted inside a blockchain).

There are two RunPolicy methods which control this. The first is `runPolicy.allowCleanup()`. This will be invoked many times during `controller.run()`, each time the kernel tries to decide what to do next (once per step). The return value will enable (or not) a fixed amount of cleanup work. The second is `runPolicy.didCleanup({ cleanups })`, which is called later, to inform the policy of how much cleanup work was actually done. The policy can count the cleanups and switch `allowCleanup()` to return `false` when it reaches a threshold. (We need the pre-check `allowCleanup` method because the simple act of looking for cleanup work is itself a cost that we might not be able to afford).

If `allowCleanup()` exists, it must either return a falsy value, or an object. This object may have a `budget` property, which must be a number.

A falsy return value (e.g. `allowCleanup: () => false`) prohibits cleanup work. This can be useful in an "only clean up during idle blocks" approach (see below), but should not be the only policy used, otherwise vat cleanup would never happen.

A numeric `budget` limits how many cleanups are allowed to happen (if any are needed). One "cleanup" will delete one vatstore row, or one c-list entry (note that c-list deletion may trigger GC work), or one heap snapshot record, or one transcript span (and its populated transcript items). Using `{ budget: 5 }` seems to be a reasonable limit on each call, balancing overhead against doing sufficiently small units of work that we can limit the total work performed.

If `budget` is missing or `undefined`, the kernel will perform unlimited cleanup work. This also happens if `allowCleanup()` is missing entirely, which maintains the old behavior for host applications that haven't been updated to make new policy objects. Note that cleanup is higher priority than anything else, followed by GC work, then BringOutYourDead, then message delivery.

`didCleanup({ cleanups })` is called when the kernel actually performed some vat-termination cleanup, and the `cleanups` property is a number with the count of cleanups that took place. Each query to `allowCleanup()` might (or might not) be followed by a call to `didCleanup`, with a `cleanups` value that does not exceed the specified budget.

To limit the work done per block (for blockchain-based applications) the host's RunPolicy objects must keep track of how many cleanups were reported, and change the behavior of `allowCleanup()` when it reaches a per-block threshold. See below for examples.
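
The `allowCleanup()` / `didCleanup()` handshake described above can be sketched with a toy driver. `simulateRun` is a hypothetical stand-in written for this illustration, not the real kernel loop: it models only cleanup work and ignores deliveries, GC, and BringOutYourDead. `makeBudgetedCleanupPolicy` is likewise an assumed helper name.

```js
// A policy that allows at most `limit` cleanups in total.
function makeBudgetedCleanupPolicy(limit) {
  let remaining = limit;
  return {
    allowCleanup: () => (remaining > 0 ? { budget: remaining } : false),
    didCleanup: ({ cleanups }) => {
      remaining -= cleanups;
      return true; // keep running
    },
  };
}

// Hypothetical driver: asks the policy before each step, performs up to
// `budget` cleanups, then reports how many were actually done.
// Returns the number of cleanups still pending when the run stops.
function simulateRun(policy, pendingCleanups) {
  let pending = pendingCleanups;
  while (pending > 0) {
    // A missing allowCleanup() behaves like "unlimited budget".
    const allowed = policy.allowCleanup ? policy.allowCleanup() : {};
    if (!allowed) break; // falsy: cleanup is prohibited
    const budget = allowed.budget === undefined ? pending : allowed.budget;
    const cleanups = Math.min(budget, pending);
    if (cleanups === 0) break; // nothing permitted, avoid spinning
    pending -= cleanups;
    // A missing didCleanup() is treated as if it returned true.
    const didCleanup = policy.didCleanup || (() => true);
    if (!didCleanup({ cleanups })) break;
  }
  return pending;
}
```

With 12 pending cleanups, `simulateRun(makeBudgetedCleanupPolicy(5), 12)` leaves 7 for later runs, while an old-style policy object with neither method (`simulateRun({}, 12)`) clears everything in one step.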


## Typical Run Policies

A basic policy might simply limit the block to 100 cranks with deliveries and two vat creations:
Expand All @@ -78,6 +104,7 @@ function make100CrankPolicy() {
return true;
},
});
return policy;
}
```

Expand All @@ -95,15 +122,15 @@ while(1) {

Note that a new policy object should be provided for each call to `run()`.

A more sophisticated one would count computrons. Suppose that experiments suggest that one million computrons take about 5 seconds to execute. The policy would look like:
A more sophisticated one would count computrons. Suppose that experiments suggest that sixty-five million computrons take about 5 seconds to execute. The policy would look like:


```js
function makeComputronCounterPolicy(limit) {
let total = 0;
let total = 0n;
const policy = harden({
vatCreated() {
total += 100000; // pretend vat creation takes 100k computrons
total += 1_000_000n; // pretend vat creation takes 1M computrons
return (total < limit);
},
crankComplete(details) {
Expand All @@ -112,18 +139,119 @@ function makeComputronCounterPolicy(limit) {
return (total < limit);
},
crankFailed() {
total += 1000000; // who knows, 1M is as good as anything
total += 65_000_000n; // who knows, 65M is as good as anything
return (total < limit);
},
emptyCrank() {
return true;
}
});
return policy;
}
```

See `src/runPolicies.js` for examples.

To slowly terminate vats, limiting each block to 5 cleanups, the policy should start with a budget of 5, return the remaining `{ budget }` from `allowCleanup()`, and decrement it as `didCleanup` reports that budget being consumed:

```js
function makeSlowTerminationPolicy() {
  let cranks = 0;
  let vats = 0;
  let cleanups = 5; // remaining cleanup budget for this block
  const policy = harden({
    vatCreated() {
      vats += 1;
      return (vats < 2);
    },
    crankComplete(details) {
      cranks += 1;
      return (cranks < 100);
    },
    crankFailed() {
      cranks += 1;
      return (cranks < 100);
    },
    emptyCrank() {
      return true;
    },
    allowCleanup() {
      if (cleanups > 0) {
        return { budget: cleanups };
      } else {
        return false;
      }
    },
    didCleanup(spent) {
      cleanups -= spent.cleanups;
      return true; // like the other methods, return true to keep running
    },
  });
  return policy;
}
```

A more conservative approach might only allow cleanup in otherwise-empty blocks. To accomplish this, use two separate policy objects, and two separate "runs". The first run only performs deliveries, and prohibits all cleanups:

```js
function makeDeliveryOnlyPolicy() {
  let empty = true;
  const didWork = () => { empty = false; return true; };
  const policy = harden({
    vatCreated: didWork,
    crankComplete: didWork,
    crankFailed: didWork,
    emptyCrank: didWork,
    allowCleanup: () => false,
  });
  const wasEmpty = () => empty;
  return [ policy, wasEmpty ];
}
```

The second only performs cleanup, with a limited budget, stopping the run after any deliveries occur (such as GC actions):

```js
function makeCleanupOnlyPolicy() {
  let cleanups = 5; // remaining cleanup budget for this block
  const stop = () => false;
  const policy = harden({
    vatCreated: stop,
    crankComplete: stop,
    crankFailed: stop,
    emptyCrank: stop,
    allowCleanup() {
      if (cleanups > 0) {
        return { budget: cleanups };
      } else {
        return false;
      }
    },
    didCleanup(spent) {
      cleanups -= spent.cleanups;
      return true; // keep running until a delivery triggers stop()
    },
  });
  return policy;
}
```

On each block, the host should only perform the second (cleanup) run if the first policy reports that the block was empty:

```js
async function doBlock() {
  const [ firstPolicy, wasEmpty ] = makeDeliveryOnlyPolicy();
  await controller.run(firstPolicy);
  if (wasEmpty()) {
    const secondPolicy = makeCleanupOnlyPolicy();
    await controller.run(secondPolicy);
  }
}
```

Note that regardless of whatever computron/delivery budget is imposed by the first policy, the second policy will allow one additional delivery to be made (we do not yet have an `allowDelivery()` pre-check method that might inhibit this). The cleanup work, which may or may not happen, will sometimes trigger a GC delivery like `dispatch.dropExports`, but at most one such delivery will be made before the second policy returns `false` and stops `controller.run()`. If cleanup does not trigger such a delivery, or if no cleanup work needs to be done, then one normal run-queue delivery will be performed before the policy has a chance to say "stop". All other cleanup-triggered GC work will be deferred until the first run of the next block.

Also note that `budget` and `cleanups` are plain `Number`s, whereas `computrons` is a `BigInt`.
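
The distinction matters if a policy ever combines the two kinds of counters, because JavaScript refuses to mix `BigInt` and `Number` in arithmetic. A short illustration (the variable names are only for this sketch):

```js
const cleanups = 3;        // Number, as reported to didCleanup()
const computrons = 1_000n; // BigInt, as reported to crankComplete()

// Mixing the two types in arithmetic throws a TypeError.
let mixedResult;
try {
  mixedResult = computrons + cleanups;
} catch (err) {
  mixedResult = err.name;
}

// Convert explicitly before combining them.
const combined = computrons + BigInt(cleanups);
```

Comparisons such as `computrons < 2000` do work across the two types, but keeping separate counters for cleanups and computrons avoids the issue entirely.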


## Non-Consensus Wallclock Limits

If the SwingSet kernel is not being operated in consensus mode, then it is safe to use wallclock time as a block limit:
73 changes: 63 additions & 10 deletions packages/SwingSet/src/kernel/kernel.js
Expand Up @@ -266,12 +266,17 @@ export default function buildKernel(
// (#9157). The fix will add .critical to CrankResults, populated by a
// getOptions query in deliveryCrankResults() or copied from
// dynamicOptions in processCreateVat.
critical = kernelKeeper.provideVatKeeper(vatID).getOptions().critical;
const vatKeeper = kernelKeeper.provideVatKeeper(vatID);
critical = vatKeeper.getOptions().critical;

// Reject all promises decided by the vat, making sure to capture the list
// of kpids before that data is deleted.
const deadPromises = [...kernelKeeper.enumeratePromisesByDecider(vatID)];
kernelKeeper.cleanupAfterTerminatedVat(vatID);
// remove vatID from the list of live vats, and mark for deletion
kernelKeeper.deleteVatID(vatID);
kernelKeeper.addTerminatedVat(vatID);
// remove vat from swing-store exports
kernelKeeper.removeVat(vatID);
for (const kpid of deadPromises) {
resolveToError(kpid, makeError('vat terminated'), vatID);
}
Expand Down Expand Up @@ -378,7 +383,8 @@ export default function buildKernel(
* abort?: boolean, // changes should be discarded, not committed
* consumeMessage?: boolean, // discard the aborted delivery
* didDelivery?: VatID, // we made a delivery to a vat, for run policy and save-snapshot
* computrons?: BigInt, // computron count for run policy
* computrons?: bigint, // computron count for run policy
* cleanups?: number, // cleanup budget spent
* meterID?: string, // deduct those computrons from a meter
* measureDirt?: { vatID: VatID, dirt: Dirt }, // the dirt counter should increment
* terminate?: { vatID: VatID, reject: boolean, info: SwingSetCapData }, // terminate vat, notify vat-admin
Expand Down Expand Up @@ -642,16 +648,39 @@ export default function buildKernel(
if (!vatWarehouse.lookup(vatID)) {
return NO_DELIVERY_CRANK_RESULTS; // can't collect from the dead
}
const vatKeeper = kernelKeeper.provideVatKeeper(vatID);
/** @type { KernelDeliveryBringOutYourDead } */
const kd = harden([type]);
const vd = vatWarehouse.kernelDeliveryToVatDelivery(vatID, kd);
const status = await deliverAndLogToVat(vatID, kd, vd);
vatKeeper.clearReapDirt(); // BOYD zeros out the when-to-BOYD counters
// no gcKrefs, BOYD clears them anyways
return deliveryCrankResults(vatID, status, false); // no meter, BOYD clears dirt
}

/**
* Perform a small (budget-limited) amount of dead-vat cleanup work.
*
* @param {RunQueueEventCleanupTerminatedVat} message
* 'message' is the run-queue cleanup action, which includes a vatID and budget.
* A budget of 'undefined' allows unlimited work. Otherwise, the budget is a Number,
* and cleanup should not touch more than maybe 5*budget DB rows.
* @returns {Promise<CrankResults>}
*/
async function processCleanupTerminatedVat(message) {
const { vatID, budget } = message;
const { done, cleanups } = kernelKeeper.cleanupAfterTerminatedVat(
vatID,
budget,
);
if (done) {
kernelKeeper.deleteTerminatedVat(vatID);
}
// We don't perform any deliveries here, so there are no computrons to
// report, but we do let the runPolicy know how much kernel-side DB
// work we did, so it can decide how much was too much.
const computrons = 0n;
return harden({ computrons, cleanups });
}

/**
* The 'startVat' event is queued by `initializeKernel` for all static vats,
* so that we execute their bundle imports and call their `buildRootObject`
Expand Down Expand Up @@ -903,7 +932,6 @@ export default function buildKernel(
const boydVD = vatWarehouse.kernelDeliveryToVatDelivery(vatID, boydKD);
const boydStatus = await deliverAndLogToVat(vatID, boydKD, boydVD);
const boydResults = deliveryCrankResults(vatID, boydStatus, false);
vatKeeper.clearReapDirt();

// we don't meter bringOutYourDead since no user code is running, but we
// still report computrons to the runPolicy
Expand Down Expand Up @@ -1159,6 +1187,7 @@ export default function buildKernel(
* @typedef { import('../types-internal.js').RunQueueEventRetireImports } RunQueueEventRetireImports
* @typedef { import('../types-internal.js').RunQueueEventNegatedGCAction } RunQueueEventNegatedGCAction
* @typedef { import('../types-internal.js').RunQueueEventBringOutYourDead } RunQueueEventBringOutYourDead
* @typedef { import('../types-internal.js').RunQueueEventCleanupTerminatedVat } RunQueueEventCleanupTerminatedVat
* @typedef { import('../types-internal.js').RunQueueEvent } RunQueueEvent
*/

Expand Down Expand Up @@ -1226,6 +1255,8 @@ export default function buildKernel(
} else if (message.type === 'negated-gc-action') {
// processGCActionSet pruned some negated actions, but had no GC
// action to perform. Record the DB changes in their own crank.
} else if (message.type === 'cleanup-terminated-vat') {
deliverP = processCleanupTerminatedVat(message);
} else if (gcMessages.includes(message.type)) {
deliverP = processGCMessage(message);
} else {
Expand Down Expand Up @@ -1285,6 +1316,10 @@ export default function buildKernel(
// sometimes happens randomly because of vat eviction policy
// which should not affect the in-consensus policyInput)
policyInput = ['create-vat', {}];
} else if (message.type === 'cleanup-terminated-vat') {
const { cleanups } = crankResults;
assert(cleanups !== undefined);
policyInput = ['cleanup', { cleanups }];
} else {
policyInput = ['crank', {}];
}
Expand Down Expand Up @@ -1318,7 +1353,9 @@ export default function buildKernel(
const { computrons, meterID } = crankResults;
if (computrons) {
assert.typeof(computrons, 'bigint');
policyInput[1].computrons = BigInt(computrons);
if (policyInput[0] !== 'cleanup') {
policyInput[1].computrons = BigInt(computrons);
}
if (meterID) {
const notify = kernelKeeper.deductMeter(meterID, computrons);
if (notify) {
Expand Down Expand Up @@ -1742,20 +1779,30 @@ export default function buildKernel(
* Pulls the next message from the highest-priority queue and returns it
* along with a corresponding processor.
*
* @param {RunPolicy} [policy] - a RunPolicy to limit the work being done
* @returns {{
* message: RunQueueEvent | undefined,
* processor: (message: RunQueueEvent) => Promise<PolicyInput>,
* }}
*/
function getNextMessageAndProcessor() {
function getNextMessageAndProcessor(policy) {
const acceptanceMessage = kernelKeeper.getNextAcceptanceQueueMsg();
if (acceptanceMessage) {
return {
message: acceptanceMessage,
processor: processAcceptanceMessage,
};
}
const allowCleanup = policy?.allowCleanup ? policy.allowCleanup() : {};
// false, or an object with optional .budget
if (allowCleanup) {
assert.typeof(allowCleanup, 'object');
if (allowCleanup.budget) {
assert.typeof(allowCleanup.budget, 'number');
}
}
const message =
kernelKeeper.nextCleanupTerminatedVatAction(allowCleanup) ||
processGCActionSet(kernelKeeper) ||
kernelKeeper.nextReapAction() ||
kernelKeeper.getNextRunQueueMsg();
Expand Down Expand Up @@ -1831,7 +1878,8 @@ export default function buildKernel(
await null;
try {
kernelKeeper.establishCrankSavepoint('start');
const { processor, message } = getNextMessageAndProcessor();
const { processor, message } =
getNextMessageAndProcessor(foreverPolicy());
// process a single message
if (message) {
await tryProcessMessage(processor, message);
Expand Down Expand Up @@ -1867,7 +1915,7 @@ export default function buildKernel(
kernelKeeper.startCrank();
try {
kernelKeeper.establishCrankSavepoint('start');
const { processor, message } = getNextMessageAndProcessor();
const { processor, message } = getNextMessageAndProcessor(policy);
if (!message) {
break;
}
Expand All @@ -1889,6 +1937,11 @@ export default function buildKernel(
case 'crank-failed':
policyOutput = policy.crankFailed(policyInput[1]);
break;
case 'cleanup': {
const { didCleanup = () => true } = policy;
policyOutput = didCleanup(policyInput[1]);
break;
}
case 'none':
policyOutput = policy.emptyCrank();
break;