Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add controller.terminateVat() #8687

Closed
warner opened this issue Dec 21, 2023 · 2 comments · Fixed by #9253
Closed

add controller.terminateVat() #8687

warner opened this issue Dec 21, 2023 · 2 comments · Fixed by #9253
Assignees
Labels
enhancement New feature or request SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Dec 21, 2023

What is the Problem Being Solved?

While working on ghost-replay tooling, I found a need to terminate a vat from outside the kernel. We currently have controller.upgradeStaticVat() to trigger vat upgrades, but nothing to trigger vat termination.

This might be useful for mainnet remediation of things like #8401 , if our strategy is to just kill the price-feed vats that are participating in the majority of the cycles. OTOH, our remediation might live mostly in userspace, if the adminNodes for those vats are reachable from the core-eval environment, so we can E(adminNode).terminateWithFailure() each one "from the inside".

Description of the Design

Add controller.terminateVat(vatID, reasonString).

To implement this properly, we must first convert vat termination into a queued event, like update-vat. Some notes are already present as comments:

// TODO: terminateVat is async, result doesn't fire until worker
// is dead. To fix this we'll probably need to move termination
// to a run-queue ['terminate-vat', vatID] event, like createVat

// ISSUE: terminate stuff in its own crank like creation?

The basic plan is:

  • define a new type RunQueueEventTerminateVat
  • add kernel.js processTerminateVat(), just after processUpgradeVat
    • return a CrankResults with the .terminate flag set
    • this will trigger the termination work at the end of processDeliveryMessage
  • update deliverRunQueueEvent to add a case for terminate-vat, just after the clause for changeVatOptions
  • change vat-admin-hooks.js terminate to enqueue a terminate-vat event (using kernelKeeper.addToAcceptanceQueue(), just like the neighboring changeOptions or upgrade do it)

Timing Changes

This will change the timing of external vat termination slightly.

In the case of external termination, some parent vat does E(adminNode).terminateWithFailure(). This enqueues a message to vat-vat-admin. Later, when this message arrives, vat-vat-admin invokes device-vat-admin, which calls into vat-admin-hooks.js, which calls void terminateVat() and then returns right away. terminateVat() is async, so vat-vat-admin regains control promptly, before any state changes have actually happened, and it resolves the terminateWithFailure() result promise right away. On the next kernel turn, terminateVat() will reject all the vat's outstanding promises, delete its remaining state, and enqueue a notification to vat-vat-admin. On some future crank, that notification arrives at vat-vat-admin, which will resolve the adminNode's done() promise to anyone who might be watching.

There is an awkwardness in the old implementation: after terminateVat() enqueues the notification, it does await vatWarehouse.stopWorker(vatID);, which won't fire until after the xsnap worker process has been fully killed. We don't know how long this will take. We think it shouldn't cause a problem, even if it took several minutes, but it would be nice to be more deterministic, and not claim success until the worker is really dead. The code in vat-admin-hooks currently ignores this Promise.. I'm not sure if we could/should change that to await it instead.

With this change, vat-admin-hooks.js will merely enqueue a request, and the termination won't actually happen until that request makes it to the top of the run-queue. Where previously the termination was delayed by one trip-through-the-queue (the delivery of E(adminNode).terminateWithFailure()), it is now delayed by two trips (adding waiting for terminate-vat to reach the top). Likewise, previously the done() promise resolution was delayed by three trips (relative to the sending of terminateWithFailure()): delivery of terminateWithFailure() to vat-vat-admin, notification from terminateVat() to vat-vat-admin, and finally notification of done() promise resolution to the subscribing vats. This change will increase that to four trips.

In the case of self-termination (vatPowers.exitVat()) or error-termination (bad syscall), the syscall will set a flag, and the end-of-crank processing will see crankResults.terminate, and will call terminateVat. This will remain immediate: the vat must not survive the crank ("death before confusion": if it requested self-termination, it was because some invariant has been violated, and it must not receive any further messages).

Alternate Approaches

We could consider leaving E(adminNode).terminateWithFailure() alone, and merely introduce the new run-queue event for the benefit of controller.terminateVat(). That would leave the worker-killed awkwardness, but would avoid changing the timing of terminateWithFailure(). I don't think that timing is important, and the awkwardness is awkward, so I'm inclined to do the full cleanup.

Security Considerations

None: this new method is only available from outside the kernel.

Scaling Considerations

Terminating a vat will clean up everything it was referencing (modulo our ongoing efforts to finish vat-data cleanup: we don't delete unreachable durable data yet). This may trigger a significant amount of GC work, which might take a while to play out. Keep this in mind when choosing to delete a vat with a lot of imports or exports.

Test Plan

Normal unit tests, following the pattern of controller.upgradeStaticVat

Upgrade Considerations

The new kernel will have a new kind of run-queue message, but will also be able to process this message. This introduces a no-downgrade ratchet, albeit short-range: while a terminate-vat message is on the run-queue, downgrading the kernel will cause a failure when that message reaches the top. But we don't support kernel downgrades in general anyways, every change we make introduces another such ratchet.

@warner warner added enhancement New feature or request SwingSet package: SwingSet labels Dec 21, 2023
@warner warner assigned warner and unassigned warner Jan 9, 2024
@warner
Copy link
Member Author

warner commented Apr 15, 2024

@Chris-Hibbert and @dckc pointed out that the mainnet bootstrap vat only has the zoe-level adminFacet (i.e. which allows bootstrap to tell Zoe to do things to/about a given contract instance), and Zoe has the vat's adminNode but doesn't expose any path to invoke adminNode.terminateWithFailure(), so the only inside-the-box option would be to upgrade the contract to a version which immediately terminates.

So, I'm going to proceed with adding this controller.terminateVat() feature, so we'll have more options by the time we need to kill off those large price-feed vats.

@warner
Copy link
Member Author

warner commented Apr 17, 2024

One consideration for these sorts of external provocations (controller.terminateVat(), .queueToVatRoot(), .upgradeStaticVat()), is what state changes will happen during the call, and when exactly they will get committed. In general, the host application is responsible either calling hostStorage.commit() just afterwards, or for calling controller.run() and then do hostStorage.commit(). None of the provoked changes will be persisted until that commit().

queueToVatRoot() needs to manipulate the run-queue, add a result promise, and tweak its refcount. It also needs to call kernelStorage.emitCrankHashes() to keep the hash of these changes associated with the external call, instead of getting folded into the first crank of the next controller.run(). upgradeStaticVat() works by enqueueing a message to vat-vat-admin, so it follows the same pattern.

We didn't think to implement a broad-authority terminateVat(vatID) in vat-vat-admin, and can't add one without first upgrading vat-vat-admin (which has other dependencies). And we don't currently manage vat termination through the run-queue (like start-vat or upgrade-vat). So controller.terminateVat() must work by performing the termination work in the middle of controller.terminateVat().

Fortunately, #8928 reduces the size of the DB changes that happen during vat termination. The main changes are to add the vatID to vats.terminated (one key), then reject all outstanding promises (each one does 2x kvStore changes for the promise table, then enqueues some number of notify events to the run-queue, plus some refcount changes). The rest of the deletion is now deferred to a later crank. If the host app does not configure their runPolicy objects to rate-limit cleanup, all that work will happen on the first crank of the next controller.run(). That will be a big crank (cleanup runs to completion by default), but it won't happen in the middle of controller.terminateVat(). If slow-termination is configured, cleanup will happen a tiny bit at a time until the vat's data is completely deleted.

Alternatively, it might be cleaner to finally add a terminate-vat run-queue event (kernel.js has a few scattered notes about this maybe being a good idea), something like ['terminate-vat', vatID]. That would minimize the state change of controller.terminateVat() to a single run-queue entry, which seems nice. It would also change the timing of vat termination: a busy run-queue could delay the termination. For some cases this is fine, E(adminNode).terminateWithFailure() is async anyways. But vatPowers.exitVat() / exitVatWithFailure() should probably not allow the vat to see another delivery, and vat-fatal syscall errors (like bad vrefs) really need to prevent future deliveries too. And, really, if you call controller.terminateVat() when there's a full run-queue, with pending deliveries to the doomed vat, you probably mean for those deliveries to go splat, so you'd prefer prompt termination over queued termination.

Ok, based on that argument I think I'll stick with controller.terminateVat() making non-trivial state changes, and establishing a rule that host applications must hostStorage.commit() those changes in the same atomicity domain as whatever provoked the termination.

@warner warner self-assigned this Apr 17, 2024
warner added a commit that referenced this issue Apr 18, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Apr 23, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 10, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 12, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 13, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 14, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 14, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 14, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 24, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jun 26, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Jul 11, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
warner added a commit that referenced this issue Aug 12, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
@mergify mergify bot closed this as completed in #9253 Aug 13, 2024
@mergify mergify bot closed this as completed in 693b367 Aug 13, 2024
mergify bot added a commit that referenced this issue Aug 13, 2024
This new API allows the host application to terminate any vat for which is knows the VatID (which must be gleaned manually from logs or the database). This might be useful if the normal vat code is unable or unwilling to terminate the vat, or if you need to trigger termination at some specific point in time.

closes #8687
kriskowal pushed a commit that referenced this issue Aug 27, 2024
This new API allows the host application to terminate any vat for
which is knows the VatID (which must be gleaned manually from logs or
the database). This might be useful if the normal vat code is unable
or unwilling to terminate the vat, or if you need to trigger
termination at some specific point in time.

closes #8687
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant