vat upgrade leaves refcount=1 on disconnected/rejected promises #9039

warner · 2024-03-06T18:41:08Z

Describe the bug

While looking at the results of the emerynet test of upgrade-14, we noticed an increase (by several hundred) in the number of resolved kernel promise table entries. Further investigation of the DB contents and slogfiles show that, while the kernel correctly disconnects/rejects all promises decided by the upgraded vat, it does not delete them as it is supposed to, because it fails to decrement the refcount on that promise.

The consequence is a storage leak, as we leave the kernel-promise table record around, even though nothing is still referring to it. For the promises leaked during this upgrade, record contains four keys:

kp1068808.data.body: #{"incarnationNumber":0,"name":"vatUpgraded","upgradeMessage":"vat upgraded"}
kp1068808.data.slots:
kp1068808.refCount: 1
kp1068808.state: rejected

Normally, when a vat uses syscall.resolve() to fulfill or reject a promise, the vat's c-list entry for that promise is deleted during the translateResolve() translation step, by calling vatKeeper.deleteCListEntriesForKernelSlots() here:

agoric-sdk/packages/SwingSet/src/kernel/vatTranslator.js

Lines 564 to 567 in c5284e4

    
             kresolutions.push([kpid, rejected, kernelData]); 
        
             kpidsResolved.push(kpid); 
        
           } 
        
           vatKeeper.deleteCListEntriesForKernelSlots(kpidsResolved);

A standard side-effect of deleting a c-list entry is to decrement the kref's refcount, in e.g. kp1068808.refCount.

After translation from vat-refs to kernel-refs, the syscall is actually executed with a function named doResolve():

When the promise is resolved (rejected) during processUpgradeVat(), here:

agoric-sdk/packages/SwingSet/src/kernel/kernel.js

Lines 902 to 905 in c5284e4

    
           // reject all promises for which the vat was decider 
        
           for (const kpid of kernelKeeper.enumeratePromisesByDecider(vatID)) { 
        
             resolveToError(kpid, disconnectionCapData, vatID); 
        
           }

it uses resolveToError(), which routes to the same doResolve() as a syscall. But, because the decref happens during translation, and we don't do any translation in the upgrade path, we're missing the decref. Thus, the refCount is left at 1, and the table entry is never removed.

Fix Ideas

The rough fix is to change processUpgradeVat() to decrement the refcount as the promises are resolved, or use deleteCListEntriesForKernelSlots and follow the syscall.resolve() path as much as possible. However, there are cases where the vat is subscribed to their own promise (eg if they used a durable watchPromise() against a locally-decided Promise), and we need to keep working in those.

Every promise in the vat's c-list is either vat-decided or kernel-decided (as determined by the ${kpid}.decider kvStore entry). For vat-decided promises, the usual resolution path is a syscall.resolve(), which deletes the c-list entry (and decrement the refcount). For kernel-decided promises, the usual path is a dispatch.notify(), which does the same thing.

For upgrade, we identify all the vat-decided kpids and use doResolve() to reject them. This sets their state, and enqueues notify() for any vat that is subscribed (possibly including the upgraded vat, which handles the watchPromise() case). And we need the c-list entry to stick around until that time, because we'll be translating the kpid to a vpid to deliver a dispatch.notify() to the new incarnation. So in that case, we must retain the refcount until that time, and allow translation to handle the decref.

I'll need to think about what the right approach will be.

Remediation

We're likely to deploy upgrade-14 without a fix for this problem, which means mainnet will wind up with about 730 promise records.

Later, we'll want to figure out a remediation path. This will need some sort of mark-and-sweep analysis of kernel promises, locating the reference sources for every kpid, counting them, comparing against the recorded refCount, adjusting the refCount when wrong, and deleting the entry when refCount === 0. The references can come from c-list entries (which are easy to find), but also from enqueued messages. So we'd have to scan the run-queue as well as every unresolved promise's kp$NN.queue.$MM entries.

We currently have about 800 promises on mainnet, none of which have any queued messages. Examining all promises requires enumerating all kpNN* keys in the kvStore, of which there about 4000. So remediation will requires 4000 kvStore reads, plus 800 c-list checks for each of our 94-ish vats, plus a run-queue scan (probably empty), making about 79k kvStore reads. I'm guessing that might take more than a second but less than a minute to run.

We could probably pull this off at the start of an upgrade, but I'm not sure we'd want to do it every time. So maybe we add a controller.fixPromiseRefcounts() call and invoke it once, in some upgrade event after we fix the main bug.

The text was updated successfully, but these errors were encountered:

warner · 2024-03-07T00:50:02Z

One option, not necessarily the worst one, would be to change this upgrade-time code:

      const deadPromises = [...kernelKeeper.enumeratePromisesByDecider(vatID)];
      kernelKeeper.cleanupAfterTerminatedVat(vatID);
      for (const kpid of deadPromises) {
        resolveToError(kpid, makeError('vat terminated'), vatID);
      }

to insert a subscribe call for each promise:

      const deadPromises = [...kernelKeeper.enumeratePromisesByDecider(vatID)];
      kernelKeeper.cleanupAfterTerminatedVat(vatID);
      for (const kpid of deadPromises) {
        doSubscribe(vatID, kpid);
        resolveToError(kpid, makeError('vat terminated'), vatID);
      }

This would forcibly subscribe the vat to every one of its own vat-decided promise, both watched and unwatched (and doSubscribe is idempotent). Then the resolveToError() will enqueue a notify() for every such promise, and the delivery of that notify() will delete the c-list entries (and decref the kpid).

The downside is that we'l be sending spurious notifications into the vat. It will ignore them, but it still feels weird to lie to ourselves about which kpids the vat is interested in.

And, I want this to integrate sensibly with a future world where vats have durable resolution authority. In that world, the vat needs to tell the kernel which promises should be disconnected/rejected by the kernel during upgrade, and which should be left intact (i.e. the new incarnation takes still holds a resolver, and will take responsibility for resolving it). The kernel must skip resolveToError for the durable ones.

Another option, not necessarily the best one, would be to have the upgrade-time code look at each kpid to see whether the vat is already subscribed. If yes, do resolveToError(). If not, do vatKeeper.deleteCListEntriesForKernelSlots(), and then do resolveToError().

      const deadPromises = [...kernelKeeper.enumeratePromisesByDecider(vatID)];
      kernelKeeper.cleanupAfterTerminatedVat(vatID);
      for (const kpid of deadPromises) {
        if (!vatKeeper.isSubscribed(kpid)) {
          vatKeeper.deleteCListEntriesForKernelSlots([kpid]);
        }
        resolveToError(kpid, makeError('vat terminated'), vatID);
      }

That would avoid lying about vat subscriptions, but having the kernel be sensitive to the subscription state feels kinda wrong. And any logic that says "if not A then do B" feels kinda wrong: looking at the sample code above, it's not at all obvious why the lack of a subscription should ask us to delete the c-list ("if X is present you should delete X" would make the most sense).

warner · 2024-03-07T17:18:32Z

Oh, actually, the remediation is easier. The kpid is still in the upgraded vat's c-list, making this a reference leak, not a refcount leak: the refcount is still accurate. Remediation just requires walking the kpid portion of all vat c-lists (eg vNN.c.kpNN, checking the kpNN.state of each one, and doing deleteCListEntriesForKernelSlots for each one that is resolved (i.e. kpNN.state === 'fulfilled' or 'rejected').

On mainnet (run-21, as of 08-feb-2024) we have 1612 such promise c-list entries (mostly one import and one export for each of 800-ish promises). They are tightly grouped by vat, so it will take one DB read each (a getNextKey, which turns into a SELECT key FROM kvStore WHERE key > X), plus an additional no-result -returning query per 94-ish vats, plus the kpNN.state check on each. So it'll need about 3320 queries, plus the c-list deletions, refcount decrements, and promise-table deletions for the ones that get freed. 668 promises are decided by v9-zoe, and 27 are decided by v43-walletFactory, all of which will get disconnected when upgrade14 is deployed and those two vats are upgraded. So that's three kernel-promise table entries to delete for each, plus two c-list entries (two kvStore entries each, since the c-list is bidirectional), making seven deletions for each, so call it 4700 SQL DELETEs and 3200 SQL SELECT statements, so roughly 8000 statements. An order of magnitude better than the full mark-and-sweep would require.

warner · 2024-03-07T18:57:05Z

Ok, I think the right fix is:

on vat upgrade, enumerate all promises in the vat's c-list for which the vat is the .decider, and for each kpid:
- disconnect/reject the kpid (enqueueing notifications to all subscribed vats)
- check to see if the upgraded vat is a subscriber of its own kpid, if not then delete the c-list entry (causing a decref and eventual deletion)

The framework I think I've settled upon is that durable handling of vpids is the rule, not the exception. If we imagine that vats take responsibility for remembering every vpid they've told the kernel about, and have some specific signal or exceptional thing they do to indicate ephemeral usage, then:

the kernel should not disconnect/reject most promises of upgraded vats, since the new incarnation will retain a resolver and will continue to have responsibility for resolution
the kernel should not notify vats about the resolution of their own promises, for the same reason: in this world it would be nonsensical for a vat to subscribe to its own promises, because the vat could reliably notify itself internally upon resolution

Then the exceptions fall into two categories:

"hey kernel, the resolver for this vpid-that-I-decide is ephemeral, I'm going to forget about it when I upgrade, could you please disconnect/reject it for me when that happens"
- (we could imagine a special syscall to announce this exception, like syscall.disconnectOnUpgrade(vpid))
"hey kernel, not only will I forget that, but I also have a durable promise watcher looking at it, so I also need your help notifying my new incarnation about the disconnect/rejection"

With this point of view, we can now say that all existing liveslots versions only emit ephemeral vpids, pretending that they've already made whatever special announcements/syscalls necessary to mark their vpids this way. And the processVatUpgrade code is allowed to do an exceptional/unusual thing (like unilaterally rejecting the vat's promises) because of that (invisible/assumed) annotation. We might make this more legible by having the upgrade code do something like:

  for (const kpid of decidedByVat(vatID)) {
    const ephemeral = true; // TODO: new liveslots should annotated each vpid somehow
    const selfSubscribed = isSubscribed(kpid, vatID); // TODO: same
    if (ephemeral) {
      resolveToError(kpid, disconnectError);
      if (!selfSubscribed) {
        deleteFromCList(vatID, kpid);
      }
    }
  }

and then in the future, when a new syscall API and liveslots version gives us the ability to explicitly mark the unusual promises as ephemeral, that would become:

  for (const kpid of decidedByVat(vatID)) {
    const ephemeral = isEphemeral(kpid);
    const selfSubscribed = isSubscribed(kpid, vatID);
    if (ephemeral) {
      resolveToError(kpid, disconnectError);
      if (!selfSubscribed) {
        deleteFromCList(vatID, kpid);
      }
    }
  }

in which the kernel-unilaterally-rejects behavior is clearly the unusual path, and the "normal" path is for the vat to own all promises and the kernel to never mess with them.

Chris-Hibbert · 2024-03-18T16:38:47Z

Early in the description, it sounded as if the problem results in a failure to transition from a refcount of 1 to 0. Can you reduce the work of your remediation by limiting the search to objects with a refcount of 1?

warner · 2024-07-08T17:26:25Z

Early in the description, it sounded as if the problem results in a failure to transition from a refcount of 1 to 0. Can you reduce the work of your remediation by limiting the search to objects with a refcount of 1?

Alas no, the core issue is that the kernel forgot to remove the ex-deciding vat's c-list entry, but the kpid in question might have additional references (e.g. an argument of a message queued to an unresolved promise), and we need such refs to not inhibit the remediation of deleting the c-list entry (and decrementing its refcount by one). Assuming we only perform this remediation once, if we limited the search to refcount=1, we'd ignore the kpids with extra references, and then once that extra reference went away, we'd be left with the primary problem (refcount=1, ex-deciding vat holds a c-list entry for it, nothing short of vat deletion will remove it).

And anways, that wouldn't make anything faster: our kvStore API doesn't have the ability to filter on the values (and our SQL table doesn't have an index on the values anyways).

warner · 2024-07-08T17:41:28Z

I think I'll use a kvStore schema version (introduced in #9169) to manage the remediation, so version = 1 means the 9039 remediation has not been performed, version = 2 means both 9039 has been fixed and the remediation has been performed.

The remediation process will be:

first, walk the run-queue, find the pending notify events, and make a list of all their vatID, kpid pairs
now, iterate over all vatIDs (vat.names mapped through vat.name.${name}, plus vat.dynamicIDs)
- skip any that are terminated (vatID in vats.terminated), leave cleanup to the slow-deletion code
- iterate over all kvStore keys with prefix ${vatID}.c.kp, yielding kpNNN kpids
  - skip any whose kpNN.state is unresolved
  - the vat might have been subscribed, and a notify might be in the queue
    - so skip if the vatID/kpid pair is in our previous list
  - otherwise, use vatKeeper.deleteCListEntriesForKernelSlots([kpid]) to delete the c-list entry and decrement its refcount
    - that will automatically delete the kpNN table record if appropriate, and in any case it will remove the stale refcount so the record can be deleted at the correct point in the future

The remediation process cost will be:

runQueue.length DB reads to scan for the notifies (usually 0, but we must survive a remediation happening on a non-idle kernel)
one DB read per Promise in all c-lists
- plus an additional DB read for each, to get the kpNN.state status
then a handful of reads and writes for each one that is fulfilled/rejected

mhofman · 2024-09-05T22:45:00Z

then in the future, when a new syscall API and liveslots version gives us the ability to explicitly mark the unusual promises as ephemeral

Well currently a vat doesn't explicitly inform the kernel that an exported ref is ephemeral/virtual, yet the kernel "has some knowledge" about vref allowing it to disconnect these objects on upgrade. I would expect we'll need a cohesive way to handle both objects and promises that are ephemeral / purely virtual, but that seems like a follow up. For now we can keep assuming all promises are ephemeral and reject them.

I think I'll use a kvStore schema version (introduced in #9169) to manage the remediation

If we have a powerful controller API to manage clist of vats (as described in #10028 (comment)), we could simply have a "controller proposal" with a list of vatID/vpid to remove from clists, and rely on an offline analysis. The API could enforce that the promise is settled and doesn't have an active subscription.

warner · 2024-10-08T22:59:19Z

If we have a powerful controller API to manage clist of vats (as described in #10028 (comment)), we could simply have a "controller proposal" with a list of vatID/vpid to remove from clists, and rely on an offline analysis. The API could enforce that the promise is settled and doesn't have an active subscription.

Interesting.. I'm thinking about how much the kernel can trust and how much it needs to verify. Maybe the API basically says "I claim that kpNN in a weird state for vNN". Then the remediation code would see that:

1: kpNN is resolved, and (assumed ephemeral), but also in the vNN c-list, which is not a stable state
2: there is no notify queued, which is then not a correct state
3: then it can correctly delete the c-list entry, and allow refcounting to clean out the promise table entry

Another "weird" state would be: resolved, in the c-list, no notify queued. I don't know of any bugs which would result in that situation, but the code could fix it by enqueuing a notify, whose delivery would clean up everything.

If kpNN were unresolved, then the code would skip it.

The lookup cost would be:

walking the run-queue and finding all notifys: O(lenRunQueue) (and becomes more interesting with parallel execution)
checking each promise: one lookup for the state, then maybe one lookup for the c-list
then if remediation is called for, two writes to delete the entry, a read+write to update the refcount, then some more to deal with the count reaching zero

I've got the "stop causing the problem in the first place" code written, and then I need to write tests. I guess the next decision to make is whether to include automatic remediation code now, or write this external controller API now, or defer writing it for later.

warner · 2024-10-09T00:57:23Z

Hm, so in one path I need to add automatic remediation code (and unit tests), cosmic-swingset doesn't even see it happening, but the first upgrade will take some extra time to do the remediation. On the other path I need to add a controller API to accept VPIDs to examine, swingset unit tests for that API, a new cosmos transaction type to trigger it, cosmic-swingset code to deliver that txn, client code to create those txns, cosmos-side tests to exercise the txns, cosmic-swingset tests (which are kind of uncharted territory) to exercise delivery of those txns, then we need to do the upgrade (which doesn't take any extra time) and then identify the VPIDs that need fixing and create (and pay for) txns to clean them up.

And, both of these mechanisms will only really be used once. Once the main fix goes in, we won't be creating any new problems. So I want to limit how much "throwaway" code we have to write. OTOH, having the overall framework for "alleged start fixup transactions" might be useful in the future, it would certainly reduce (or at least pre-pay) the cost of adding new kinds of fixups, like GC cycle cleanup. OT3H, the rule is to not write those things too early, because we'll know more about the future problem in the future, and we'll do a better job with it if it's driven by a more immediate problem.

So I guess It'll depend on the number of VPIDs that need fixing, and whether the upgrade-time automatic remediation cost looks like it would be too high. Adding/testing the controller API feels about the same amount of work as adding/testing the automatic remediation code. But the extra complexity of adding new cosmos txns (specifically the testing burden of exercising the cosmic-swingset glue code, where we don't have a good framework to write tests) feels particularly high.

warner · 2024-10-09T01:59:34Z

For reference, a recent DB snapshot (agreeable run-53, taken 29-sep-2024) shows 3166 KPIDs to clean up (3128 in the v9-zoe c-list, 38 in v43-walletFactory). The scanning tool (using the faster algorithm) only needed 12k queries to find them all, and ran on my laptop in 0.4s . It will take more queries to delete them, but not more than 2x or 3x. So I'm pretty confident this can safely run at upgrade time, and DB load won't be a compelling reason to prefer offline analysis / "please cleanup XYZ" transactions.

(my first implementation of the scanner used the old/slow algorith, and it required 547k queries and 30s, so the faster approach is a significant improvement).

The kernel is responsible for "disconnecting" (rejecting) promises decided by a vat being upgraded, because Promise settlement functions (resolve/reject) are ephemeral, so they are lost during the upgrade, so the new incarnation will have no way to ever settle them. When the vat resolves a promise with syscall.resolve(), that vat's c-list entry is removed by the syscall handler. But when the kernel did the rejection, the c-list entry was left in place, causing the promise to be pinned forever. This changes the kernel to remove the c-list entry as well, except in the case that the vat is also subscribed to its own promise (which happens when a PromiseWatcher is used on a locally-decided promise: unusual but we needed to tolerate that case). Self-subscribed promises are left in the c-list, because the subsequent dispatch.notify() will finish the job and remove the entry. This adds unit tests for the new behavior (including the self-subscription case). It does not yet add remediation code. Therefore it merely: refs #9039 rather than fixing it entire.

The kernel is responsible for "disconnecting" (rejecting) promises decided by a vat being upgraded, because Promise settlement functions (resolve/reject) are ephemeral, so they are lost during the upgrade, so the new incarnation will have no way to ever settle them. When the vat resolves a promise with syscall.resolve(), that vat's c-list entry is removed by the syscall handler. But when the kernel did the rejection, the c-list entry was left in place, causing the promise to be pinned forever. This changes the kernel to remove the c-list entry as well, except in the case that the vat is also subscribed to its own promise (which happens when a PromiseWatcher is used on a locally-decided promise: unusual but we needed to tolerate that case). Self-subscribed promises are left in the c-list, because the subsequent dispatch.notify() will finish the job and remove the entry. This adds unit tests for the new behavior (including the self-subscription case). It does not yet add remediation code. Therefore it merely: refs #9039 rather than fixing it entirely.

Give this tool a swingstore DB, and it will identify all the vat/kpid pairs that need cleanup to remediate the #9039 bug. This also estimates the amount of work the kernel-side remediation tool will need to do the cleanup automatically. The tool needed 11,882 DB queries to find the 3166 KPIDs needing cleanup in run-53, and ran in 0.37s on my laptop

Most of the kernelKeeper API methods are defined inside makeKernelKeeper, so they can close over some helper functions, which makes it difficult to use those methods from outside a kernelKeeper (e.g. from upgradeSwingset). This commit rearranges things to export two helper functions: * incrementReferenceCount() * addToQueue() The upcoming #9039 remediation work will need access to these.

The kernel is responsible for "disconnecting" (rejecting) promises decided by a vat being upgraded, because Promise settlement functions (resolve/reject) are ephemeral, so they are lost during the upgrade, so the new incarnation will have no way to ever settle them. When the vat resolves a promise with syscall.resolve(), that vat's c-list entry is removed by the syscall handler. But when the kernel did the rejection as part of upgrade, the c-list entry was left in place, causing the promise to be pinned forever. This changes the kernel to remove the c-list entry as well, except in the case that the vat is also subscribed to its own promise (which happens when a PromiseWatcher is used on a locally-decided promise: unusual but we needed to tolerate that case). Self-subscribed promises are left in the c-list, because the subsequent dispatch.notify() will finish the job and remove the entry. This adds unit tests for the new vat-upgrade -time behavior (including the self-subscription case). It does not yet add remediation code, which will appear in a later commit. Therefore it merely: refs #9039 rather than fixing it entirely.

We use the `upgradeSwingset()` mechanism (when switching to version 3) to identify and fix all the leftover kpids caused by #9039. This performs automatic remediation of the damage, exactly once per database. The v2-to-v3 upgrader will look for all c-list promise entries, then check to see if the kpid is settled. It ignores the ones that have `dispatch.notify` events in either the runQueue or the acceptanceQueue. It then enqueues notify events to the vats, adjusts the refcounts and kernel stats accordinging, and finally and bumps the version to 3. These kpids will be for promises which were decided by an upgraded vat, but not resolved by upgrade time. The kernel "disconnects" (rejects) these promises, but #9039 failed to remove them from the upgraded vat's c-list. On the first run after this remediation, the kernel will deliver those notifies, which will delete the c-list entries and decrement their refcounts, which may trigger the usual deletions and further decrefs. The notifies will be delivered to the vat's new incarnation, which will harmlessly ignore them (as unrecognized vpids). This turned out to be the easiest way to get all the possible cleanups to run. New tests were added to exercise the remediation code. As this implements the last part of the fix, it: fixes #9039