Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vat termination leaks promises #10261

Closed
warner opened this issue Oct 11, 2024 · 2 comments · Fixed by #10268
Closed

vat termination leaks promises #10261

warner opened this issue Oct 11, 2024 · 2 comments · Fixed by #10268
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Oct 11, 2024

Describe the bug

While writing tests for remediation of #9039 , I noticed a second problem. When a vat is terminated, and the kernel rejects its remaining outstanding (vat-decided) promises, we delete the late vat's c-list entry without decrementing the corresponding refcount. That means these promises will be left with a final refcount of 1, not 0, causing the promise (and it's rejected value, e.g. the DisconnectionObject) to be retained forever.

To Reproduce

I'm modifying SwingSet/test/vat-admin/terminate/terminate.test.js to demonstrate the problem, by gleaning the kpid of a dead-vat -decided promise, and then examining the kvStore afterwards to see whether or not the kernel promise table entry is still present.

Impact

On chain, we've only ever deleted one vat (a provisioning vat, during the initial bootstrap run at chain launch, back in 2023). So the impact is minimal so far. But we're intending to terminate some large vats in the near future, so we should get this fixed before we perform those terminations.

Cause

When the kernel terminates a vat, we run a function named terminateVat() in kernel.js . Originally, before we landed slow-termination, this built a list of deadPromises by scanning for all promises decided by the vatID, then used kernelKeeper.cleanupAfterTerminatedVat() to delete all c-list entries (and vatstore), then submit each dead kpid to resolveToError(). This resolveToError calls doResolve with a rejection value, which is the same function that is used by syscall.resolve.

The critical difference is that when a vat does syscall.resolve(), the vatTranslator (which converts the VatSyscallObject to a KernelSyscallObject, translating vrefs to krefs) would also retire the c-list entries immediately after translation, using vatKeeper.deleteCListEntriesForKernelSlots(). This function decrements the entry's refcount just before deleting the kvStore keys, allowing the usual maybeFreeKrefs processing to clean things up.

But terminateVat goes directly to resolveToError/doResolve without going through a vref/kref translator, so nothing ever calls deleteCListEntriesForKernelSlots. It calls cleanupAfterTerminatedVat, and that function does decrement refcounts for kernel objects, but it does not do anything special for kernel promises, it merely deletes the forward (vNN.c.kpNN) and reverse kvStore entries without touching refcounts.

When we landed slow termination, we retained this pattern. All the dead vat's promises are rejected immediately, inside terminateVat(), but we leave c-list handling to the slow/background kernelKeeper.cleanupAfterTerminatedVat(), which behaves just like the old version, and doesn't perform refcount handling on promise c-list entries.

Likely Fix

One path would be to properly retire the promise c-list entries during terminateVat(). We would change it to use vatKeeper.deleteCListEntriesForKernelSlots(deadPromises). I'd have to think about the right ordering: should be retire those entries before or after doing resolveToError on all of them? The vat might have been self-subscribed to some of those promises, so resolveToError will enqueue notifications to the dead vat. I'm pretty sure those will be safely ignored when they reach the front of the run-queue, but we'd need to test it.

A second path would be to change cleanupAfterTerminatedVat() to have a 'promise' phase, probably just after the 'import' and 'export' phases, but before the 'kv' phase. Currently, the vNN.c.kpNN promise c-list entries are bulk-deleted during the kv phase, because I didn't think those deletions had side-effects (I thought the refcounts were decremented immediately, during terminateVat(), so it's safe to delete them at a faster rate. This would be in keeping with our slow-deletion patterns: all userspace-visible consequences of the termination happen immediately, during terminateVat(), but non-userspace-visible consequences (GC drops/retires) are allowed to happen slowly, afterwards.

I'm inclined to take the second path, although it probably means more work: adding a new phase involves cosmic-swingset code to make that limit be configurable.

@warner warner added bug Something isn't working SwingSet package: SwingSet labels Oct 11, 2024
@warner warner self-assigned this Oct 11, 2024
@warner
Copy link
Member Author

warner commented Oct 11, 2024

I took a quick look at the mainnet DB. We have two vats which are already unused, which we'll terminate as soon as the demolition perimeter is secure: v45-auctioneer and v44-auctioneer.governor. These currently have 6 and 0 outstanding promises, respectively.

We're also planning to cease using (and later terminate) v29-ATOM-USD_price_feed and v46-scaledPriceAuthority-ATOM. These are still in use, so they might have more outstanding promises than they will once we stop talking to them, but they currently have 6 and 12 promises outstanding.

So, the overall size of the leak would not be huge.

@warner
Copy link
Member Author

warner commented Oct 11, 2024

Oh, I think it's worse than that.. even externally-decided promises, known to the dead vat (and probably subscribed to), will be leaked, when we delete the c-list entry without decrementing the refcount.

That means the second approach is definitely the better one. At termination time, the kernel walks all the promises decided by the vat, for rejection, but that's not a good place to walk the entire c-list. The slow-cleanup handler is a much better place for that kind of scan.

@warner warner changed the title vat termination leaks kernel-rejected promises vat termination leaks promises Oct 12, 2024
warner added a commit that referenced this issue Oct 12, 2024
Two tests are updated to exercise the cleanup of promise c-list
entries during vat termination. `terminate.test.js` adds some promises
to the c-list and then checks their refcounts after termination, to
demonstrate that bug #10261 is leaking a refcount when it deletes the
dead vat's c-list entry without also decrementing the
refcount. `slow-termination.test.js` adds a number of promises to the
c-list, and expectes the budget-limited cleanup to spend some time on
promises.

Both tests are marked as failing until the code fix is landed in the
next commit.
warner added a commit that referenced this issue Oct 12, 2024
Previously, when a vat was terminated, and we delete the promise
c-list entries from its old state, the cleanup code was failing to
decrement the kpid's refcount properly. This resulted in a leak: those
promises could never be retired.

This commit updates the vat cleanup code to add a new phase, named
`promises`. This executes after `exports` and `imports`, but before
`kv`, and is responsible for both deleting the c-list entries and also
decrementing the refcounts of the corresponding promises. We do this
slowly, like we do exports and imports, because we don't know how many
there might be, and because those promise records might hold
references to other objects (in the resolution data), which could
trigger additional work. However, this work is unlikely to be
significant: the run-queue is usually empty, so these outstanding
promises are probably unresolved, and thus cannot beholding resolution
data.

All promises *decided* by the dead vat are rejected by the kernel
immediately during vat termination, because those rejections are
visible to userspace in other vats. In contrast, freeing the promise
records is *not* visible to userspace, just like how freeing imports
or exports are not visible to userspace, so this cleanup is safe to do
at a leisurely pace, rate-limited by `runPolicy.allowCleanup`.

The docs are updated to reflect the new `runPolicy` API:

* `budget.promises` is new, and respected by slow cleanup
* `work.promises` is reported to `runPolicy.didCleanup()`

The 'test.failing' marker was removed from the previously updated
tests.

I don't intend to add any remediation code: it requires a full
refcount audit to find such promises, and the mainnet kernel has only
ever terminated one vat so far, so I believe there cannot be very many
leaked promises, if any. Once this fix is applied, no new leaks will
occur.

fixes #10261
mergify bot pushed a commit that referenced this issue Oct 26, 2024
Two tests are updated to exercise the cleanup of promise c-list
entries during vat termination. `terminate.test.js` adds some promises
to the c-list and then checks their refcounts after termination, to
demonstrate that bug #10261 is leaking a refcount when it deletes the
dead vat's c-list entry without also decrementing the
refcount. `slow-termination.test.js` adds a number of promises to the
c-list, and expects the budget-limited cleanup to spend a phase on
promises.

Both tests are marked as failing until the code fix is landed in the
next commit.
mergify bot pushed a commit that referenced this issue Oct 26, 2024
Previously, when a vat was terminated, and we delete the promise
c-list entries from its old state, the cleanup code was failing to
decrement the kpid's refcount properly. This resulted in a leak: those
promises could never be retired.

This commit updates the vat cleanup code to add a new phase, named
`promises`. This executes after `exports` and `imports`, but before
`kv`, and is responsible for both deleting the c-list entries and also
decrementing the refcounts of the corresponding promises. We do this
slowly, like we do exports and imports, because we don't know how many
there might be, and because those promise records might hold
references to other objects (in the resolution data), which could
trigger additional work. However, this work is unlikely to be
significant: the run-queue is usually empty, so these outstanding
promises are probably unresolved, and thus cannot beholding resolution
data.

All promises *decided* by the dead vat are rejected by the kernel
immediately during vat termination, because those rejections are
visible to userspace in other vats. In contrast, freeing the promise
records is *not* visible to userspace, just like how freeing imports
or exports are not visible to userspace, so this cleanup is safe to do
at a leisurely pace, rate-limited by `runPolicy.allowCleanup`.

The docs are updated to reflect the new `runPolicy` API:

* `budget.promises` is new, and respected by slow cleanup
* `work.promises` is reported to `runPolicy.didCleanup()`

The 'test.failing' marker was removed from the previously updated
tests.

I don't intend to add any remediation code: it requires a full
refcount audit to find such promises, and the mainnet kernel has only
ever terminated one vat so far, so I believe there cannot be very many
leaked promises, if any. Once this fix is applied, no new leaks will
occur.

fixes #10261
@mergify mergify bot closed this as completed in #10268 Oct 26, 2024
mergify bot added a commit that referenced this issue Oct 26, 2024
…10268)

Previously, when a vat was terminated, and we delete the promise
c-list entries from its old state, the cleanup code was failing to
decrement the kpid's refcount properly. This resulted in a leak: those
promises could never be retired.

This commit updates the vat cleanup code to add a new phase, named
`promises`. This executes after `exports` and `imports`, but before
`kv`, and is responsible for both deleting the c-list entries and also
decrementing the refcounts of the corresponding promises. We do this
slowly, like we do exports and imports, because we don't know how many
there might be, and because those promise records might hold
references to other objects (in the resolution data), which could
trigger additional work. However, this work is unlikely to be
significant: the run-queue is usually empty, so these outstanding
promises are probably unresolved, and thus cannot beholding resolution
data.

All promises *decided* by the dead vat are rejected by the kernel
immediately during vat termination, because those rejections are
visible to userspace in other vats. In contrast, freeing the promise
records is *not* visible to userspace, just like how freeing imports
or exports are not visible to userspace, so this cleanup is safe to do
at a leisurely pace, rate-limited by `runPolicy.allowCleanup`.

The docs are updated to reflect the new `runPolicy` API:

* `budget.promises` is new, and respected by slow cleanup
* `work.promises` is reported to `runPolicy.didCleanup()`

I don't intend to add any remediation code: it requires a full
refcount audit to find such promises, and the mainnet kernel has only
ever terminated one vat so far, so I believe there cannot be very many
leaked promises, if any. Once this fix is applied, no new leaks will
occur.

fixes #10261
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant