Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExitObject/SeatHandle cross-vat reference cycle retains old objects #8401

Closed
warner opened this issue Sep 28, 2023 · 14 comments
Closed

ExitObject/SeatHandle cross-vat reference cycle retains old objects #8401

warner opened this issue Sep 28, 2023 · 14 comments
Assignees
Labels
bug Something isn't working Inter-protocol Overarching Inter Protocol performance Performance related issues Zoe Contract Contracts within Zoe Zoe package: Zoe

Comments

@warner
Copy link
Member

warner commented Sep 28, 2023

While investigating mainnet storage usage, I discovered a cross-vat reference cycle between v9 (zoe) and v29 (ATOM-USD price feed). There are about 50k instances of this cycle, and each one is keeping enough objects alive to consume 37 kvstore entries. I estimate that this is currently costing about 310MB of combined SQLite+IAVL space. The cycles involve Seats for price-feed operations, so I suspect the quantity is growing over time. I'm not confident of the growth rate, but it might be about 3MB per day.
zoe-pricefeed cycle

Following the cycle upstream, starting from Zoe, we get:

  • Zoe (v9) has a (strongly) imported Presence for an ExitObject, which is hosted by the price-feed vat (V29)
  • this Presence is held as part of the state in a zoeSeatAdmin durable object
  • the zoeSeatAdmin is held as a value of the seatHandleToZoeSeatAdmin WeakMapStore
  • the key which holds it is a SeatHandle
  • that SeatHandle is a zoe-local durable object, which is also exported (strongly)

Now on the price-feed vat (v29):

  • v29 has a (strongly) imported Presence for the SeatHandle
  • this Presence is held as a value of the zcfSeatToSeatHandle WeakMapStore
  • the key which holds it is a zcfSeat, a local durable object
  • the zcfSeat is held as part of the state of an ExitObject, a local durable object
  • this ExitObject is the one exported to zoe, completing the cycle

SwingSet cannot collect cross-vat cycles, even if they merely involved weak imports, not strong ones. (In fact, performing distributed GC among mutually-suspicious parties is a significant research topic, what we in the ocap community call a "purple box" for the PhD-thesis -level impact it has on project planning). So all the objects and collection entries involved are kept alive, costing about 37 kvStore entries for each instance.

I'm still examining the chain state to confirm that there are no other references to the objects in this cycle, and the swingstore would not show whether there are ephemeral (RAM) references like a closed-over Presence. But I suspect that it is only the reference cycle that is keeping them alive.

Fixes

The next step of the investigation is to check the history of one instance, and determine whether the seat/offer involved has been exited or not. I also need to find out whether any other contracts are involved, or if all 50k instances are from the price-feed vat.

One outcome might be that exiting an offer allows everything to be cleaned up, and these cycles only exist for non-exited offers, and the core issue is that our price-feed vat is failing to exit the seats. If that's the case, then the flatten-the-slope fix is to change those vats to always exit their old offers/seats when they're done, and deploy a vat upgrade.

Another possibility is that seats have been exited, but the code is relying upon GC to collect everything, and the cycle is inhibiting that cleanup. In that case, we need to change something (either in v9-zoe, or in the other vats) to react to the exit by nulling out some state, to break the cycle:

  • the price-feed vat's ExitObject could do state.zcfSeat = null
  • the price-feed vat could do zcfSeatToSeatHandle.delete(seat)
  • zoe could do seatHandleToZoeSeatAdmin.delete(seatHandle)
  • zoe's zoeSeatAdmin object could do state.exitObject = null

It might also be possible to rearchitect the data structures to avoid the cycle. Each state property and map is surely there for a reason, but perhaps there is a rearrangement that would lack a cycle. I've seen cases before where object X holds property Y and Z for convenience, but having both on the same object creates graph shapes with cycles, and introducing a separate WeakMap or moving a property to a different object could provide a fix.

Remediation

Given the durability of the objects/collections involved, upgrading the code to prevent new cycles from forming will not help remove the 50k existing ones.

More investigation might reveal that strong references are still being held to these objects (perhaps there's an in-RAM Set of un-exited seats?). If so, and if being un-exited is the problem, then it might be possible to explicitly exit each one. Given the sheer number (1.6M) of kvStore entries involved, we should rate-limit this process, to avoid interfering with normal operation or IAVL pruning (same issue as in #8400).

However, if WeakMapStore cycle is really the only thing keeping these alive, then none of the objects are reachable from JS code. One possible path would be for the upgraded code to replace its zcfSeatToSeatHandle WeakMapStore entirely, however that might cause problems for real (live) seats.

In this case, our best idea so far is to entirely terminate the price feed vats.

This would trigger the kernel's cleanup code (terminateVat, cleanupAfterTerminatedVat, and deleteCListEntry), which should decrement the refcounts of the imported SeatHandle objects, allowing v9-zoe to drop its WeakMapStore entries, dropping the (now-abandoned) ExitObjects, and deleting all parts of the cycle. (I'm 90% sure the kernel will do this correctly, however I need to add an explicit test that uses a real cross-vat cycle to be sure).

Terminating the price-feed vat will also abandon its Issuer: the identity remains the same, however it no longer accepts messages, so getAmountOf() will never again succeed. This is a visible trauma (worse than a a mere vat upgrade would cause), equivalent to destroying the feed, so downstream vats will need to somehow be told to start using a new one, and I don't know what that would take to implement.

cc @Chris-Hibbert @erights

@warner warner added bug Something isn't working Zoe package: Zoe performance Performance related issues Zoe Contract Contracts within Zoe Inter-protocol Overarching Inter Protocol labels Sep 28, 2023
@warner
Copy link
Member Author

warner commented Sep 28, 2023

Incidentally, it looks like some of these cycles are also keeping deposited IST Payment objects alive in v9-zoe.

% grep 'o+d11/984\b' v9.vs
v9.vs.vom.o+d11/984|{}
v9.vs.vom.o+d31/7765|{"currentAllocation":{"body":"#{\"Collateral\":{\"brand\":\"$0.Alleged: ATOM brand\",\"value\":\"+0\"},\"Minted\":{\"brand\":\"$1.Alleged: IST brand\",\"value\":\"+26167041\"}}","slots":["o-292","o+d10/1"]},"proposal":{"body":"#{\"exit\":{\"onDemand\":null},\"give\":{\"Collateral\":{\"brand\":\"$0.Alleged: ATOM brand\",\"value\":\"+7000000\"}},\"want\":{\"Minted\":{\"brand\":\"$1.Alleged: IST brand\",\"value\":\"+26167041\"}}}","slots":["o-292","o+d10/1"]},"exitObj":{"body":"#\"$0.Alleged: ExitObject\"","slots":["o-16246"]},"offerResult":{"body":"#{\"invitationMakers\":\"$0.Alleged: Vault Holder invitationMakers\",\"publicSubscribers\":{\"vault\":{\"description\":\"Vault holder status\",\"storagePath\":\"published.vaultFactory.managers.manager0.vaults.vault17\",\"subscriber\":\"$1.Alleged: Durable Publish Kit subscriber\"}},\"vault\":\"$2.Alleged: Vault Holder holder\",\"vaultUpdater\":\"$3.Alleged: Recorder\"}","slots":["o-16249","o-16250","o-16251","o-16252"]},"offerResultStored":{"body":"#true","slots":[]},"instanceAdminHelper":{"body":"#\"$0.Alleged: instanceAdmin helper\"","slots":["o+d33/36:0"]},"withdrawFacet":{"body":"#\"$0.Alleged: InstanceStorageManager withdrawFacet\"","slots":["o+d26/36:2"]},"publisher":{"body":"#\"$0.Alleged: zoe Seat publisher publisher\"","slots":["o+d30/7765:0"]},"subscriber":{"body":"#\"$0.Alleged: zoe Seat publisher subscriber\"","slots":["o+d30/7765:1"]},"payouts":{"body":"#{\"Collateral\":\"$0.Alleged: ATOM payment\",\"Minted\":\"$1.Alleged: IST payment\"}","slots":["o-16253","o+d11/984"]},"exiting":{"body":"#true","slots":[]}}
v9.vs.vom.rc.o+d11/984|1
%

The first row is that vref's VOM state (an empty object, as expected). The second row is a zoeSeatAdmin object's state, which points at the Payment in state.payouts.Minted. The last row is the refcount (1).

I noticed that half of the 3,446 IST Payment objects in v9-zoe are kept alive by the recovery set (so they haven't been deposited), and their balance is 0.0 IST. I think these zoeSeatAdmin references explain the other half: if they've been deposited, they'll no longer be in the recovery set or the paymentLedger.

@mhofman
Copy link
Member

mhofman commented Sep 28, 2023

In fact, performing distributed GC among mutually-suspicious parties is a significant research topic

For reference, some idea for enabling cooperative distributed gc are detailed in #6793, and while it was explored with mutually suspicious vats in mind, it is definitely not fully fleshed out yet.

@warner
Copy link
Member Author

warner commented Oct 2, 2023

@erights and I dug into this more today. We think the code isn't doing anything surprising or wrong, but it's relying upon distributed GC features that swingset doesn't have, and it would be appropriate for the code to cooperate in breaking the cycles to avoid that reliance.

We identified three kinds of changes to make. The first is to delete state or WeakStore entries when the seat is exited. The second is to remediate (break) the existing cycles that involve fully-exited seats. The third is to change the data structures to avoid creating the cycles in the first place.

cycle-fixes

Breaking Cycles During seat.exit

The diagram above shows a different view of the cycle causing problems. @erights and I walked around this loop, examining each place where data was held, with two questions in mind:

  • can we delete this edge when the seat is exited?
  • could we avoid having this edge in the first place?

We identified two links that could (probably) safely be severed when the seat is exited.

  • 1: The ZCF/contract-vat -side zcfSeatToSeatHandle weakstore, defined in zcfSeat.js
    • Once the seat is exited, we can do zcfSeatToSeatHandle.delete(zcfSeat) because we no longer need to get to the seatHandle (which Zoe accepts to interact with the corresponding Zoe-side pieces, like re-allocation)
    • The contract can call zcfSeat.exit() or zcfSeat.fail(). At the end of each, after we've sent exitSeat or failSeat to zoe, we can delete the weakstore entry
    • that deletes both the weak-key reference to zcfSeat (which allowed the ExitObj to keep the seat-to-handle entry alive), and the strong value reference to the imported seatHandle Presence
  • 2: The exitObj state.zcfSeat virtual-object property link
    • In some cases ("onDemand"), the user can invoke exitObj.exit(), which calls zcfSeat.exit().
    • Once exitObj.exit() calls state.zcfSeat.exit(), it should do state.zcfSeat = undefined
    • We must also change it to check if (!state.zcfSeat) and throw the same Fail'seat has been exited' that zcfSeat.exit() would throw when it called+failed assertActive().
    • That deletes the strong VOM property reference from exitObj to zcfSeat

#3 shows a link that is not safe to sever at exit: the Zoe-side seatHandletoSeatAdmin weakstore, which maps SeatHandles (either Presences when the handle was created on the contract-vat side, for internal/no-escrow seats, or local virtual objects when the handle was created by zoe, for Offer-driven seats) to a zoeSeatAdmin. The zoeSeatAdmin is one facet of the cohort created by makeZoeSeatAdminKit (the other is zoeUserSeat, which is given to the offering user).

We cannot delete this entry (e.g. during seatAdmin.exitSeat()) because the ZCF side might call getExitSubscriber() at any time, even after exiting, and we need it to keep working.

These fixes do not require drastic changes to the data structures. Both would need to be made to the ZCF code.

Deployment would consist of:

  • 1: deploy a version of Zoe that has a governance control over which ZCF bundle to use for new/upgraded contracts
  • 2: deploy a fixed version of ZCF to the chain (install the new bundle, then use the governance control to make Zoe use it)
  • 3: upgrade contract vats until all are using the new ZCF

Note that this first change does not break cycles for non-exited seats, nor does it break any existing cycles already present in the contract vats.

Remediating Existing Cycles

Most of our existing cycles are for seats that have already been exited, so adding code to delete edges during exit() won't help us remove them.

However, we can add code to ZCF to explicitly break existing cycles for exited seats. ZCF has a activeZCFSeats, which..

oh, wait, @erights and I mis-read activeZCFSeats as being a strong map, but in fact it is a weak map. That breaks our remediation plan.

For reference, if it had been a strong map, the plan was:

  • stash the exiting zcfSeatToSeatHandle in a temporary named oldZcfSeatToSeatHandle
  • create a replacement WeakMapStore called newZcfSeatToSeatHandle
  • iterate through all ZCFSeat keys in activeZCFSeats, for each one:
    • use oldZcfSeatToSeatHandle to obtain the SeatHandle
    • add the seat/handle mapping into newZcfSeatToSeatHandle
  • after iteration, swap newZcfSeatToSeatHandle into place as zcfSeatToSeatHandle, like how dropAllReferences does it

That would retain references to all active SeatHandles, but would break the cycles for the exited seats.

Deployment would involve changing ZCF to perform this work once, at upgrade-time startup, then deploying Zoe/ZCF updates as above, then upgrading each contract vat (probably one at a time, given how much GC work each one will trigger).

But, I now see that activeZCFSeats is weak, so we'll have to go back to the drawing board. Perhaps there is another table somewhere that holds strong references to all non-exited seats, from which we can perform this kind of selective "deletion" from zcfSeatToSeatHandle. Maybe on the Zoe side.

Note that calling dropAllReferences() would break the cycles, but it would also break the non-exited seats (they could no longer be used when talking to zoe). This would still be less traumatic than killing the contract vat entirely (which would lose the identity of the Issuer, so clients would somehow need to be instructed to accept a new one), and new seats created after the break would work, but any existing seats would be left in a non-working state.

Avoiding Cycles In The First Place

Both of the above fixes would only help with exited seats: the cycle would still exist from seat creation until seat exit. We identified one change which would reduce the duration of the cycle still further, from zoe-side offer submission to contract-side seat creation. This window is normally very short, as contracts do not wait for external input before creating the seat.

As described above, ZoeSeatKit is currently a cohort of two facets: facets.zoeSeatAdmin and facet.userSeat. The first is the internal administrative interface for a seat. Each contract vat's ZCF code gets access to zoeInstanceAdmin, and it can pass the seatHandle to methods on it, to instruct Zoe how to manipulate the seat (e.g. exitSeat(), replaceAllocations, etc). Internally, zoeInstanceAdmin uses seatHandleToZoeSeatAdmin to look up facets.zoeSeatAdmin and call internal methods that should not be shared with the contract side.

The other facet, facet.userSeat, is given to users of the seat. This has methods like getPayouts and hasExited, but the most significant one is tryExit, which needs to forward a request to the contract-side ExitObject. All user seats get some kind of exitObj, however only seats whose exit policy is onDemand will do anything when userSeat.tryExit() calls exitObj.exit() (e.g. if I sell you an option, you can exit the contract early, but I cannot). The rest will just throw.

This ExitObject is created by ZCF in the contract vat, and delivered to zoeSeatAdmin.resolveExitAndResult(), which stores both exitObj and a promise for the offer result in the ZoeSeatKit's state. None of the other zoeSeatAdmin methods touch exitObj. Only zoeSeatAdmin.resolveExitAndResult needs to write it, and only userSeat.tryExit needs to read it.

The way to break the cycle would be to move the userSeat out of ZoeSeatKit. We make a new cohort, perhaps named userSeatKit, with two facets: userSeat (same API as before), and exitObjectSetter. The state of userSeatKit contains only state.exitObj and state.zoeSeatAdmin. exitObjectSetter.setExitObj() sets state.exitObj. All userSeat methods forward directly to zoeSeatAdmin except for tryExit, which uses its special access to state.exitObj to send it an .exit() method.

We then have a separate WeakStore that maps seatHandle to exitObjectSetter. This WeakStore participates in the cycle, but only until setExitObj() is called, at which point we remove the entry and break the cycle.

By retaining a reference to the userSeat, external callers will keep the same set of objects alive. However once they drop that reference, there will no longer be a cycle, and everything should get collected.

The challenging part is how to deploy this kind of change. Our virtual-object system supports the addition of new facets to Kinds (in new incarnations), but not the removal or renaming of facets. Existing userSeat facets must continue to work. So we'd need to support two different forms of userSeat at the same time: legacy ones (facets of ZoeSeatKit, whose state.exitObj causes cycles), and new ones (separate Kinds, with an exitObjSetter facet, present in the new weakstore).

@warner
Copy link
Member Author

warner commented Oct 4, 2023

BTW I walked through one instance of this cycle (v29: o+d11/1000), and found no other inbound references (neither strong nor weak) in the vatstore. We cannot tell if there are ephemeral references (e.g. a Representative which is held by something in RAM) by looking at the vatstore: we'd have to somehow peek inside the vat's memory and look for slotToVal.get('o+d11/1000'). But I suspect there are no RAM references.

@warner
Copy link
Member Author

warner commented Oct 6, 2023

In my run-5 dataset, v9-zoe has 55,867 instances of the ZoeSeatKit Kind (grep 'v9.vs.vom.o+d31/' v9.vs |wc -l). Of those, 54,100 have state.exiting = true, and the other 1,767 have state.exiting = false.

v9-zoe has a scalarDurableSetStore (c100) with 1,738 entries and a label of zoeSeatAdmins. Sampling a few, it looks like all the members are ZoeSeatKits whose state.exiting = false. (I don't know why 29 are missing).

That would let us enumerate most of the live seat-admin objects. But I don't know how to turn that into a replace-the-weakmap remediation. We'd need two more things:

  • 1: find out about the 29 missing ones, we can't leave them out if they're still live
  • 2: find a way for Zoe to get from the (enumerable) zoeSeatAdmin objects to the seatHandle that is a key in the seatHandleToZoeSeatAdmin weakmap that we need to replace

@dckc
Copy link
Member

dckc commented Oct 6, 2023

Replacing the price-feed vat is feasible / straightforward

Terminating the price-feed vat will also abandon its Issuer: the identity remains the same, however it no longer accepts messages, so getAmountOf() will never again succeed. This is a visible trauma (worse than a a mere vat upgrade would cause), equivalent to destroying the feed, so downstream vats will need to somehow be told to start using a new one, and I don't know what that would take to implement.

Fortunately, most of our system deals only indirectly with the price-feed vat.

  • the ATOM brand in agoricNames.brand (connected to an IBC asset by way of the vbank) is distinct from the ATOM brand in agoricNames.oracleBrand
  • v29 (ATOM-USD price feed) produces quotes for the oracleBrand
  • v8 (priceAuthority) is really a price authority multiplexer. v29 is registered with v8. v8 is the home of the priceAuthority given to vaults etc.
  • v46 (scaledPriceAuthority-ATOM) produces quotes for the brand by asking the priceAuthrority from v8 for an oracleBrand.ATOM quote and then "scaling" it to brand.ATOM
  • v8 currently services oracleBrand.ATOM requests by forwarding them to v29.
  • v46 is also registered with v8, which is how vaults can get quotes for the negotiable IBC brand.ATOM

It's straightforward to start a new price feed (aggregator) for oracleBrand.ATOM and register it with v8. The biggest impact would be on the oracle operators, who would have to redo the "redeem invitation" step so that they can submit prices to the new price feed (aggregator). v29 would then no longer be needed. I'm reasonably confident that nothing in our system is holding on to quote payments from v29 and might want to getAmountOf() one.

cc @turadg @michaelfig in case I'm mistaken

@warner
Copy link
Member Author

warner commented Dec 8, 2023

In #8402 I'm investigating to see if we can afford to remediate this in "one fell swoop". I don't yet know what we need to do in Zoe or the contracts to break the existing cycles, but assuming we find a way, we're going to be deleting 50k-100k objects all at once. If we're lucky, then we can perform the remediation during a chain-software upgrade (which would upgrade zoe, delete the old WeakMaps during vat upgrade, and perform a prompt BOYD to trigger the retireExports), and take the delay hit while all the validators are starting back up. If the cleanup doesn't take more than 30 minutes or so, this would look a lot like agoric-upgrade-11 where we had to do a huge DB migration that took a while.

@Chris-Hibbert
Copy link
Contributor

@warner wrote:

The #8401 zoe/exitObject cycle problem has 175k cycles (125k in v29, 34k in v68, 10k in v48, then a long tail of 13 other vats with less than 2k each). Clearing those will be more interesting. My current estimate is that dropping them all at the same time might take about 30 minutes to GC. Still working on a testbed to estimate it better (and I won't feel confident about recommending a one-fell-swoop approach until we do a main-fork simulation).
(also, I don't think we have a good plan for deleting that cycle yet, not without also losing something important)

@Chris-Hibbert
Copy link
Contributor

The challenging part is how to deploy this kind of change. Our virtual-object system supports the addition of new facets to Kinds (in new incarnations), but not the removal or renaming of facets. Existing userSeat facets must continue to work.

This seems straightforward to me. We have to continue to define the behavior of the deprecated kind, but don't use its maker for anything. There would be a new exo definition (which would get a new kindID), and its maker function would be used for any future creation of ZoeSeatKits.

@Chris-Hibbert
Copy link
Contributor

My current theory/plan:

@warner
Copy link
Member Author

warner commented Dec 22, 2023

FWIW, I think this cycle exists on all seats, not just price authorities (although those are far more numerous, so a price-authority-only fix would still solve our immediate sustainability problems).

As mentioned in today's meeting, our most likely path toremediate this problem in the large price-feed vats will be to improve the kernel to the point where we can survive a "one-swell-foop" cleanup (#8644, #8402, maybe #8417 if we upgrade zoe to include it), and then terminate the price-feed vats. We can install replacement vats first (to fix the bug and stop the storage growth), then defer terminating the old vats until we can survive the GC.

But, instead of terminating the old vats, we might look for a way to "upgrade" them to a special vat whose only job is to slowly delete the old objects, sort of like how a polluted gas station will be shut down and provisioned with special clean-the-dirt machines for years before it can be safely used for other purposes. Given that the cycles are held by WeakMaps, this might not be easy. One option to consider is a special raw vat (no liveslots, no ocap rules) which uses direct vatstoreGet calls to walk through the data left over by the price-feed vat, and submit syscall.dropImports() calls at a fixed rate. This would be morally equivalent to giving the cleanup vat a special debug ability (perhaps through some liveslots option) that exposes a WeakMapStore.keys() iterator (normally forbidden, of course).

We've discussed having "caretaker" or "estate-winddown" bundles configured for each vat, so that instead of terminating a vat, we "upgrade" it to a version whose only job is an orderly disposal of its assets. It might be reasonable for such an wind-down vat to get more authority than usual, things like WeakMapStore.keys(), or perhaps raw vatstore access. This would allow the code in the winddown vat to violate ocap rules, but we wouldn't expect such vats to host mutually-adversarial code anyways.

@warner
Copy link
Member Author

warner commented Feb 8, 2024

Note to self: to measure the size of this problem, use SwingSet/misc-tools/categorize-kvstore.js:

% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json ingest run-17-swingstore.sqlite
% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json large-collections | grep seatHandleToZoeSeatAdmin

 v9   c24  205815 seatHandleToZoeSeatAdmin                                        r0000000001:o-54 -> {"body":"#\"$0.Alleged: ZoeSeatKit zoeSeatAdmin\"","slots":["o+d31/1:1"]}

@aj-agoric
Copy link

Closing as this is shipped in upgrade16 need another ticket for not creating cycles

@warner
Copy link
Member Author

warner commented Aug 19, 2024

So in upgrade-16, we deployed an upgrade to vat-zoe, which includes PR #8697 . So as of 23-Jul-2024, any Seat that gets exited (or fails) will cut the cycle, and allow the objects to be dropped. That should remove most of the growth related to cycles. (TODO: look at the chain stats to verify this)

The remaining issue is for Seats that don't exit. To make sure these don't leave cycles around, we need to avoid creating the cycles in the first place. We (@warner and @Chris-Hibbert ) have discussed a very different design that would avoid the cycles, but transitioning from the old Kinds to the new ones might be tricky. I've filed #9922 to track that work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Inter-protocol Overarching Inter Protocol performance Performance related issues Zoe Contract Contracts within Zoe Zoe package: Zoe
Projects
None yet
Development

No branches or pull requests

5 participants