
price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

Closed
warner opened this issue Sep 28, 2023 · 6 comments

Labels: bug (Something isn't working), Inter-protocol (Overarching Inter Protocol), performance (Performance related issues), Zoe Contract (Contracts within Zoe)

Comments

warner (Member) commented Sep 28, 2023

Describe the bug

I spent a few days investigating storage usage on mainnet, and learned that about 60% of our kvstore entries are the result of a bug in the price feed contracts, and this count grows by about 10 rows for every new price quote created.

External oracles submit messages (signed txns) into the chain with claims about e.g. the ATOM/USD price at a given time. These get executed as contract operations (create an invitation with args that include the price claim, make an offer, get a seat, etc). The price feed contract then publishes a price quote to subscribers, such as an aggregator that averages (or otherwise combines) quotes from multiple sources to drive Vault creation or liquidation.

While the authority to produce a quote is limited by virtue of the wallets which have access to the price feed's objects, the quote objects themselves might be delivered to arbitrary subscribers (over untrusted paths), and those subscribers want to verify that the quote object they received is genuine. The current implementation uses ERTP Payment objects to represent the price quotes, and subscribers can call E(priceQuoteIssuer).getAmountOf(allegedQuote) to confirm their authenticity.

generating a quote object:

```js
const authenticateQuote = async quote => {
  /** @type {Amount<'set'>} */
  // xxx type should be inferred from brand and value
  const quoteAmount = AmountMath.make(quoteKit.brand, harden(quote));
  // mintPayment() also adds the new payment to the issuer's
  // mintRecoveryPurse recovery set (see below)
  const quotePayment = await E(quoteKit.mint).mintPayment(quoteAmount);
  return harden({ quoteAmount, quotePayment });
};
```

verifying one:

```js
const { value: sourceQuoteValue } = await E(sourceQuoteIssuer).getAmountOf(
  sourceQuotePayment,
);
```

However, subscribers like the scaledPriceAuthority stop there: the quote payment is dropped, without being deposited or burned.

This interacts badly with a safety feature of ERTP: the Recovery Set. To guard against assets being lost, each time you call payment = await E(purse).withdraw(), the payment is added to the Purse's "recovery set": a strong SetStore. If you send the Payment to a correspondent, but they fail to deposit it, you can call getRecoverySet() to access all these Payments, and then deposit them back into a Purse. Or, you can call recoverAll() to sweep them back into the original Purse. Brand new tokens, created by E(mint).mintPayment(), are held in a special mintRecoveryPurse, available from the IssuerKit. All Payments are removed from their Recovery Set as soon as they are deposited in some other Purse, or burned.
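
For illustration, here is a minimal sketch of that lifecycle, using the method names described above; `purse`, `amount`, and `recipient` are placeholders, not an existing API surface:

```js
// Sketch of the recovery-set lifecycle described above; `purse`, `amount`,
// and `recipient` are placeholders.
const payment = await E(purse).withdraw(amount);
// the freshly withdrawn payment now sits in the purse's recovery set
await E(recipient).receive(payment);
// if the recipient deposits (or burns) it, the payment leaves the recovery set;
// if they merely inspect it and drop it, the recovery set keeps it alive until
// the sender sweeps it back:
await E(purse).recoverAll();
```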

Recovery normally isn't necessary, because well-written code will deposit a Payment as soon as it receives it, and only some sort of error would prevent that from happening. However, quotes-as-payments aren't assets to be hoarded: the code only uses ERTP Payments for their authenticity-verification properties, not their exclusive-transfer properties. So it is not surprising that the price contracts fail to deposit() or burn() them.

The consequence is that the Payment remains in the recovery set forever.

The contract and ERTP code use a variety of SetStores, WeakMapStores, and Exos to express this logic. Swingset (specifically liveslots) supports those with "virtual collections" and "virtual objects", all of which bottom out in a per-vat key-value table named the "vatstore". This holds rows to represent collection entries, ordinals to enable consistent iteration through collections, reference counts, weak-key reference-graph edges, export statuses, and finally the virtual-object state itself.

The particular data structures used by ERTP mean that each Payment costs about 10 vatstore entries. In the following diagram, each circle is a vatstore entry, of various types.

[diagram: v29 quote payment]

And this is an example of one particular (still-extant) payment, where each gray box is a separate vatstore entry:

[diagram: v29 quote payment example]

Since vatstore entries are really stored in the kernel's kvStore (a table inside the "SwingStore" SQLite DB), each Payment costs about 10 kvStore entries. And because state-sync integrity requires us to mirror all kvStore items into the IAVL database, each Payment also costs about 10 IAVL entries. We've been creating quotes once every five minutes since the Vaults feature was released last June, and by now we have about 120k quote payments kept alive in v29 (the ATOM-USD price-feed vat). A similar problem is keeping 90k payments alive in v46 (scaledPriceAuthority). As we add new denominations in the future, their price feed vats will have the same issue. I estimate this currently costs 440MB of space (combining both swingstore and IAVL), growing at 5MB/day.
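
As a rough back-of-envelope from the numbers above (the per-row byte figure is only an implied average, not a measured constant):

```js
// Back-of-envelope from the figures in this report (illustrative only).
const payments = 120_000 + 90_000; // still-extant quote payments in v29 + v46
const rowsPerPayment = 10;         // vatstore/kvStore rows per Payment
const kvRows = payments * rowsPerPayment; // ~2.1M kvStore rows
const totalRows = kvRows * 2;             // each kvStore row is mirrored into IAVL
// ~4.2M rows accounting for ~440MB implies an average of roughly 100 bytes per
// row across swingstore+IAVL, and one new quote every five minutes keeps
// adding rows to both databases (~5MB/day).
console.log({ kvRows, totalRows });
```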

Additional Costs

v46 (scaledPriceAuthority) has 90k quote payments in this state; in addition, it has exported about 30k of them to v7-board. That vat is still holding weak references (c-list entries in the "recognizable" state, as opposed to "reachable"). This causes additional kvstore entries for each one: two for the export-side c-list, two in the kernel object table, two in the import-side c-list, and some amount of RAM in v7-board.

We believe that, at some point, v46 received the payment and sent it to E(board.readOnlyMarshaller).toCapData(). The serialization process submits the received quote-payment Presence object to passStyleOf, which would have concluded that the Presence was passable, and added the object to its WeakMap cache. This WeakMap is virtualized by liveslots (to prevent userspace from sensing GC, as usual), and VirtualObjectAwareWeakMap remembers a real Map keyed by vref strings. So v7-board will spend some (small) amount of RAM on each such object. These WeakMap references cause the c-list entry to remain in the "recognizable" state: they do not represent a strong ("reachable") reference.

Both the c-list/kvstore entries and the RAM costs will go away if/when the quote payment objects are released.

We've been considering allowing the Endo/passStyleOf code to use a real WeakMap instead of the virtualized one. This requires the permission of liveslots (otherwise arbitrary user code could do it, violating our GC-sensing rules), so coordination between Swingset and Endo is ongoing. If implemented, this would remove the separate passStyleOf cache and would probably allow v7-board to avoid spending either RAM or c-list imports on the object.

Potential Fixes

To flatten this growth rate, at today's Zoe meeting we discussed one short-term fix named "burn after reading": clients should burn their quote payment objects after verifying them. For example, the v29 price-feed quotes are received by v46-scaledPriceAuthority, where priceAuthorityTransform.js could do:

```js
const { value: sourceQuoteValue } = await E(sourceQuoteIssuer).getAmountOf(
  sourceQuotePayment,
);
await E(sourceQuoteIssuer).burn(sourceQuotePayment); // burn after reading
```

We would need similar changes in all other places that receive quote payments. In particular, the scaled/aggregated quotes produced by v46-scaledPriceAuthority are delivered to v48-vaultFactory, where something needs its own burn() call.

Another approach would be to have the producing code rotate through recovery purses every hour or so (sketched after this list):

  • mint Payments as before, but immediately deposit that first payment into a separate recovery Purse, then .withdraw() a new one from that purse and send the second payment to the client
  • then, once an hour, drop the old purse and start using a new one, relying upon GC to delete the recovery set, and thus forget about all the old payments (if these represented real assets, this would lose funds, but quotes aren't worth anything)
  • note that this doesn't help remediate the existing payments, because those are held by the recovery set associated with the Issuer's single mintRecoveryPurse, which is closed over by mintPayment() and cannot be changed or destroyed short of a vat upgrade which first modifies baggage to remove the reference
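
Here is a hedged sketch of that rotation, assuming a standard ERTP-style purse/mint API; the helper names (`makeRotatingQuoteAuthenticator`, `rotate`) are hypothetical:

```js
import { E } from '@endo/far';
import { AmountMath } from '@agoric/ertp';

// Hypothetical helper sketching the purse-rotation idea above.
const makeRotatingQuoteAuthenticator = quoteKit => {
  let rotatingPurse = E(quoteKit.issuer).makeEmptyPurse();

  const authenticateQuote = async quote => {
    const quoteAmount = AmountMath.make(quoteKit.brand, harden(quote));
    // mint, then immediately deposit the fresh payment so the issuer's
    // mintRecoveryPurse stops tracking it...
    const minted = await E(quoteKit.mint).mintPayment(quoteAmount);
    await E(rotatingPurse).deposit(minted);
    // ...and hand out a withdrawal from the rotating purse instead; that
    // withdrawal lands in the rotating purse's recovery set
    const quotePayment = await E(rotatingPurse).withdraw(quoteAmount);
    return harden({ quoteAmount, quotePayment });
  };

  // call this every hour or so: dropping the old purse lets GC delete its
  // recovery set, and with it every undeposited quote payment it kept alive
  const rotate = () => {
    rotatingPurse = E(quoteKit.issuer).makeEmptyPurse();
  };

  return harden({ authenticateQuote, rotate });
};
```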

In the longer term, we should find a lighter-weight means to verify the authenticity of price quotes. ERTP Payments offer both verification (through issuer.getAmountOf()) and exclusive transfer (through purse.deposit(payment)), but this use case does not require exclusive transfer. The ocap community provides other tools for mere verification, like the Branding pattern. The "issuer" replacement can retain a WeakMap of quote objects, and offer an API to perform validQuotes.has(allegedQuote).
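
A minimal sketch of what such a verifier could look like; the names `makeQuoteRegistry`, `publishQuote`, and `isValidQuote` are hypothetical, and a durable contract would use a weak store from its zone/baggage rather than a plain WeakSet:

```js
import { Far } from '@endo/far';

// Hypothetical sketch of a Branding-style quote verifier. A real contract
// would keep validQuotes in a durable weak store from baggage/zone.
const makeQuoteRegistry = () => {
  const validQuotes = new WeakSet();

  // wrap each quote in a Remotable so its identity is preserved when it is
  // passed to subscribers and later presented back for verification
  const publishQuote = quoteData => {
    const quoteObj = Far('PriceQuote', { getQuote: () => quoteData });
    validQuotes.add(quoteObj);
    return quoteObj;
  };

  // mere verification: no Payment, no deposit/burn, no recovery set to leak
  const isValidQuote = allegedQuote => validQuotes.has(allegedQuote);

  return Far('QuoteRegistry', { publishQuote, isValidQuote });
};
```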

Remediation

In addition to fixing new price quotes, we also need to dispose of the hundreds of thousands of existing ones. One idea from today's meeting:

  • periodically (perhaps once per price quote), the producer could access mintRecoveryPurse, fetch its recovery set, iterate through a few dozen payments, and burn them all (see the sketch after this list)
    • to avoid invalidating a quote before a consumer gets a chance to verify it, check the timestamp on the quote, and only burn it if it's at least an hour older than the new quote being created
    • this would spread the work out over time: if we burn ten old payments upon each new 5-minute quote, we'd remediate all 120k from v29 in about 41 days, without causing an undue amount of work for any new quote
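
A hedged sketch of that incremental sweep; it assumes `getRecoverySet()` yields an iterable of payments and that each quote value carries a `timestamp` field (both are assumptions about the real data shapes):

```js
// Hedged sketch of the incremental remediation described above. Assumes
// getRecoverySet() yields payments we can iterate and that each quote
// amount's value carries a timestamp; adjust to the real data shapes.
const SECONDS_PER_HOUR = 60n * 60n;
const BATCH_SIZE = 10;

const burnOldQuotes = async (quoteKit, mintRecoveryPurse, now) => {
  const recoverySet = await E(mintRecoveryPurse).getRecoverySet();
  let burned = 0;
  for (const oldPayment of recoverySet) {
    if (burned >= BATCH_SIZE) break;
    const amount = await E(quoteKit.issuer).getAmountOf(oldPayment);
    // assumed shape: a 'set' amount whose single element has a timestamp
    const [quote] = amount.value;
    if (quote.timestamp + SECONDS_PER_HOUR < now) {
      await E(quoteKit.issuer).burn(oldPayment);
      burned += 1;
    }
  }
};
```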

Another approach would be to have an upgraded producer vat immediately perform mintRecoveryPurse.recoverAll() (perhaps assuming that no quote payments are outstanding at the moment of upgrade, or accepting one quote getting invalidated before reading). However we need to be cognizant of the size of the recovery set: this would cause v29 to immediately delete about a million items from its vatstore, requiring maybe ten million syscalls, all in a single crank. Who knows how long that might take. In addition, that would remove a million rows from the IAVL tree, and we know that IAVL pruning takes a long time when there has been a lot of churn, so each validator might experience a significant (vote-missing) delay some arbitrary number of blocks after the upgrade (depending upon their individual pruning configuration).

Note that different approaches require upgrades to different vats, and other bugs make some vats more practical to upgrade than others. In particular, we'd like to fix the price-feed code before making a lot of new deployments, to avoid adding more load to the system, but eventually we need to fix (and also remediate) the existing vats.

cc @Chris-Hibbert @erights

@Chris-Hibbert (Contributor) commented:

@warner wrote:

I did some estimates on the 10-at-a-time thing; I think that's a good way to approach it.

As of 01-dec-2023, for bug #8400 (Quote Payment), we've got 218k payment objects to delete in v46-scaledPriceAuthorityATOM, 291k in v29-ATOM-USD-price_feed, 98k in v68-stATOM-USD_price_feed, and 74k in v69-scaledPriceAuthority. At the current PushPrice rate (one every 60 seconds), we'll clear the v46 backlog in 15 days, causing a 1.2-second BringOutYourDead every 30 minutes. The v29 backlog will have a similar load but take proportionally longer (maybe 20 days).
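
For reference, the arithmetic behind those estimates (ten payments burned per PushPrice, one PushPrice every 60 seconds):

```js
// days to clear a backlog at 10 burns per PushPrice, one PushPrice per minute
const daysToClear = backlog => ((backlog / 10) * 60) / (24 * 60 * 60);
daysToClear(218_000); // v46: ~15.1 days
daysToClear(291_000); // v29: ~20.2 days
```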

@Chris-Hibbert (Contributor) commented:

#8498 is an update to #8418 that deletes the vestigial payments incrementally.

warner (Member, Author) commented Feb 8, 2024

Note to self: to measure the size of this problem, use SwingSet/misc-tools/categorize-kvstore.js:

% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json ingest run-17-swingstore.sqlite
% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json many-voms |grep "quote payment"

vatID  kindID tag                  defined referenced recognized exported sample
v29        22 quote payment         325384     325372     650744        0 {}
v46        22 quote payment         244029     244029     488058    81361 {}
v68        22 quote payment         144240     144240     288480        0 {}
v69        22 quote payment         108180     108180     216360    36080 {}

mergify bot added a commit that referenced this issue Apr 29, 2024
refs: #8740
closes: #8918
refs: #8400

## Description

Add a new Auction instance in A3P, so #8757 can make use of it. Also provides upgrade proposals which can be applied to MainNet and other chains.

### Security Considerations

N/A

### Scaling Considerations

This is largely in service of #8400, which reports that priceFeed vats
are accumulating garbage. This PR gives a new auction which can rely on
new priceFeeds. The existing auction is not upgradeable and its
pricefeeds can't be updated.

### Documentation Considerations

No user-visible changes to behavior.

### Testing Considerations

Tested in A3P

### Upgrade Considerations

Auctions are not upgradeable, so we have to replace them and update
their clients.
mergify bot added a commit that referenced this issue May 6, 2024
closes: #8049
closes: #8740
closes: #8868
closes: #8918
closes: #8981
closes: #8079
refs: #8400
closes: #8735
closes: #7873
closes: #8726
closes: #7954
closes: #8757
closes: #8728 
closes: #8789

## Description

Upgrade **VaultFactory** in A3P, relying on the new PriceFeeds and auctions. The actual upgrade waits for the priceFeeds to start supplying prices, so there won't be any gap in priceUpdates.

When the upgrade is finished, we also update the auctioneerKit and
Auction instance in the bootstrap environment.

This PR demonstrates that VaultFactory can be upgraded even though
governance is not persistent (#8123).

### Security Considerations

N/A

### Scaling Considerations

This is largely in service of #8400, which reports that priceFeed vats
are accumulating garbage. This PR switches to new priceFeeds, which
won't have that problem, though cleaning up the existing vats is a task
for the future.

### Documentation Considerations

No changes to user-visible behavior.

### Testing Considerations

A3P tests verify that vaultFactory has been upgraded and that a new Auctioneer is running and receiving prices. They also verify that when prices drop, assets are sold via the auction, the bidder gets the proceeds, and the vaults are liquidated or reconstituted appropriately.

### Upgrade Considerations

Upgrade all the vats!
@Chris-Hibbert (Contributor) commented:

After #9283 and #9382, we'll no longer have an active leak. The remaining issue is how to deal with the accumulated storage, and that's addressed in #8928. I'm reassigning this to @warner.

mergify bot added a commit that referenced this issue May 24, 2024
refs: #8400
refs: #8498
closes: #9371

## Description

#8400 reported that the priceFeed vats hold onto old quotes in their
recoverySet, preventing them from being collected. #9283 applied the
fixes to master. These fixes address the growth in priceFeed vats and
Zoe, but scaledPriceAuthorities were still growing. This resolves that
problem by upgrading scaledPriceAuthority contracts to not use their
recoverySets.

<details>
  <summary><b>Expand</b> for performance graphs</summary>
  
<img width="809" alt="image"
src="https://github.com/Agoric/agoric-sdk/assets/13372636/889bb283-89c8-434f-8a67-efa56d0382ad">

Kernel allocation is in blue, and the scale is on the left. It varies from 48862 to 49743, with a small amount of long-term growth. The other active vats (v9=Zoe, v29=ATOM-USD_priceFeed, v43=wallet, v72=ATOM-USD_priceFeed, v74=auctioneer) use the scale on the right, with Zoe varying tightly around 3600, and the others low and stable.

scaledPriceAuthority-ATOM doesn't have enough variation to be worth
graphing.

</details>

### Security Considerations

Upgrade existing contracts. No new vulnerabilities.

### Scaling Considerations

This addresses the largest known category of growth on the chain.

### Documentation Considerations

Add some documentation on creating proposals.

### Testing Considerations

Tested in A3P. We should exercise all the clients of priceFeeds in our
test environments.

### Upgrade Considerations

This PR includes a proposal that will upgrade all vats with `scaledPriceAuthority` in their label. That should work on our test chains as well as MainNet. These changes should be included in the next release.
@aj-agoric commented:

@warner to make another ticket describing the planned remediation. We will close this ticket once the sub-tickets have landed in upgrade 16.

warner (Member, Author) commented Jun 14, 2024

OK, #9483 will track the creation and deployment of a chain-upgrade step that terminates (and initiates the slow deletion of) the old price-feed vats. That, together with (and preceded by) the deployment of the #9283 / #9382 fixes to mainnet, will complete the remediation of this problem, so I'm closing this ticket.

warner closed this as completed Jun 14, 2024
mergify bot pushed a commit that referenced this issue Oct 18, 2024
refs: #9886

## Description

Some tests in DevNet indicate that the Oracles need the roundID in vstorage to be reset so they can sync.

### Security Considerations

Not a security issue.

### Scaling Considerations

Not directly a scaling concern (see #8400).

### Documentation Considerations

Not user-visible. It would be nice if the protocol between oracles and fluxAggregators were documented somewhere.

### Testing Considerations

Updated the a3p tests to increment the roundId before the upgrade and show that this upgrade resets it.

### Upgrade Considerations

This change is to the fluxAggregator which is already being started afresh in this proposal.
mergify bot added a commit that referenced this issue Oct 25, 2024
closes: #9370

## Description

The priceFeed proposal has been tested as an independent coreEval, but the current plan calls for it to be included in Upgrade 18. This adds it to `upgrade.go` and moves the tests to `n:upgrade-next`.

### Security Considerations

No additional security implications.

### Scaling Considerations

Addresses pre-existing scaling concerns (see #8400), and adds no new scaling issues.

### Documentation Considerations

None

### Testing Considerations

An existing test is moved to a new location.

The current version seems to tickle #10292, so tests may not pass at this point.

### Upgrade Considerations

Targeted for upgrade 18.