old vow watch not woken up by fulfillment in new incarnation. #9126

erights · 2024-03-21T03:44:27Z

Bug isolated and demonstrated at #9125 .

Isolates the bug that #9097 is stuck on. As isolated, the problem does reproduce.

I'm following the general testing pattern used in #9097, especially in test-async-flow.js, which is where I'm having this problem. There, see

agoric-sdk/packages/zone/test/async-flow/test-async-flow.js

Lines 181 to 182 in 9a00502

    
           // TODO I shouldn't need to do this. 
        
           await asyncFlow.wake();

I shouldn't need the asyncFlow.wake() call. But if I remove it I get the same symptom isolated here:

If I watch with an unresolved durable vow and durable watcher in the first incarnation, and then I resolve that durable vow in the second incarnation, the durable watcher that was watching that vow should now wake up. It doesn't.

@michaelfig , is this a bug in watching durable vows across incarnations?

The text was updated successfully, but these errors were encountered:

michaelfig · 2024-03-21T17:08:25Z

After studying the problem, I'm now convinced the only issue is that the nextLife implementation doesn't properly reject platform promises being watched by earlier incarnations the way they would be under SwingSet. That's a kernel feature, and vows rely on those "vat upgraded" rejections to function correctly.

With nextLife, you only get a faked-out liveslots instance, which does no special handling of native promises from former incarnations, resulting in them never being resolved.

So, unless we invest the time to add more functionality to the nextLife implementation, I'd recommend switching to a more faithful upgradeTest framework.

warner · 2024-03-21T17:37:16Z

Yeah, maybe we could add a debug hook to liveslots that nextLife could call, to reject all locally-decided-and-watched promises, to replace what the kernel would do in a more-real environment. It would be a little bit like the old dispatch.stopVat, which we removed because we didn't want to rely upon the old incarnation doing anything (in case it was so broken that it couldn't participate in upgrade). The old stop-vat.js (in git) might be useful for inspiration, although I don't remember if we implemented reject-local-promises before or after we removed stopVat.

closes: #9302 refs: #9125, #9126 #9153 #9154, #9280 ## Description Upgrade while suspended at `await` points! Uses membrane to log and replay everything that happened before each upgrade. In the first incarnation, somewhere, using a ***closed*** async function argument ```js const wrapperFunc = asyncFlow( zone, 'funcName`, async (...) => {... await ...; ...}, ); ``` then elsewhere, as often as you'd like ```js const outcomeVow = wrapperFunc(...); ``` For all these `asyncFlow` calls that happened in the first incarnation, in the first crank of all later incarnations ```js asyncFlow( zone, 'funcName`, async (...) => {... await ...; ...}, ); ``` with async functions that reproduce the original's logged behavior. In these later incarnations, you only need to capture the returned `wrapperFunc` if you want to create new activations. Regardless, the old activations continue. #### Future Growth I designed this PR so it could grow in the following ways: - TODO: The membrane should use the `HandledPromise` API to make proper remote presences and handled promises, so that the guest function can use `E` on objects or promises it receives from the host as expected. I commented out the additional ops needed for these: `doSend` and `checkSend`. - TODO: Currently, I assume that the host side only presents vows, not promises. However, imported remote promises can be stored durably, and be durably watched, so the membrane should be extended to handle those. - TODO: We currently impose the restriction that the guest cannot export to the host guest-created remotables or promises. (It can pass back to the host remotables or promises it received from the host.) I commented out the additional ops needed for these: `doCall`, `checkReturn` and `checkThrow`. I wrote the `bijection` and `equate` subsystems so that old durable host wrappers can be hooked back up on replay to the corresponding new guest remotables and promises. ### Security Considerations Nothing enforces that the argument async function is closed, i.e., does not capture (lexically "close over") any direct access to mutable state or ability to cause effects. If it does, it still cannot harm anything but itself. But it -- or its replayings -- may be more confusable, and so more vulnerable to confusion attacks. Since this entire framework itself is written as a helper (well, a huge helper) with no special privilege, it cannot be used to do anything that could not have otherwise been done. It is not a source of new authority. See caveats in Upgrade Considerations about isolation of effects following a replay failure. ### Scaling Considerations We assume that the total number of async functions, i.e., calls to `asyncFlow`, are low cardinality. This is essential to the design, in exactly the same sense as our assumption that exoClasses are low cardinality. The RAM cost is proportional to the number of these. The current implementation further assumes that the total number of activations of these replayable async functions are low cardinality. This will likely be a scaling problem at some point, but is unlikely to be painful for the initial expected use cases. The current implementation imposes two simplifying restrictions that allow us to relieve some of this memory pressure: Currently, the async function argument cannot make and export new remotables or promises to the host. Thus, once its outcomeVow is settled, its job is done. There is very little more it can do that is observable. Thus, once this outcome is settled, the activation is shut down and most of the memory it held onto is dropped. Of the activations not shut down, they must replay from the beginning in each incarnation. If a given activation has a long history of past activity, this can become expensive. How do we verify in CI that when an asyncFlow is in use & when it has completed, resource usage in RAM & on disk meet our expectations? The PR assumes `low cardinality` of asyncFlows. what scale is `low cardinality` - Is 10^3, 10&5? What is the risk if cardinality is too high? ### Documentation Considerations For the normal developer documentation, `asyncFlow` should make things simpler and more like "just JavaScript". The membrane translates between host vows and guest promises, so the async function can simply `await` on promises without needing the `when` from `@agoric/vow`. ### Testing Considerations This PR is structured as a tower of building blocks, where I unit tested each as I went, in bottom up order, in order to build with confidence. Currently, each of these building blocks is also very paranoid about internal consistency checking, so I'd get early indications if I made a mistake. Probably some of this internal consistency checking can be reduced over time, as we gain more static confidence. This PR is currently using the fake upgrade testing framework from the `@agoric/zone` package. This caused bug #9126. Instead, we need to redo all these tests with a real upgrade testing framework, like the one in bootstrapTests. See https://github.com/Agoric/agoric-sdk/blob/master/packages/boot/test/bootstrapTests/test-zcf-upgrade.ts ### Upgrade Considerations The point. In an reviving incarnation, if the async function argument of ```js asyncFlow( zone, 'funcName`, async (...) => {... await ...; ...}, ); ``` fails to recapitulate the logs of each of its activations, those activations will not have done any damage. They will simply be stuck, with a diagnostic available via ```js adminAsyncFlow.getFailures(), ``` To unstick these, so those stuck activations can continue to make progress, upgrade again using an async function argument that does reproduce the logged behavior of each of its activations. #### Caveat: Imperfect isolation of effects following a replay failure Once a replay failure is detected, we attempt to isolate the replaying activation from its outside world, and to also shut down its further execution as soon as possible. But it may be in the midst of activity with a non-empty stack. Following a replay failure, we have no way to preemptively terminate it without any further execution of the activation. This further execution may therefore be arbitrarily confused. We simply try to isolate it as much as possible, immediately revoking all access it had through the membrane to all authority to cause observable effects. However, - We do not consider `console` logging activity to be an observable effect. These might be caused by diagnostics emitted by this framework in response to its "isolated" confused behavior. - Because we do not consider `console` logging to be an observable effect, we also allow this as an exception to our closed function rule. Messages it sends directly to the console are not logged, and can differ without causing replay failure. During its post-replay-failure confused execution, it can still directly log to the console. - It is not resource limited, so its post-replay confused execution can accidentally engage in resource exhaustion attacks, including infinite loops. However, the vat as a whole is resource limited. An infinite loop will eventually crash the vat, which can then be recovered with yet another upgrade. - Because of metering, an activation that executed successfully in a previous incarnation might not replay correctly, even if it doesn't cause any explicit side-effects. That's because metering is a hidden side-effect of any execution.

mhofman · 2024-09-13T05:34:32Z

add a debug hook to liveslots that nextLife could call, to reject all locally-decided-and-watched promises

Do we actually need a hook to liveslots for this, or could this be done from the outside somehow? Can't we parse the baggage to get the watched promises, and simply inject the rejections like the kernel would do? The part that I'm unsure about is how we would know which promises were decided by the vat (we don't have a kernel or clist in these tests from what I understand).

michaelfig · 2024-09-16T20:36:15Z

how we would know which promises were decided by the vat

Yeah, that's a trick. The current tests don't have any clear boundary between "inside" and "outside" the vat, since both sides share the same compartment without designating what side each of them is on.

If you have anything to sketch out about how I might solve this, the right place to do it is in packages/async-flow/test/prepare-test-ava.js nextLife, where I tried putting a fakeStore dump (baggage is a branch of fakeStore that doesn't contain the watchedPromiseTable):

const { fakeStore } = incarnation;
console.log('@@@ fakeStore.entries());

I don't know enough about liveslots to understand how to get ahold of the watchedPromiseTable contents (given the results of fakeStore.get('watchedPromiseTableID')), traverse them, and queue up rejections for the victims (assuming I can identify them somehow) before returning from nextLife.

mhofman · 2024-09-17T01:14:48Z

both sides share the same compartment

Oh I just realized these tests don't create a full liveslots talking to the kernel over simulated syscalls, but instead instantiate the virtual data part of liveslots only. As such I am not even sure there is any support for "message sends" and "external promise resolution". I'm really confused now.

the watchedPromiseTable contents (given the results of fakeStore.get('watchedPromiseTableID')), traverse them

If we looked from the outside, that'd require parsing the collection as encoded in vatStore. But since we're in a more custom environment, maybe we could get a hold of the collection more directly. Or we might could just hook into maybeExportPromise

agoric-sdk/packages/swingset-liveslots/tools/fakeVirtualSupport.js

Line 273 in a48b2ce

const maybeExportPromise = _vref => false;

erights added the bug Something isn't working label Mar 21, 2024

erights mentioned this issue Mar 21, 2024

test(zone,vow): is this a watch revival bug? #9125

Draft

erights assigned mhofman, erights and michaelfig Mar 21, 2024

mhofman assigned warner Mar 21, 2024

erights mentioned this issue Mar 22, 2024

feat(async-flow): asyncFlow #9097

Merged

erights added the asyncFlow related to membrane-based replay and upgrade of async functions label Apr 25, 2024

erights mentioned this issue Apr 28, 2024

upgrade tests of Orch contracts restarting during a flow #9303

Open

erights removed their assignment Jul 15, 2024

aj-agoric added SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Jul 22, 2024

mhofman linked a pull request Sep 21, 2024 that will close this issue

Add strict liveslots test env with upgrade tools #10126

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

old vow watch not woken up by fulfillment in new incarnation. #9126

old vow watch not woken up by fulfillment in new incarnation. #9126

erights commented Mar 21, 2024

michaelfig commented Mar 21, 2024

warner commented Mar 21, 2024

mhofman commented Sep 13, 2024

michaelfig commented Sep 16, 2024

mhofman commented Sep 17, 2024

old vow watch not woken up by fulfillment in new incarnation. #9126

old vow watch not woken up by fulfillment in new incarnation. #9126

Comments

erights commented Mar 21, 2024

michaelfig commented Mar 21, 2024

warner commented Mar 21, 2024

mhofman commented Sep 13, 2024

michaelfig commented Sep 16, 2024

mhofman commented Sep 17, 2024