Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(swingset): use "dirt" to schedule vat reap/bringOutYourDead #9169

Merged
merged 5 commits into from
Aug 12, 2024

Commits on Aug 12, 2024

  1. refactor(swingset): create vat state non-lazily

    Previously, vat state initialization (e.g. setting counters to zero)
    happened lazily, the first time that `provideVatKeeper()` was
    called. When creating a new vat, the pattern was:
    
      vk = kernelKeeper.provideVatKeeper();
      vk.setSourceAndOptions(source, options);
    
    Now, we initialize both counters and source/options explicitly, in
    `recordVatOptions`, when the static/dynamic vat is first defined:
    
      kernelKeeper.createVatState(vatID, source, options);
    
    In the future, this will make it easier for `provideVatKeeper` to rely
    upon the presence of all vat state keys, especially `options`.
    
    Previously, vatKeeper.getOptions() would tolerate a missing key by
    returning empty options. The one place where this was
    needed (terminating a barely-created vat because startVat() failed,
    using getOptions().critical) was changed, so this tolerance is no
    longer needed, and was removed. The tolerance caused bug #9157 (kernel
    doesn't panic when it should), which continues to be present, but at
    least won't cause a crash.
    
    refs #8980
    warner committed Aug 12, 2024
    Configuration menu
    Copy the full SHA
    6ee1df8 View commit details
    Browse the repository at this point in the history
  2. fix(swingset): use "dirt" to schedule vat reap/bringOutYourDead

    NOTE: deployed kernels require a new `upgradeSwingset()` call upon
    (at least) first boot after upgrading to this version of the kernel
    code. See below for details.
    
    `dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection
    inside a vat, and gives it a chance to drop imported c-list vrefs that
    are no longer referenced by anything inside the vat.
    
    Previously, each vat has a configurable parameter named
    `reapInterval`, which defaults to a kernel-wide
    `defaultReapInterval` (but can be set separately for each vat). This
    defaults to 1, mainly for unit testing, but real applications set it
    to something like 1000.
    
    This caused BOYD to happen once every 1000 deliveries, plus an extra
    BOYD just before we save an XS heap-state snapshot.
    
    This commit switches to a "dirt"-based BOYD scheduler, wherein we
    consider the vat to get more and more dirty as it does work, and
    eventually it reaches a `reapDirtThreshold` that triggers the
    BOYD (which resets the dirt counter).
    
    We continue to track `dirt.deliveries` as before, with the same
    defaults. But we add a new `dirt.gcKrefs` counter, which is
    incremented by the krefs we submit to the vat in GC deliveries. For
    example, calling `dispatch.dropImports([kref1, kref2])` would increase
    `dirt.gcKrefs` by two.
    
    The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use
    patterns, this will trigger a BOYD after ten krefs have been dropped
    and retired. We choose this value to allow the #8928 slow vat
    termination process to trigger BOYD frequently enough to keep the BOYD
    cranks small: since these will be happening constantly (in the
    "background"), we don't want them to take more than 500ms or so. Given
    the current size of the large vats that #8928 seeks to terminate, 10
    krefs seems like a reasonable limit. And of course we don't want to
    perform too many BOYDs, so `gcKrefs: 20` is about the smallest
    threshold we'd want to use.
    
    External APIs continue to accept `reapInterval`, and now also accept
    `reapGCKrefs`, and `neverReap` (a boolean which inhibits all BOYD,
    even new forms of dirt added in the future).
    
    * kernel config record
      * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
      * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
        and `.neverReap`
    * `controller.changeKernelOptions()` still takes `defaultReapInterval`
       but now also accepts `defaultReapGCKrefs`
    
    The APIs available to userspace code (through `vatAdminSvc`) are
    unchanged (partially due to upgrade/backwards-compatibility
    limitations), and continue to only support setting `reapInterval`.
    Internally, this just modifies `reapDirtThreshold.deliveries`.
    
    * `E(vatAdminSvc).createVat(bcap, { reapInterval })`
    * `E(adminNode).upgrade(bcap, { reapInterval })`
    * `E(adminNode).changeOptions({ reapInterval })`
    
    Internally, the kernel-wide state records `defaultReapDirtThreshold`
    instead of `defaultReapInterval`, and each vat records
    `.reapDirtThreshold` in their `vNN.options` key instead of
    `vNN.reapInterval`. The vat-level records override the kernel-wide
    values. The current dirt level is recorded in `vNN.reapDirt`.
    
    NOTE: deployed kernels require explicit state upgrade, with:
    
    ```js
    import { upgradeSwingset } from '@agoric/swingset-vat';
    ..
    upgradeSwingset(kernelStorage);
    ```
    
    This must be called after upgrading to the new kernel code/release,
    and before calling `buildVatController()`. It is safe to call on every
    reboot (it will only modify the swingstore when the kernel version has
    changed). If changes are made, the host application is responsible for
    commiting them, as well as recording any export-data updates (if the
    host configured the swingstore with an export-data callback).
    
    During this upgrade, the old `reapCountdown` value is used to
    initialize the vat's `reapDirt.deliveries` counter, so the upgrade
    shouldn't disrupt the existing schedule. Vats which used `reapInterval
    = 'never'` (eg comms) will get a `reapDirtThreshold.never = true`, so
    they continue to inhibit BOYD. Any per-vat settings that match the
    kernel-wide settings are removed, allowing the kernel values to take
    precedence (as well as changes to the kernel-wide values; i.e. the
    per-vat settings are not sticky).
    
    We do not track dirt when the corresponding threshold is 'never', or
    if `neverReap` is true, to avoid incrementing the comms dirt counters
    forever.
    
    This design leaves room for adding `.computrons` to the dirt record,
    as well as tracking a separate `snapshotDirt` counter (to trigger XS
    heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but
    do not yet expose an API to set it.
    
    Future work includes:
    * upgrade vat-vat-admin to let userspace set `reapDirtThreshold`
    
    New tests were added to exercise the upgrade process, and other tests
    were updated to match the new internal initialization pattern.
    
    We now reset the dirt counter upon any BOYD, so this also happens to
    help with #8665 (doing a `reapAllVats()` resets the delivery counters,
    so future BOYDs will be delayed, which is what we want). But we should
    still change `controller.reapAllVats()` to avoid BOYDs on vats which
    haven't received any deliveries.
    
    closes #8980
    warner committed Aug 12, 2024
    Configuration menu
    Copy the full SHA
    ee58485 View commit details
    Browse the repository at this point in the history
  3. fix(swingset): update initialization API from PR feedback

    makeKernelKeeper() now takes the expected version, and throws if it
    doesn't match. This replaces the version check in kernel.js start().
    warner committed Aug 12, 2024
    Configuration menu
    Copy the full SHA
    ff1e9b2 View commit details
    Browse the repository at this point in the history
  4. fix(cosmic-swingset): call upgradeSwingset at startup

    Any changes will be committed as part of the first block after
    reboot. This should be in-consensus, because all nodes are supposed to
    change kernel versions at the same time, and only during a
    chain-halting upgrade.
    warner committed Aug 12, 2024
    Configuration menu
    Copy the full SHA
    c769606 View commit details
    Browse the repository at this point in the history
  5. fix(cosmic-swingset): add exportCallback interlock

    The swingstore export-data callback gives us export-data records,
    which must be written into IAVL by sending them over to the golang
    side with swingStoreExportCallback . However, that callback isn't
    ready right away, so if e.g. openSwingStore() were to invoke it, we
    might lose those records. Likewise saveOutsideState() gathers the
    chainSends just before calling commit, so if the callback were invoked
    during commit(), those records would be left for a subsequent block,
    which would break consensus if the node crashed before the next
    commit.
    
    This commit adds an `allowExportCallback` flag, to catch these two
    cases. It is enabled at the start of AG_COSMOS_INIT and BEGIN_BLOCK,
    and then disabled just before we flush the chain sends in
    saveOutsideState() (called during COMMIT_BLOCK).
    
    Note that swingstore is within its rights to call exportCallback
    during openSwingStore() or commit(), it just happens to not do so
    right now. If that changes under maintenance, this guard should
    turn a corruption bug into a crash bug
    
    refs #9655
    warner committed Aug 12, 2024
    Configuration menu
    Copy the full SHA
    6547c83 View commit details
    Browse the repository at this point in the history