Add cosmos-sdk vat snapshot/transcript retention configuration (#10032)
Ref #9174
Fixes #9387
Fixes #9386

TODO:
- [ ] #9389

## Description
Adds two consensus-independent cosmos-sdk swingset configuration options for propagation in AG_COSMOS_INIT: `vat-snapshot-retention` ("debug" vs. "operational") and `vat-transcript-retention` ("archival" vs. "operational" vs. "default"). The values were chosen to correspond with [`artifactMode`](https://github.com/Agoric/agoric-sdk/blob/master/packages/swing-store/docs/data-export.md#optional--historical-data). The former defaults to "operational" and the latter defaults to "default", which infers a value from cosmos-sdk `pruning` to allow simple configuration of archiving nodes.
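
As a rough illustration of the "default" inference, here is a minimal JavaScript sketch of the rule. The actual logic in this commit is the Go code in `golang/cosmos/x/swingset/config.go`; the helper name `resolveTranscriptRetention` and its signature are hypothetical.

```javascript
// Hypothetical helper mirroring the rule described above: an empty or
// "default" setting is inferred from the cosmos-sdk `pruning` option.
const TRANSCRIPT_RETENTION_VALUES = ['archival', 'operational'];

const resolveTranscriptRetention = (configured, pruning) => {
  let value = configured;
  if (value === undefined || value === '' || value === 'default') {
    // Nodes that prune nothing are treated as archiving nodes.
    value = pruning === 'nothing' ? 'archival' : 'operational';
  }
  if (!TRANSCRIPT_RETENTION_VALUES.includes(value)) {
    throw Error(
      `value for vat-transcript-retention must be in ${JSON.stringify(
        TRANSCRIPT_RETENTION_VALUES,
      )}`,
    );
  }
  return value;
};
```

Under this sketch, an explicit "archival" or "operational" setting always wins, and `pruning` only matters when the setting is absent or "default".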

It also updates the semantics of TranscriptStore `keepTranscripts: false` configuration to remove items from only the previously-current span rather than from all previous spans when rolling over (to avoid expensive database churn). Removal of older items can be accomplished by reloading from an export that does not include them.
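
The new rollover behavior can be pictured with a toy in-memory model. This is only a sketch under simplifying assumptions, not the swing-store implementation: `makeToyTranscriptStore` and all of its methods are hypothetical names, and real spans live in SQLite.

```javascript
// Toy in-memory model (hypothetical; not the swing-store implementation).
// Each span holds the transcript items recorded between two heap snapshots.
const makeToyTranscriptStore = ({ keepTranscripts = true } = {}) => {
  let keep = keepTranscripts;
  const spans = [{ items: [] }]; // the last element is the current span
  return {
    setKeepTranscripts: value => {
      keep = value;
    },
    addItem: item => {
      spans[spans.length - 1].items.push(item);
    },
    rolloverSpan: () => {
      if (!keep) {
        // New semantics: drop items from only the span that was current
        // until now, leaving any older retained spans untouched.
        spans[spans.length - 1].items = [];
      }
      spans.push({ items: [] });
    },
    itemCounts: () => spans.map(span => span.items.length),
  };
};

const store = makeToyTranscriptStore({ keepTranscripts: true });
store.addItem('a');
store.addItem('b');
store.rolloverSpan(); // span 0 is retained
store.setKeepTranscripts(false); // e.g., reopened with keepTranscripts: false
store.addItem('c');
store.rolloverSpan(); // only span 1 (previously current) loses its items
store.addItem('d');
const counts = store.itemCounts();
```

In this model, span 0 keeps its items even after the switch to `keepTranscripts: false`, which corresponds to the note above that older items are removed only by reloading from an export that omits them.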

### Security Considerations
I don't think this changes any relevant security posture.

### Scaling Considerations
This will reduce the SQLite disk usage for any node that is not explicitly configured to retain snapshots and/or transcripts. The latter in particular is expected to have significant benefits for mainnet (as noted in #9174, about 116 GB ÷ 147 GB ≈ 79% of the database on 2024-03-29 was vat transcript items).

### Documentation Considerations
The new fields are documented in our default TOML template, and captured in a JSDoc type on the JavaScript side.
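
For orientation, the sketch below shows roughly how the JavaScript side turns the two documented fields into the `keepSnapshots` and `keepTranscripts` options. The commit performs this derivation inline in `chain-main.js`; the wrapper function `deriveRetentionFlags` is a hypothetical name used here only to make the logic self-contained.

```javascript
/**
 * @typedef {object} RetentionConfig
 * @property {'debug' | 'operational'} [vatSnapshotRetention]
 * @property {'archival' | 'operational'} [vatTranscriptRetention]
 */

/**
 * Hypothetical wrapper around the inline derivation in this commit.
 *
 * @param {RetentionConfig} [swingsetConfig]
 * @param {Record<string, string | undefined>} [env]
 */
const deriveRetentionFlags = (swingsetConfig = {}, env = {}) => {
  const { vatSnapshotRetention, vatTranscriptRetention } = swingsetConfig;
  // An explicit setting wins; otherwise fall back to XSNAP_KEEP_SNAPSHOTS.
  const keepSnapshots = vatSnapshotRetention
    ? vatSnapshotRetention !== 'operational'
    : ['1', 'true'].includes(env.XSNAP_KEEP_SNAPSHOTS);
  // Without an explicit setting, transcripts are not kept.
  const keepTranscripts = vatTranscriptRetention
    ? vatTranscriptRetention !== 'operational'
    : false;
  return { keepSnapshots, keepTranscripts };
};
```

Note that an explicit `vatSnapshotRetention` of "operational" overrides `XSNAP_KEEP_SNAPSHOTS=1` in this sketch, matching the ternary in the diff.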

### Testing Considerations
This PR extends TranscriptStore coverage to include `keepTranscripts` true vs. false, but I don't see a good way to cover Go→JS propagation other than manually (which I have done). It should be possible to add testing for the use and validation of `resolvedConfig` in AG_COSMOS_INIT handling, but IMO that is best saved for after completion of split-brain (to avoid issues with same-process Go–JS entanglement).

### Upgrade Considerations
This is all kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade). But we should mention the new cosmos-sdk configuration in release notes, because it won't be added to existing app.toml files already in use.
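
Operators with an existing app.toml could add the new keys by hand. A sketch of what that might look like, assuming the `[swingset]` section and key names from the default template in this commit, with the default values shown:

```toml
[swingset]
# "debug" keeps all snapshots; "operational" keeps only the last snapshot.
vat-snapshot-retention = "operational"
# "archival", "operational", or "default" (inferred from 'pruning').
vat-transcript-retention = "default"
```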
mergify[bot] authored Sep 6, 2024
2 parents b31bd09 + 8b3a6d4 commit d6f50e3
Showing 8 changed files with 280 additions and 125 deletions.
9 changes: 9 additions & 0 deletions golang/cosmos/util/util.go
@@ -4,6 +4,15 @@ import (
"github.com/spf13/viper"
)

func IndexOf[T comparable](a []T, x T) int {
for i, s := range a {
if s == x {
return i
}
}
return -1
}

func NewFileOnlyViper(v1 *viper.Viper) (*viper.Viper, error) {
v2 := viper.New()
v2.SetConfigFile(v1.ConfigFileUsed())
119 changes: 107 additions & 12 deletions golang/cosmos/x/swingset/config.go
@@ -7,6 +7,8 @@ import (
"github.com/spf13/viper"

"github.com/cosmos/cosmos-sdk/client/flags"
pruningtypes "github.com/cosmos/cosmos-sdk/pruning/types"
serverconfig "github.com/cosmos/cosmos-sdk/server/config"
servertypes "github.com/cosmos/cosmos-sdk/server/types"

"github.com/Agoric/agoric-sdk/golang/cosmos/util"
@@ -15,8 +17,24 @@ import (
const (
ConfigPrefix = "swingset"
FlagSlogfile = ConfigPrefix + ".slogfile"

SnapshotRetentionOptionDebug = "debug"
SnapshotRetentionOptionOperational = "operational"

TranscriptRetentionOptionArchival = "archival"
TranscriptRetentionOptionOperational = "operational"
)

var snapshotRetentionValues []string = []string{
SnapshotRetentionOptionDebug,
SnapshotRetentionOptionOperational,
}

var transcriptRetentionValues []string = []string{
TranscriptRetentionOptionArchival,
TranscriptRetentionOptionOperational,
}

// DefaultConfigTemplate defines a default TOML configuration section for the SwingSet VM.
// Values are pulled from a "Swingset" property, in accord with CustomAppConfig from
// ../../daemon/cmd/root.go.
@@ -27,30 +45,74 @@ const DefaultConfigTemplate = `
###############################################################################
[swingset]
# slogfile is the path at which a SwingSet log "slog" file should be written.
# The path at which a SwingSet log "slog" file should be written.
# If relative, it is interpreted against the application home directory
# (e.g., ~/.agoric).
# May be overridden by a SLOGFILE environment variable, which if relative is
# interpreted against the working directory.
slogfile = "{{ .Swingset.SlogFile }}"
# The maximum number of vats that the SwingSet kernel will bring online. A lower number
# requires less memory but may have a negative performance impact if vats need to
# be frequently paged out to remain under this limit.
max_vats_online = {{ .Swingset.MaxVatsOnline }}
max-vats-online = {{ .Swingset.MaxVatsOnline }}
# Retention of vat snapshots, with values analogous to those of export
# 'artifactMode' (cf.
# https://github.com/Agoric/agoric-sdk/blob/master/packages/swing-store/docs/data-export.md#optional--historical-data ).
# * "debug": keep all snapshots
# * "operational": keep only the last snapshot
vat-snapshot-retention = "{{ .Swingset.VatSnapshotRetention }}"
# Retention of vat transcript spans, with values analogous to those of export
# 'artifactMode' (cf.
# https://github.com/Agoric/agoric-sdk/blob/master/packages/swing-store/docs/data-export.md#optional--historical-data ).
# * "archival": keep all transcript spans
# * "operational": keep only necessary transcript spans (i.e., since the
# last snapshot of their vat)
# * "default": determined by 'pruning' ("archival" if 'pruning' is "nothing",
# otherwise "operational")
vat-transcript-retention = "{{ .Swingset.VatTranscriptRetention }}"
`

// SwingsetConfig defines configuration for the SwingSet VM.
// "mapstructure" tag data is used to direct reads from app.toml;
// "json" tag data is used to populate init messages for the VM.
// This should be kept in sync with SwingsetConfigShape in
// ../../../../packages/cosmic-swingset/src/chain-main.js.
// TODO: Consider extensions from docs/env.md.
type SwingsetConfig struct {
// SlogFile is the absolute path at which a SwingSet log "slog" file should be written.
// SlogFile is the path at which a SwingSet log "slog" file should be written.
// If relative, it is interpreted against the application home directory
SlogFile string `mapstructure:"slogfile" json:"slogfile,omitempty"`

// MaxVatsOnline is the maximum number of vats that the SwingSet kernel will have online
// at any given time.
MaxVatsOnline int `mapstructure:"max_vats_online" json:"maxVatsOnline,omitempty"`
MaxVatsOnline int `mapstructure:"max-vats-online" json:"maxVatsOnline,omitempty"`

// VatSnapshotRetention controls retention of vat snapshots,
// and has values analogous to those of export `artifactMode` (cf.
// ../../../../packages/swing-store/docs/data-export.md#optional--historical-data ).
// * "debug": keep all snapshots
// * "operational": keep only the last snapshot
VatSnapshotRetention string `mapstructure:"vat-snapshot-retention" json:"vatSnapshotRetention,omitempty"`

// VatTranscriptRetention controls retention of vat transcript spans,
// and has values analogous to those of export `artifactMode` (cf.
// ../../../../packages/swing-store/docs/data-export.md#optional--historical-data ).
// * "archival": keep all transcript spans
// * "operational": keep only necessary transcript spans (i.e., since the
// last snapshot of their vat)
// * "default": determined by `pruning` ("archival" if `pruning` is
// "nothing", otherwise "operational")
VatTranscriptRetention string `mapstructure:"vat-transcript-retention" json:"vatTranscriptRetention,omitempty"`
}

var DefaultSwingsetConfig = SwingsetConfig{
SlogFile: "",
MaxVatsOnline: 50,
SlogFile: "",
MaxVatsOnline: 50,
VatSnapshotRetention: "operational",
VatTranscriptRetention: "default",
}

func SwingsetConfigFromViper(resolvedConfig servertypes.AppOptions) (*SwingsetConfig, error) {
@@ -66,11 +128,44 @@ func SwingsetConfigFromViper(resolvedConfig servertypes.AppOptions) (*SwingsetCo
return nil, nil
}
v.MustBindEnv(FlagSlogfile, "SLOGFILE")
wrapper := struct{ Swingset SwingsetConfig }{}
if err := v.Unmarshal(&wrapper); err != nil {
// See CustomAppConfig in ../../daemon/cmd/root.go.
type ExtendedConfig struct {
serverconfig.Config `mapstructure:",squash"`
Swingset SwingsetConfig `mapstructure:"swingset"`
}
extendedConfig := ExtendedConfig{}
if err := v.Unmarshal(&extendedConfig); err != nil {
return nil, err
}
ssConfig := &extendedConfig.Swingset

// Validate vat snapshot retention only if non-empty (because otherwise
// it will be omitted, leaving the VM to apply its own defaults).
if ssConfig.VatSnapshotRetention != "" {
if util.IndexOf(snapshotRetentionValues, ssConfig.VatSnapshotRetention) == -1 {
err := fmt.Errorf(
"value for vat-snapshot-retention must be in %q",
snapshotRetentionValues,
)
return nil, err
}
}

// Default/validate vat transcript retention.
if ssConfig.VatTranscriptRetention == "" || ssConfig.VatTranscriptRetention == "default" {
if extendedConfig.Pruning == pruningtypes.PruningOptionNothing {
ssConfig.VatTranscriptRetention = TranscriptRetentionOptionArchival
} else {
ssConfig.VatTranscriptRetention = TranscriptRetentionOptionOperational
}
}
if util.IndexOf(transcriptRetentionValues, ssConfig.VatTranscriptRetention) == -1 {
err := fmt.Errorf(
"value for vat-transcript-retention must be in %q",
transcriptRetentionValues,
)
return nil, err
}
config := &wrapper.Swingset

// Interpret relative paths from config files against the application home
// directory and from other sources (e.g. env vars) against the current
@@ -101,11 +196,11 @@ func SwingsetConfigFromViper(resolvedConfig servertypes.AppOptions) (*SwingsetCo
return filepath.Abs(path)
}

resolvedSlogFile, err := resolvePath(config.SlogFile, FlagSlogfile)
resolvedSlogFile, err := resolvePath(ssConfig.SlogFile, FlagSlogfile)
if err != nil {
return nil, err
}
config.SlogFile = resolvedSlogFile
ssConfig.SlogFile = resolvedSlogFile

return config, nil
return ssConfig, nil
}
7 changes: 4 additions & 3 deletions packages/cosmic-swingset/package.json
@@ -22,7 +22,6 @@
"author": "Agoric",
"license": "Apache-2.0",
"dependencies": {
"@endo/errors": "^1.2.5",
"@agoric/builders": "^0.1.0",
"@agoric/cosmos": "^0.34.1",
"@agoric/deploy-script-support": "^0.10.3",
@@ -34,15 +33,17 @@
"@agoric/vm-config": "^0.1.0",
"@endo/bundle-source": "^3.4.0",
"@endo/env-options": "^1.1.6",
"@endo/far": "^1.1.5",
"@endo/errors": "^1.2.5",
"@endo/import-bundle": "^1.2.2",
"@endo/init": "^1.1.4",
"@endo/far": "^1.1.5",
"@endo/marshal": "^1.5.3",
"@endo/nat": "^5.0.10",
"@endo/patterns": "^1.4.3",
"@endo/promise-kit": "^1.1.5",
"@iarna/toml": "^2.2.3",
"@opentelemetry/sdk-metrics": "~1.9.0",
"@opentelemetry/api": "~1.3.0",
"@opentelemetry/sdk-metrics": "~1.9.0",
"anylogger": "^0.21.0",
"deterministic-json": "^1.0.5",
"import-meta-resolve": "^2.2.1",
59 changes: 38 additions & 21 deletions packages/cosmic-swingset/src/chain-main.js
@@ -12,6 +12,9 @@ import { fork } from 'node:child_process';

import { Fail, q } from '@endo/errors';
import { E } from '@endo/far';
import { makeMarshal } from '@endo/marshal';
import { isNat } from '@endo/nat';
import { M, mustMatch } from '@endo/patterns';
import engineGC from '@agoric/internal/src/lib-nodejs/engine-gc.js';
import { waitUntilQuiescent } from '@agoric/internal/src/lib-nodejs/waitUntilQuiescent.js';
import {
@@ -25,7 +28,6 @@ import {
makeChainStorageRoot,
makeSerializeToStorage,
} from '@agoric/internal/src/lib-chainStorage.js';
import { makeMarshal } from '@endo/marshal';
import { makeShutdown } from '@agoric/internal/src/node/shutdown.js';

import * as STORAGE_PATH from '@agoric/internal/src/chain-storage-paths.js';
@@ -72,7 +74,27 @@ const toNumber = specimen => {
* @typedef {object} CosmosSwingsetConfig
* @property {string} [slogfile]
* @property {number} [maxVatsOnline]
* @property {'debug' | 'operational'} [vatSnapshotRetention]
* @property {'archival' | 'operational'} [vatTranscriptRetention]
*/
const SwingsetConfigShape = M.splitRecord(
// All known properties are optional, but unknown properties are not allowed.
{},
{
slogfile: M.string(),
maxVatsOnline: M.number(),
vatSnapshotRetention: M.or('debug', 'operational'),
vatTranscriptRetention: M.or('archival', 'operational'),
},
{},
);
const validateSwingsetConfig = swingsetConfig => {
mustMatch(swingsetConfig, SwingsetConfigShape);
const { maxVatsOnline } = swingsetConfig;
maxVatsOnline === undefined ||
(isNat(maxVatsOnline) && maxVatsOnline > 0) ||
Fail`maxVatsOnline must be a positive integer`;
};

/**
* A boot message consists of cosmosInitAction fields that are subject to
@@ -100,19 +122,6 @@ const makeBootMsg = initAction => {
};
};

/**
* Extract local Swingset-specific configuration which is
* not part of the consensus.
*
* @param {CosmosSwingsetConfig} [resolvedConfig]
*/
const makeSwingsetConfig = resolvedConfig => {
const { maxVatsOnline } = resolvedConfig || {};
return {
maxVatsOnline,
};
};

/**
* @template {unknown} [T=unknown]
* @param {(req: string) => string} call
@@ -301,13 +310,24 @@ export default async function main(progname, args, { env, homedir, agcc }) {
// here so 'sendToChainStorage' can close over the single mutable instance,
// when we updated the 'portNums.storage' value each time toSwingSet was called.
async function launchAndInitializeSwingSet(initAction) {
const { XSNAP_KEEP_SNAPSHOTS, NODE_HEAP_SNAPSHOTS = -1 } = env;

/** @type {CosmosSwingsetConfig} */
const swingsetConfig = harden(initAction.resolvedConfig || {});
validateSwingsetConfig(swingsetConfig);
const { slogfile, vatSnapshotRetention, vatTranscriptRetention } =
swingsetConfig;
const keepSnapshots = vatSnapshotRetention
? vatSnapshotRetention !== 'operational'
: ['1', 'true'].includes(XSNAP_KEEP_SNAPSHOTS);
const keepTranscripts = vatTranscriptRetention
? vatTranscriptRetention !== 'operational'
: false;

// As a kludge, back-propagate selected configuration into environment variables.
const { slogfile } = initAction.resolvedConfig || {};
// eslint-disable-next-line dot-notation
if (slogfile) env['SLOGFILE'] = slogfile;

const swingsetConfig = makeSwingsetConfig(initAction.resolvedConfig);

const sendToChainStorage = msg => chainSend(portNums.storage, msg);
// this object is used to store the mailbox state.
const fromBridgeMailbox = data => {
@@ -442,7 +462,6 @@ export default async function main(progname, args, { env, homedir, agcc }) {
serviceName: TELEMETRY_SERVICE_NAME,
});

const { XSNAP_KEEP_SNAPSHOTS, NODE_HEAP_SNAPSHOTS = -1 } = env;
const slogSender = await makeSlogSender({
stateDir: stateDBDir,
env,
@@ -455,9 +474,6 @@ export default async function main(progname, args, { env, homedir, agcc }) {
trueValue: pathResolve(stateDBDir, 'store-trace.log'),
});

const keepSnapshots =
XSNAP_KEEP_SNAPSHOTS === '1' || XSNAP_KEEP_SNAPSHOTS === 'true';

const nodeHeapSnapshots = Number.parseInt(NODE_HEAP_SNAPSHOTS, 10);

let lastCommitTime = 0;
@@ -539,6 +555,7 @@ export default async function main(progname, args, { env, homedir, agcc }) {
swingStoreExportCallback,
swingStoreTraceFile,
keepSnapshots,
keepTranscripts,
afterCommitCallback,
swingsetConfig,
});
2 changes: 2 additions & 0 deletions packages/cosmic-swingset/src/launch-chain.js
@@ -331,6 +331,7 @@ export async function launch({
swingStoreTraceFile,
swingStoreExportCallback,
keepSnapshots,
keepTranscripts,
afterCommitCallback = async () => ({}),
swingsetConfig,
}) {
@@ -373,6 +374,7 @@ export async function launch({
traceFile: swingStoreTraceFile,
exportCallback: swingStoreExportSyncCallback,
keepSnapshots,
keepTranscripts,
});
const { kvStore, commit } = hostStorage;

12 changes: 6 additions & 6 deletions packages/swing-store/docs/data-export.md
@@ -188,13 +188,13 @@ Once the new SwingStore is fully populated with the previously-exported data, th
Some of the data maintained by SwingStore is not strictly necessary for kernel execution, at least under normal circumstances. For example, once a vat worker performs a heap snapshot, we no longer need the transcript entries from before the snapshot was taken, since vat replay will start from the snapshot point. We split each vat's transcript into "spans", delimited by heap snapshot events, and the "current span" is the most recent one (still growing), whereas the "historical spans" are all closed and immutable. Likewise, we only really need the most recent heap snapshot for each vat: older snapshots might be interesting for experiments that replay old transcripts with different versions of the XS engine, but no normal kernel will ever need them.
Most validators would prefer to prune this data, to reduce their storage needs. But we can imagine some [extreme upgrade scenarios](https://github.com/Agoric/agoric-sdk/issues/1691) that would require access to these historical transcript spans. Our compromise is to record *validation data* for these historical spans in the export data, but omit the spans themselves from the export artifacts. Validators can delete the old spans at will, and if we ever need them in the future, we can add code that will fetch copies from an archive service, validate them against the export data hashes, and re-insert the relevant entries into the SwingStore.
Most blockchain validator nodes would prefer to prune this data, to reduce their storage needs. But we can imagine some [extreme upgrade scenarios](https://github.com/Agoric/agoric-sdk/issues/1691) that would require access to these historical transcript spans. Our compromise is to record *validation data* for these historical spans in the export data, but omit the spans themselves from the export artifacts. Validators can delete the old spans at will, and if we ever need them in the future, we can add code that will fetch copies from an archive service, validate them against the export data hashes, and re-insert the relevant entries into the SwingStore.
Likewise, each time a heap snapshot is recorded, we cease to need any previous snapshot. And again, as a hedge against even more drastic recovery scenarios, we strike a compromise between minimizing retained data and the ability to validate old snapshots, by retaining only their hashes.
As a result, for each active vat, the first-stage Export Data contains a record for every old transcript span, plus one for the current span. It also contains a record for every old heap snapshot, plus one for the most recent heap snapshot, plus a `.current` record that points to the most recent snapshot. However the exported artifacts may or may not include blobs for the old transcript spans, or for the old heap snapshots.
As a result, for each active vat, the first-stage Export Data contains a record for every old heap snapshot, plus one for the most recent heap snapshot, plus a `.current` record that points to the most recent snapshot. It also contains a record for every old transcript span, plus one for the current span. However the exported artifacts may or may not include blobs for the old heap snapshots, or for the old transcript spans.
The `openSwingStore()` function has an option named `keepTranscripts` (which defaults to `true`), which causes the transcriptStore to retain the old transcript items. A second option named `keepSnapshots` (which defaults to `false`) causes the snapStore to retain the old heap snapshots. Opening the swingStore with a `false` option does not necessarily delete the old items immediately, but they'll probably get deleted the next time the kernel triggers a heap snapshot or transcript-span rollover. Validators who care about minimizing their disk usage will want to set both to `false`. In the future, we will arrange the SwingStore SQLite tables to provide easy `sqlite3` CLI commands that will delete the old data, so validators can also periodically use the CLI command to prune it.
The `openSwingStore()` function has an option named `keepSnapshots` (which defaults to `false`), which causes the snapStore to retain the old heap snapshots. A second option named `keepTranscripts` (which defaults to `true`) causes the transcriptStore to retain the old transcript items. Opening the swingStore with a `false` option does not necessarily delete the old items immediately, but they may get deleted the next time the kernel triggers a heap snapshot or transcript-span rollover. Hosts who care about minimizing their disk usage will want to set both to `false`. In the future, we will arrange the SwingStore SQLite tables to provide easy `sqlite3` CLI commands that will delete the old data, for use in periodic pruning.
When exporting, the `makeSwingStoreExporter()` function takes an `artifactMode` option (in an options bag). This serves to both limit, and provide some minimal guarantees about, the set of artifacts that will be provided in the export. The defined values of `artifactMode` each build upon the previous one:
@@ -218,9 +218,9 @@ While `importSwingStore()`'s options bag accepts the same options as `openSwingS
So, to avoid pruning current-incarnation historical transcript spans when exporting from one swingstore to another, you must set (or avoid overriding) the following options along the way:
* the original swingstore must not be opened with `{ keepTranscripts: false }`, otherwise the old spans will be pruned immediately
* the export must use `makeSwingStoreExporter(dirpath, { artifactMode: 'replay'})`, otherwise the export will omit the old spans
* the import must use `importSwingStore(exporter, dirPath, { artifactMode: 'replay'})`, otherwise the import will ignore the old spans
* subsequent `openSwingStore` calls must not use `keepTranscripts: false`, otherwise the new swingstore will prune historical spans as new ones are created (during `rolloverSpan`).
* the export must use `makeSwingStoreExporter(dirpath, { artifactMode: 'replay' })`, otherwise the export will omit the old spans
* the import must use `importSwingStore(exporter, dirPath, { artifactMode: 'replay' })`, otherwise the import will ignore the old spans
* subsequent `openSwingStore` calls must not use `keepTranscripts: false`, otherwise the new swingstore will prune historical spans as they are replaced during `rolloverSpan`.
## Implementation Details