Skip to content

Latest commit

 

History

History
400 lines (317 loc) · 19.9 KB

rfc-019-config-version.md

File metadata and controls

400 lines (317 loc) · 19.9 KB

RFC 019: Configuration File Versioning

Changelog

  • 19-Apr-2022: Initial draft (@creachadair)
  • 20-Apr-2022: Updates from review feedback (@creachadair)

Abstract

Updating configuration settings is an essential part of upgrading an existing node to a new version of the Tendermint software. Unfortunately, it is also currently a very manual process. This document discusses some of the history of changes to the config format, actions we've taken to improve the tooling for configuration upgrades, and additional steps we may want to consider.

Background

A Tendermint node reads configuration settings at startup from a TOML formatted text file, typically named config.toml. The contents of this file are defined by the github.com/tendermint/tendermint/config.

Although many settings in this file remain valid from one version of Tendermint to the next, new versions of Tendermint often add, update, and remove settings. These changes often require manual intervention by operators who are upgrading their nodes.

I propose we should provide better tools and documentation to help operators make configuration changes correctly during version upgrades. Ideally, as much as possible of any configuration file update should be automated, and where that is not possible or practical, we should provide clear, explicit directions for what steps need to be taken manually. Moreover, when the node discovers incorrect or invalid configuration, we should improve the diagnostics it emits so that the operator can quickly and easily find the relevant documentation, without having to grep through source code.

Discussion

By convention, we are supposed to document required changes to the config file in the UPGRADING.md file for the release that introduces them. Although we have mostly done this, the level of detail in the upgrading instructions is often insufficient for an operator to correctly update their file.

The updates vary widely in complexity: Operators may need to add new required settings, update obsolete values for existing settings, move or rename existing settings within the file, or remove obsolete settings (which are thus invalid). Here are a few examples of each of these cases:

  • New required settings: Tendermint v0.35 added a new top-level mode setting that determines whether a node runs as a validator, a full node, or a seed node. The default value is "full", which means the operator of a validator must manually add mode = "validator" (or set the --mode flag on the command line) for their node to come up in the correct mode.

  • Updated obsolete values: Tendermint v0.35 removed support for versions "v1" and "v2" of the blocksync (formerly "fastsync") protocol, requiring any node using either of those values to update to "v0".

  • Moved/renamed settings: Version v0.34 moved the top-level pprof_laddr setting under the [rpc] section.

    Version v0.35 renamed every setting in the file from snake_case to kebab-case, moved the top-level fast_sync setting into the [blocksync] section as (itself renamed from [fastsync]), and moved all the top-level priv-validator-* settings under a new [priv-validator] section with their prefix trimmed off.

  • Removed obsolete settings: Version v0.34 removed the index_all_keys and index_keys settings from the [tx_index] section; version v0.35 removed the wal-dir setting from the [mempool] section, and version v0.36 removed the [blocksync] section entirely.

While many of these changes are mentioned in the config section of the upgrade instructions, some are not mentioned at all, or are hidden in other parts of the doc. For instance, the v0.34 pprof_laddr change was documented only as an RPC flag change. (A savvy reader might realize that the flag --rpc.pprof_laddr implies a corresponding config section, but it omits the related detail that there was a top-level setting that's been renamed). The lesson here is not that the docs are bad, but to point out that prose is not the most efficient format to convey detailed changes like this. The upgrading instructions are still valuable for the human reader to understand what to expect.

Concrete Steps

As part of the v0.36 development cycle, we spent some time reverse-engineering the configuration changes since the v0.34 release and built an experimental command-line tool called confix, whose job it is to automatically update the settings in a config.toml file to the latest version. We also backported a version of this tool into the v0.35.x branch at release v0.35.4.

This tool should work fine for configuration files created by Tendermint v0.34 and later, but does not (yet) know how to handle changes from prior versions of Tendermint. Part of the difficulty for older versions is simply logistical: To figure out which changes to apply, we need to understand something about the version that made the file, as well as the version we're converting it to.

Discussion point: In the future we might want to consider incorporating this into the node CLI directly, but we're keeping it separate for now until we can get some feedback from operators.

For the experiment, we handled this by carefully searching the history of config format changes for shibboleths to bound the version: For example, the [fastsync] section was added in Tendermint v0.32 and renamed [blocksync] in Tendermint v0.35. So if we see a [fastsync] section, we have some confidence that the file was created by v0.32, v0.33, or v0.34.

But such signals are delicate: The [blocksync] section was removed in v0.36, so if we do not find [fastsync], we cannot conclude from that alone that the file is from v0.31 or earlier -- we have to look for corroborating details. While such "sniffing" tactics are fine for an experiment, they aren't as robust as we might like.

This is especially relevant for configuration files that may have already been manually upgraded across several versions by the time we are asked to update them again. Another related concern is that we'd like to make sure conversion is idempotent, so that it would be safe to rerun the tool over an already-converted file without breaking anything.

Config Versioning

One obvious tactic we could use for future releases is add a version marker to the config file. This would give tools like confix (and the node itself) a way to calibrate their expectations. Rather than being a version for the file itself, however, this version marker would indicate which version of Tendermint is needed to read the file.

Provisionally, this might look something like:

# THe minimum version of Tendermint compatible with the contents of
# this configuration file.
config-version = 'v0.35'

When initializing a new node, Tendermint would populate this field with its own version (e.g., v0.36). When conducting an upgrade, tools like confix can then use this to decide which conversions are valid, and then update the value accordingly. After converting a file marked 'v0.35' to'v0.37', the conversion tool sets the file's config-version to reflect its compatibility.

Discussion point: This example presumes we would keep config files compatible within a given release cycle, e.g., all of v0.36.x. We could also use patch numbers here, if we think there's some reason to permit changes that would require config file edits at that granularity. I don't think we should, but that's a design question to consider.

Upon seeing an up-to-date version marker, the conversion tool can simply exit with a diagnostic like "this file is already up-to-date", rather than sniffing the keyspace and potentially introducing errors. In addition, this would let a tool detect config files that are newer than the one it understands, and issue a safe diagnostic rather than doing something wrong. Plus, besides avoiding potentially unsafe conversions, this would also serve as human-readable documentation that the file is up-to-date for a given version.

Adding a config version would not address the problem of how to convert files created by older versions of Tendermint, but it would at least help us build more robust config tooling going forward.

Stability and Change

In light of the discussion so far, it is natural to examine why we make so many changes to the configuration file from one version to the next, and whether we could reduce friction by being more conservative about what we make configurable, what config changes we make over time, and how we roll them out.

Some changes, like renaming everything from snake case to kebab case, are entirely gratuitous. We could safely agree not to make those kinds of changes. Apart from that obvious case, however, many other configuration settings provide value to node operators in cases where there is no simple, universal setting that matches every application.

Taking a high-level view, there are several broad reasons why we might want to make changes to configuration settings:

  • Lessons learned: Configuration settings are a good way to try things out in production, before making more invasive changes to the consensus protocol.

    For example, up until Tendermint v0.35, consensus timeouts were specified as per-node configuration settings (e.g., timeout-precommit et al.). This allowed operators to tune these values for the needs of their network, but had the downside that individually-misconfigured nodes could stall consensus.

    Based on that experience, these timeouts have been deprecated in Tendermint v0.36 and converted to consensus parameters, to be consistent across all nodes in the network.

  • Migration & experimentation: Introducing new features and updating old features can complicate migration for existing users of the software. Temporary or "experimental" configuration settings can be a valuable way to mitigate that friction.

    For example, Tendermint v0.36 introduces a new RPC event subscription endpoint (see ADR 075) that will eventually replace the existing webwocket-based interface. To give users time to migrate, v0.36 adds an experimental-disable-websocket setting, defaulted to false, that allows operators to selectively disable the websocket API for testing purposes during the conversion. This setting is designed to be removed in v0.37, when the old interface is no longer supported.

  • Ongoing maintenance: Sometimes configuration settings become obsolete, and the cost of removing them trades off against the potential risks of leaving a non-functional or deprecated knob hooked up indefinitely.

    For example, Tendermint v0.35 deprecated two alternate implementations of the blocksync protocol, one of which was deleted entirely (v1) and one of which was scheduled for removal (v2). The blocksync.version setting, which had been added as a migration aid, became obsolete and needed to be updated.

    Despite our best intentions, sometimes engineering designs do not work out. It's just as important to leave room to back out of changes we have since reconsidered, as it is to support migrations forward onto new and improved code.

  • Clarity and legibility: Besides configuring the software, another important purpose of a config file is to document intent for the humans who operate and maintain the software. Operators need adjust settings to keep the node running, and developers need to know what options were in use when something goes wrong so they can diagnose and fix bugs. The legibility of a config file as a human artifact is also thus important.

    For example, Tendermint v0.35 moved settings related to validator private keys from the top-level section of the configuration file to their own designated [priv-validator] section. Although this change did not make any difference to the meaning of those settings, it made the organization of the file easier to understand, and allowed the names of the individual settings to be simplified (e.g., priv-validator-key-file became simply key-file in the new section).

    Although such changes are "gratuitous" with respect to the software, there is often value in making things more legible for the humans. While there is no simple rule to define the line, the Potter Stewart principle can be used with due care.

Keeping these examples in mind, we can and should take reasonable steps to avoid churn in the configuration file across versions where we can. However, we must also accept that part of the reason for having a config file is to allow us flexibility elsewhere in the design. On that basis, we should not attempt to be too dogmatic about config changes either. Unlike changes in the block protocol, for example, which affect every user of every network that adopts them, config changes are relatively self-contained.

There are few guiding principles I think we can use to strike a sensible balance:

  1. No gratuitous changes. Aesthetic changes that do not enhance legibility, avert confusion, or clarity documentation, should be entirely avoided.

  2. Prefer mechanical changes. Whenever it is practical, change settings in a way that can be updated by a tool without operator judgement. This implies finding safe, universal defaults for new settings, and not changing the default values of existing settings.

    Even if that means we have to make multiple changes (e.g., add a new setting in the current version, deprecate the old one, and remove the old one in the next version) it's preferable if we can mechanize each step.

  3. Clearly signal intent. When adding temporary or experimental settings, they should be clearly named and documented as such. Use long names and suggestive prefixes (e.g., experimental-*) so that they stand out when read in the config file or printed in logs.

    Relatedly, using temporary or experimental settings should cause the software to emit diagnostic logs at runtime. These log messages should be easy to grep for, and should contain pointers to more complete documentation (say, issue numbers or URLs) that the operator can read, as well as a hint about when the setting is expected to become invalid. For example:

    WARNING: Websocket RPC access is deprecated and will be removed in
    Tendermint v0.37. See https://tinyurl.com/adr075 for more information.
    
  4. Consider both directions. When adding a configuration setting, take some time during the implementation process to think about how the setting could be removed, as well as how it will be rolled out. This applies even for settings we imagine should be permanent. Experience may cause is to rethink our original design intent more broadly than we expected.

    This does not mean we have to spend a long time picking nits over the design of every setting; merely that we should convince ourselves we could undo it without making too big a mess later. Even a little extra effort up front can sometimes save a lot.

References

Appendix: Research Notes

Discovering when various configuration settings were added, updated, and removed turns out to be surprisingly tedious. To solve this puzzle, we had to answer the following questions:

  1. What changes were made between v0.x and v0.y? This is further complicated by cases where we have backported config changes into the middle of an earlier release cycle (e.g., psql-conn from v0.35.x into v0.34.13).

  2. When during the development cycle were those changes made? This allows us to recognize features that were backported into a previous release.

  3. What were the default values of the changed settings, and did they change at all during or across the release boundary?

Each step of the configuration update plan is commented with a link to one or more PRs where that change was made. The sections below discuss how we found these references.

Tracking Changes Across Releases

To figure out what changed between two releases, we built a tool called condiff, which performs a "keyspace" diff of two TOML documents. This diff respects the structure of the TOML file, but ignores comments, blank lines, and configuration values, so that we can see what was added and removed.

To use it, run:

go run ./scripts/confix/condiff old.toml new.toml

This tool works on any TOML documents, but for our purposes we needed Tendermint config.toml files. The easiest way to get these is to build the node binary for your version of interest, run tendermint init on a clean home directory, and copy the generated config file out. The testdata directory for the confix tool has configs generated from the heads of each release branch from v0.31 through v0.35.

If you want to reproduce this yourself, it looks something like this:

# Example for Tendermint v0.32.
git checkout --track origin/v0.32.x
go get golang.org/x/sys/unix
go mod tidy
make build
rm -fr -- tmhome
./build/tendermint --home=tmhome init
cp tmhome/config/config.toml config-v32.toml

Be advised that the further back you go, the more idiosyncrasies you will encounter. For example, Tendermint v0.31 and earlier predate Go modules (v0.31 used dep), and lack backport branches. And you may need to do some editing of Makefile rules once you get back into the 20s.

Note that when diffing config files across the v0.34/v0.35 gap, the swap from snake_case to kebab-case makes it look like everything changed. The condiff tool has a -desnake flag that normalizes all the keys to kebab case in both inputs before comparison.

Locating Additions and Deletions

To figure out when a configuration setting was added or removed, your tool of choice is git bisect. The only tricky part is finding the endpoints for the search. If the transition happened within a release, you can use that release's backport branch as the endpoint (if it has one, e.g., v0.35.x).

However, the start point can be more problematic. The backport branches are not ancestors of master or of each other, which means you need to find some point in history prior to the change but still attached to the mainline. For recent releases there is a dev root (e.g., v0.35.0-dev, v0.34.0-dev1, etc.). These are not named consistently, but you can usually grep the output of git tag to find them.

In the worst case you could try starting from the root commit of the repo, but that turns out not to work in all cases. We've done some branching shenanigans over the years that mean the root is not a direct ancestor of all our release branches. When you find this you will probably swear a lot. I did.

Once you have a start and end point (say, v0.35.0-dev and master), you can bisect in the usual way. I use git grep on the config directory to check whether the case I am looking for is present. For example, to find when the [fastsync] section was removed:

# Setup:
git checkout master
git bisect start
git bisect bad                 # it's not present on tip of master.
git bisect good v0.34.0-dev1   # it was present at the start of v0.34.
# Now repeat this until it gives you a specific commit:
if git grep -q '\[fastsync\]' config ; then git bisect good ; else git bisect bad ; fi

The above example finds where a config was removed: To find where a setting was added, do the same thing except reverse the sense of the test (if ! git grep -q ...).