From ef0b3cc78e588e9ca61efaf3bbfe1fe8188476e6 Mon Sep 17 00:00:00 2001 From: mpoke Date: Wed, 15 Nov 2023 12:14:29 +0100 Subject: [PATCH] adding context on state-compatibility --- RELEASES.md | 22 ++--- STATE-COMPATIBILITY.md | 212 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 223 insertions(+), 11 deletions(-) create mode 100644 STATE-COMPATIBILITY.md diff --git a/RELEASES.md b/RELEASES.md index acea9528d6..92662b3318 100644 --- a/RELEASES.md +++ b/RELEASES.md @@ -2,6 +2,7 @@ - [Releases](#releases) - [Semantic Versioning](#semantic-versioning) + - [Breaking Changes](#breaking-changes) - [Release Cycle](#release-cycle) - [Stable Release Policy](#stable-release-policy) - [Version Matrix](#version-matrix) @@ -15,17 +16,16 @@ Interchain Security (ICS) follows [semantic versioning](https://semver.org), but - A state-machine breaking change will result in an increase of the MINOR version number (x.Y.z | x > 0). - Any other changes (including node API breaking changes) will result in an increase of the PATCH version number (x.y.Z | x > 0). -> **More context on breaking changes:** -> - A change is considered to be _library API breaking_ if it modifies the integration of ICS on either consumer or provider chains (i.e., it changes the way ICS is used as a library). -> Note that bumping the major version of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) or [IBC](https://github.com/cosmos/ibc-go) will be considered as a library API breaking change. -> - A change is considered to be _state-machine breaking_ if it requires a coordinated upgrade and/or state migration for either consumer or provider chains. -> Note that when bumping the dependencies of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) and [IBC](https://github.com/cosmos/ibc-go) we will only treat patch releases as non state-machine breaking. -> - A change is considered to be _node API breaking_ if it modifies the API provided by a node of either consumer or provider chains. -> This includes events, queries, CLI interfaces. - -❗**State compatibility**: It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version. -For example, v3.2.1 must be state-machine compatible with v3.2.0. -This is to ensure **determinism**, i.e., given the same input, the nodes will always produce the same output. +### Breaking Changes + +A change is considered to be ***library API breaking*** if it modifies the integration of ICS on either consumer or provider chains (i.e., it changes the way ICS is used as a library). +Note that bumping the major version of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) or [IBC](https://github.com/cosmos/ibc-go) will be considered as a library API breaking change. + +A change is considered to be ***state-machine breaking*** if it requires a coordinated upgrade and/or state migration for either consumer or provider chains in order to preserve [state compatibility](./STATE-COMPATIBILITY.md). +Note that when bumping the dependencies of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) and [IBC](https://github.com/cosmos/ibc-go) we will only treat patch releases as non state-machine breaking. + +A change is considered to be ***node API breaking*** if it modifies the API provided by a node of either consumer or provider chains. +This includes events, queries, CLI interfaces. ## Release Cycle diff --git a/STATE-COMPATIBILITY.md b/STATE-COMPATIBILITY.md new file mode 100644 index 0000000000..b443a11aa5 --- /dev/null +++ b/STATE-COMPATIBILITY.md @@ -0,0 +1,212 @@ +# State-Compatibility + +- [State-Compatibility](#state-compatibility) + - [Scope](#scope) + - [Validating State-Compatibility](#validating-state-compatibility) + - [AppHash](#apphash) + - [LastResultsHash](#lastresultshash) + - [Major Sources of State-incompatibility](#major-sources-of-state-incompatibility) + - [Creating Additional State](#creating-additional-state) + - [Changing Proto Field Definitions](#changing-proto-field-definitions) + - [Returning Different Errors Given Same Input](#returning-different-errors-given-same-input) + - [Variability in Gas Usage](#variability-in-gas-usage) + - [Secondary Limitations To Keep In Mind](#secondary-limitations-to-keep-in-mind) + - [Network Requests to External Services](#network-requests-to-external-services) + - [Randomness](#randomness) + - [Parallelism and Shared State](#parallelism-and-shared-state) + - [Hardware Errors](#hardware-errors) + + +It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version. +For example, v3.2.1 must be state-machine compatible with v3.2.0. + +This is to ensure **determinism**, i.e., given the same input, the nodes will always produce the same output. + +State-incompatibility is allowed for major upgrades because all nodes in either the consumer or provider networks +perform it at the same time. Therefore, after the upgrade, the nodes continue functioning in a deterministic way. + +## Scope + +The state-machine scope includes the following areas: + +- All ICS messages including: + - Every msg's ValidateBasic method + - Every msg's MsgServer method + - Net gas usage, in all execution paths + - Error result returned + - State changes (namely every store write) +- AnteHandlers in "DeliverTx" mode +- All `BeginBlock`/`EndBlock` logic + +The following are **NOT** in the state-machine scope: + +- Events +- Queries that are not whitelisted +- CLI interfaces + +## Validating State-Compatibility + +CometBFT ensures state compatibility by validating a number of hashes that can be found [here](https://github.com/cometbft/cometbft/blob/9f76e8da150414ce73eed2c4f248947b657c7587/proto/tendermint/types/types.proto#L70-L77). + +`AppHash` and `LastResultsHash` are the common sources of problems steaming from our work. +To avoid these problems, let's now examine how these hashes work. + +### AppHash + +**Note:** The following explanation is simplified for clarity. + +An app hash is a hash of hashes of every store's Merkle root that is returned by ABCI's `Commit()` from Cosmos-SDK to CometBFT. +Cosmos-SDK [takes an app hash of the application state](https://github.com/osmosis-labs/cosmos-sdk/blob/5c9a51c277d067e0ec5cf48df30a85fae95bcd14/store/rootmulti/store.go#L430), and propagates it to CometBFT which, in turn, compares it to the app hash of the rest of the network. +Then, CometBFT ensures that the app hash of the local node matches the app hash of the network. + +### LastResultsHash + +`LastResultsHash` is the root hash of all results from the transactions in the block returned by the ABCI's `DeliverTx`. + +The [`LastResultsHash`](https://github.com/cometbft/cometbft/blob/v0.34.29/types/results.go#L47-L54) +in CometBFT [v0.34.29](https://github.com/cometbft/cometbft/releases/tag/v0.34.29) contains: + +1. Tx `GasWanted` + +2. Tx `GasUsed` + > `GasUsed` being Merkelized means that we cannot freely reorder methods that consume gas. + > We should also be careful of modifying any validation logic since changing the + > locations where we error or pass might affect transaction gas usage. + > + > There are plans to remove this field from being Merkelized in a subsequent CometBFT release, + > at which point we will have more flexibility in reordering operations / erroring. + +3. Tx response `Data` + + > The `Data` field includes the proto marshalled Tx response. Therefore, we cannot + > change these in patch releases. + +4. Tx response `Code` + + > This is an error code that is returned by the transaction flow. In the case of + > success, it is `0`. On a general error, it is `1`. Additionally, each module + > defines its custom error codes. + > For example, `x/provider` currently has [these error codes](./x/ccv/provider/types/errors.go) defined. + > + > As a result, it is important to avoid changing custom error codes or change + > the semantics of what is valid logic in transaction flows. + +Note that all of the above stem from `DeliverTx` execution path, which handles: + +- `AnteHandler`'s marked as deliver tx +- `msg.ValidateBasic` +- execution of a message from the message server + +The `DeliverTx` return back to the CometBFT is defined [here](https://github.com/cosmos/cosmos-sdk/blob/d11196aad04e57812dbc5ac6248d35375e6603af/baseapp/abci.go#L293-L303). + +## Major Sources of State-incompatibility + +### Creating Additional State + +By erroneously creating database entries that exist in Version A but not in +Version B, we can cause the app hash to differ across nodes running +these versions in the network. Therefore, this must be avoided. + +### Changing Proto Field Definitions + +For example, if we change a field that gets persisted to the database, +the app hash will differ across nodes running these versions in the network. + +Additionally, this affects `LastResultsHash` because it contains a `Data` field that is a marshaled proto message. + +### Returning Different Errors Given Same Input + +```go +// Version A +func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error { + if amount.IsNegative() { + return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive or zero") + } + return nil +} +``` + +```go +// Version B +func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error { + if amount.IsNegative() || amount.IsZero() { + return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive") + } + return nil +} +``` + +Note that now an amount of 0 can be valid in "Version A", but invalid in "Version B". +Therefore, if some nodes are running "Version A" and others are running "Version B", +the final app hash might not be deterministic. + +Additionally, a different error message does not matter because it +is not included in any hash. However, an error code `sdkerrors.ErrInvalidRequest` does. +It translates to the `Code` field in the `LastResultsHash` and participates in +its validation. + +### Variability in Gas Usage + +For transaction flows (or any other flow that consumes gas), it is important +that the gas usage is deterministic. + +Currently, gas usage is being Merklized in the state. As a result, reordering functions +becomes risky. + +Suppose my gas limit is 2000 and 1600 is used up before entering +`someInternalMethod`. Consider the following: + +```go +func someInternalMethod(ctx sdk.Context) { + object1 := readOnlyFunction1(ctx) # consumes 1000 gas + object2 := readOnlyFunction2(ctx) # consumes 500 gas + doStuff(ctx, object1, object2) +} +``` + +- It will run out of gas with `gasUsed = 2600` where 2600 getting merkelized +into the tx results. + +```go +func someInternalMethod(ctx sdk.Context) { + object2 := readOnlyFunction2(ctx) # consumes 500 gas + object1 := readOnlyFunction1(ctx) # consumes 1000 gas + doStuff(ctx, object1, object2) +} +``` + +- It will run out of gas with `gasUsed = 2100` where 2100 is getting merkelized +into the tx results. + +Therefore, we introduced a state-incompatibility by merklezing diverging gas +usage. + +## Secondary Limitations To Keep In Mind + +### Network Requests to External Services + +It is critical to avoid performing network requests to external services +since it is common for services to be unavailable or rate-limit. + +Imagine a service that returns exchange rates when clients query its HTTP endpoint. +This service might experience downtime or be restricted in some geographical areas. + +As a result, nodes may get diverging responses where some +get successful responses while others errors, leading to state breakage. + +### Randomness + +Randomness cannot be used in the state machine, as the state machine must be deterministic. + +**Note:** Iteration order over `map`s is non-deterministic, so to be deterministic +you must gather the keys, and sort them all prior to iterating over all values. + +### Parallelism and Shared State + +Threads and Goroutines might preempt differently in different hardware. Therefore, +they should be avoided for the sake of determinism. Additionally, it is hard +to predict when the multi-threaded state can be updated. + +### Hardware Errors + +This is out of the developer's control but is mentioned for completeness. \ No newline at end of file