Skip to content

Commit

Permalink
adding context on state-compatibility
Browse files Browse the repository at this point in the history
  • Loading branch information
mpoke committed Nov 15, 2023
1 parent ae334c4 commit ef0b3cc
Show file tree
Hide file tree
Showing 2 changed files with 223 additions and 11 deletions.
22 changes: 11 additions & 11 deletions RELEASES.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

- [Releases](#releases)
- [Semantic Versioning](#semantic-versioning)
- [Breaking Changes](#breaking-changes)
- [Release Cycle](#release-cycle)
- [Stable Release Policy](#stable-release-policy)
- [Version Matrix](#version-matrix)
Expand All @@ -15,17 +16,16 @@ Interchain Security (ICS) follows [semantic versioning](https://semver.org), but
- A state-machine breaking change will result in an increase of the MINOR version number (x.Y.z | x > 0).
- Any other changes (including node API breaking changes) will result in an increase of the PATCH version number (x.y.Z | x > 0).

> **More context on breaking changes:**
> - A change is considered to be _library API breaking_ if it modifies the integration of ICS on either consumer or provider chains (i.e., it changes the way ICS is used as a library).
> Note that bumping the major version of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) or [IBC](https://github.com/cosmos/ibc-go) will be considered as a library API breaking change.
> - A change is considered to be _state-machine breaking_ if it requires a coordinated upgrade and/or state migration for either consumer or provider chains.
> Note that when bumping the dependencies of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) and [IBC](https://github.com/cosmos/ibc-go) we will only treat patch releases as non state-machine breaking.
> - A change is considered to be _node API breaking_ if it modifies the API provided by a node of either consumer or provider chains.
> This includes events, queries, CLI interfaces.
**State compatibility**: It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version.
For example, v3.2.1 must be state-machine compatible with v3.2.0.
This is to ensure **determinism**, i.e., given the same input, the nodes will always produce the same output.
### Breaking Changes

A change is considered to be ***library API breaking*** if it modifies the integration of ICS on either consumer or provider chains (i.e., it changes the way ICS is used as a library).
Note that bumping the major version of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) or [IBC](https://github.com/cosmos/ibc-go) will be considered as a library API breaking change.

A change is considered to be ***state-machine breaking*** if it requires a coordinated upgrade and/or state migration for either consumer or provider chains in order to preserve [state compatibility](./STATE-COMPATIBILITY.md).
Note that when bumping the dependencies of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk) and [IBC](https://github.com/cosmos/ibc-go) we will only treat patch releases as non state-machine breaking.

A change is considered to be ***node API breaking*** if it modifies the API provided by a node of either consumer or provider chains.
This includes events, queries, CLI interfaces.

## Release Cycle

Expand Down
212 changes: 212 additions & 0 deletions STATE-COMPATIBILITY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
# State-Compatibility

- [State-Compatibility](#state-compatibility)
- [Scope](#scope)
- [Validating State-Compatibility](#validating-state-compatibility)
- [AppHash](#apphash)
- [LastResultsHash](#lastresultshash)
- [Major Sources of State-incompatibility](#major-sources-of-state-incompatibility)
- [Creating Additional State](#creating-additional-state)
- [Changing Proto Field Definitions](#changing-proto-field-definitions)
- [Returning Different Errors Given Same Input](#returning-different-errors-given-same-input)
- [Variability in Gas Usage](#variability-in-gas-usage)
- [Secondary Limitations To Keep In Mind](#secondary-limitations-to-keep-in-mind)
- [Network Requests to External Services](#network-requests-to-external-services)
- [Randomness](#randomness)
- [Parallelism and Shared State](#parallelism-and-shared-state)
- [Hardware Errors](#hardware-errors)


It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version.
For example, v3.2.1 must be state-machine compatible with v3.2.0.

This is to ensure **determinism**, i.e., given the same input, the nodes will always produce the same output.

State-incompatibility is allowed for major upgrades because all nodes in either the consumer or provider networks
perform it at the same time. Therefore, after the upgrade, the nodes continue functioning in a deterministic way.

## Scope

The state-machine scope includes the following areas:

- All ICS messages including:
- Every msg's ValidateBasic method
- Every msg's MsgServer method
- Net gas usage, in all execution paths
- Error result returned
- State changes (namely every store write)
- AnteHandlers in "DeliverTx" mode
- All `BeginBlock`/`EndBlock` logic

The following are **NOT** in the state-machine scope:

- Events
- Queries that are not whitelisted
- CLI interfaces

## Validating State-Compatibility

CometBFT ensures state compatibility by validating a number of hashes that can be found [here](https://github.com/cometbft/cometbft/blob/9f76e8da150414ce73eed2c4f248947b657c7587/proto/tendermint/types/types.proto#L70-L77).

`AppHash` and `LastResultsHash` are the common sources of problems steaming from our work.
To avoid these problems, let's now examine how these hashes work.

### AppHash

**Note:** The following explanation is simplified for clarity.

An app hash is a hash of hashes of every store's Merkle root that is returned by ABCI's `Commit()` from Cosmos-SDK to CometBFT.
Cosmos-SDK [takes an app hash of the application state](https://github.com/osmosis-labs/cosmos-sdk/blob/5c9a51c277d067e0ec5cf48df30a85fae95bcd14/store/rootmulti/store.go#L430), and propagates it to CometBFT which, in turn, compares it to the app hash of the rest of the network.
Then, CometBFT ensures that the app hash of the local node matches the app hash of the network.

### LastResultsHash

`LastResultsHash` is the root hash of all results from the transactions in the block returned by the ABCI's `DeliverTx`.

The [`LastResultsHash`](https://github.com/cometbft/cometbft/blob/v0.34.29/types/results.go#L47-L54)
in CometBFT [v0.34.29](https://github.com/cometbft/cometbft/releases/tag/v0.34.29) contains:

1. Tx `GasWanted`

2. Tx `GasUsed`
> `GasUsed` being Merkelized means that we cannot freely reorder methods that consume gas.
> We should also be careful of modifying any validation logic since changing the
> locations where we error or pass might affect transaction gas usage.
>
> There are plans to remove this field from being Merkelized in a subsequent CometBFT release,
> at which point we will have more flexibility in reordering operations / erroring.
3. Tx response `Data`

> The `Data` field includes the proto marshalled Tx response. Therefore, we cannot
> change these in patch releases.
4. Tx response `Code`

> This is an error code that is returned by the transaction flow. In the case of
> success, it is `0`. On a general error, it is `1`. Additionally, each module
> defines its custom error codes.
> For example, `x/provider` currently has [these error codes](./x/ccv/provider/types/errors.go) defined.
>
> As a result, it is important to avoid changing custom error codes or change
> the semantics of what is valid logic in transaction flows.
Note that all of the above stem from `DeliverTx` execution path, which handles:

- `AnteHandler`'s marked as deliver tx
- `msg.ValidateBasic`
- execution of a message from the message server

The `DeliverTx` return back to the CometBFT is defined [here](https://github.com/cosmos/cosmos-sdk/blob/d11196aad04e57812dbc5ac6248d35375e6603af/baseapp/abci.go#L293-L303).

## Major Sources of State-incompatibility

### Creating Additional State

By erroneously creating database entries that exist in Version A but not in
Version B, we can cause the app hash to differ across nodes running
these versions in the network. Therefore, this must be avoided.

### Changing Proto Field Definitions

For example, if we change a field that gets persisted to the database,
the app hash will differ across nodes running these versions in the network.

Additionally, this affects `LastResultsHash` because it contains a `Data` field that is a marshaled proto message.

### Returning Different Errors Given Same Input

```go
// Version A
func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error {
if amount.IsNegative() {
return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive or zero")
}
return nil
}
```

```go
// Version B
func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error {
if amount.IsNegative() || amount.IsZero() {
return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive")
}
return nil
}
```

Note that now an amount of 0 can be valid in "Version A", but invalid in "Version B".
Therefore, if some nodes are running "Version A" and others are running "Version B",
the final app hash might not be deterministic.

Additionally, a different error message does not matter because it
is not included in any hash. However, an error code `sdkerrors.ErrInvalidRequest` does.
It translates to the `Code` field in the `LastResultsHash` and participates in
its validation.

### Variability in Gas Usage

For transaction flows (or any other flow that consumes gas), it is important
that the gas usage is deterministic.

Currently, gas usage is being Merklized in the state. As a result, reordering functions
becomes risky.

Suppose my gas limit is 2000 and 1600 is used up before entering
`someInternalMethod`. Consider the following:

```go
func someInternalMethod(ctx sdk.Context) {
object1 := readOnlyFunction1(ctx) # consumes 1000 gas
object2 := readOnlyFunction2(ctx) # consumes 500 gas
doStuff(ctx, object1, object2)
}
```

- It will run out of gas with `gasUsed = 2600` where 2600 getting merkelized
into the tx results.

```go
func someInternalMethod(ctx sdk.Context) {
object2 := readOnlyFunction2(ctx) # consumes 500 gas
object1 := readOnlyFunction1(ctx) # consumes 1000 gas
doStuff(ctx, object1, object2)
}
```

- It will run out of gas with `gasUsed = 2100` where 2100 is getting merkelized
into the tx results.

Therefore, we introduced a state-incompatibility by merklezing diverging gas
usage.

## Secondary Limitations To Keep In Mind

### Network Requests to External Services

It is critical to avoid performing network requests to external services
since it is common for services to be unavailable or rate-limit.

Imagine a service that returns exchange rates when clients query its HTTP endpoint.
This service might experience downtime or be restricted in some geographical areas.

As a result, nodes may get diverging responses where some
get successful responses while others errors, leading to state breakage.

### Randomness

Randomness cannot be used in the state machine, as the state machine must be deterministic.

**Note:** Iteration order over `map`s is non-deterministic, so to be deterministic
you must gather the keys, and sort them all prior to iterating over all values.

### Parallelism and Shared State

Threads and Goroutines might preempt differently in different hardware. Therefore,
they should be avoided for the sake of determinism. Additionally, it is hard
to predict when the multi-threaded state can be updated.

### Hardware Errors

This is out of the developer's control but is mentioned for completeness.

0 comments on commit ef0b3cc

Please sign in to comment.