Merge branch 'main' into shawn/throttle-with-retries-consumer-changes
shaspitz authored Jun 20, 2023
2 parents 4b133ac + 827eece commit 81d88e3
Showing 8 changed files with 303 additions and 15 deletions.
105 changes: 105 additions & 0 deletions docs/docs/adrs/adr-008-throttle-retries.md
@@ -0,0 +1,105 @@
---
sidebar_position: 7
title: Throttle with retries
---

## ADR 008: Throttle with retries

## Changelog

* 6/9/23: Initial draft

## Status

Accepted

## Context

For context on why the throttling mechanism exists, see [ADR 002](./adr-002-throttle.md).

Note that the terms _slash throttling_ and _jail throttling_ are synonymous, since in replicated security a `SlashPacket` simply jails a validator for downtime infractions.

Currently, the throttling mechanism is designed so that provider logic (the slash meter, etc.) dictates how many slash packets can be handled over time. Throttled slash packets are persisted on the provider, leading to multiple possible issues, namely:

* If slash or vsc matured packets are actually throttled/queued on the provider, state can grow and potentially lead to a DoS attack. We have short-term solutions for this, but overall they come with their own weaknesses. See [#594](https://github.com/cosmos/interchain-security/issues/594).
* If the jailing attack described in [ADR 002](adr-002-throttle.md) were actually carried out with the current throttling design, we would likely have to halt the provider and perform an emergency upgrade and/or migration to clear the queues of slash packets deemed malicious. Alternatively, validators would have to _tough it out_ and wait for the queues to clear, during which time all or most validators would be jailed. Right after being jailed, validators would have to unjail themselves promptly to ensure safety. The synchronous coordination required to maintain safety in such a scenario is not ideal.

So what's the solution? We can improve the throttling mechanism to instead queue/persist relevant data on each consumer, and have consumers retry slash requests as needed.

## Decision

### Consumer changes

Note the consumer already queues up both slash and vsc matured packets via `AppendPendingPacket`. Those packets are dequeued every endblock in `SendPackets` and sent to the provider.

Instead, we will now introduce the following logic on endblock (a sketch follows the list):

* Slash packets will always be sent to the provider once they reach the head of the queue. However, once sent, the consumer will not send any trailing vsc matured packets from the queue until the provider responds with an ack that the slash packet has been handled (i.e., the validator was jailed). That is, slash packets block the sending of trailing vsc matured packets in the consumer queue.
* If two slash packets are at the head of the queue, the consumer will send the first slash packet and wait for a success ack from the provider before sending the second. This should simplify the implementation.
* VSC matured packets at the head of the queue (i.e., NOT trailing a slash packet) can be sent immediately, and do not block any other packets in the queue, since the provider always handles them immediately.
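
The following is a minimal, self-contained sketch of this endblock logic. It is not the actual implementation: the queue layout, the `waitingOnSlashAck` flag, and the ack handler wiring are all illustrative assumptions.

```go
package main

import "fmt"

type packetType int

const (
	slashPacket packetType = iota
	vscMaturedPacket
)

type packet struct {
	typ  packetType
	data string
}

// consumerQueue models the consumer's pending-packet queue plus a flag
// recording an in-flight slash packet that has not yet been acked.
type consumerQueue struct {
	pending           []packet
	waitingOnSlashAck bool
}

// sendPackets sketches the proposed endblock logic: a slash packet at the
// head of the queue is sent once and then blocks all trailing packets until
// the provider acks it; head-of-queue VSCMatured packets flow immediately.
func (q *consumerQueue) sendPackets(send func(packet)) {
	for len(q.pending) > 0 {
		head := q.pending[0]
		if head.typ == slashPacket {
			if !q.waitingOnSlashAck {
				send(head) // stays queued so it can be retried if bounced
				q.waitingOnSlashAck = true
			}
			return // trailing packets (incl. a second slash packet) wait for the ack
		}
		send(head) // VSCMatured at the head: send and dequeue immediately
		q.pending = q.pending[1:]
	}
}

// onSlashAck would be wired into the IBC ack handler: a success ack dequeues
// the slash packet and lifts the blockade; a failure (bounce) ack only clears
// the flag so the packet is retried later.
func (q *consumerQueue) onSlashAck(handled bool) {
	q.waitingOnSlashAck = false
	if handled && len(q.pending) > 0 {
		q.pending = q.pending[1:]
	}
}

func main() {
	q := &consumerQueue{pending: []packet{
		{vscMaturedPacket, "vsc-1"},
		{slashPacket, "slash-1"},
		{vscMaturedPacket, "vsc-2"},
	}}
	send := func(p packet) { fmt.Println("sent:", p.data) }
	q.sendPackets(send) // sends vsc-1 and slash-1; vsc-2 is blocked
	q.onSlashAck(true)  // provider acks: validator was jailed
	q.sendPackets(send) // now sends vsc-2
}
```

Note that a sent slash packet is deliberately left in the queue, so a failure (bounce) ack leaves it in place to be retried.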

To prevent the provider from having to track which slash packets have been rejected, the consumer will have to retry sending slash packets over some period of time. This can be achieved with an on-chain consumer param. A suggested param value would be 1/2 of the provider's `SlashMeterReplenishmentPeriod`, although the exact value doesn't matter too much as long as it is sane.

Note that, to prevent edge-case behavior, a retry will not be attempted until either a success ack or a failure ack has been received from the provider.
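
As a sketch, the retry gate could be expressed as a simple predicate; `retryDelayPeriod` stands in for the on-chain consumer param described above, and all names here are illustrative assumptions:

```go
package consumer

import "time"

// canRetrySlashPacket gates a retry on both conditions described above: an
// ack (success or failure) has been received for the last send, and the
// retry delay has elapsed. retryDelayPeriod would be sourced from the
// on-chain consumer param, e.g. half the provider's
// SlashMeterReplenishmentPeriod.
func canRetrySlashPacket(ackReceived bool, lastSent, now time.Time, retryDelayPeriod time.Duration) bool {
	if !ackReceived {
		return false // never retry while an ack is still outstanding
	}
	return now.Sub(lastSent) >= retryDelayPeriod
}
```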

With the behavior described, the timing at which slash and vsc matured packets are handled on the provider stays very close to the current throttling mechanism. The queuing and blocking logic moves to the consumer, and the two chains have to exchange more messages, but only when the throttling mechanism is triggered.

In the normal case, when few or no slash packets are being sent, VSCMatured packets will not be delayed, and hence unbonding will not be delayed.

### Provider changes

The main change needed for the provider is the removal of queuing logic for slash and vsc matured packets upon being received.

Instead, the provider will consult the slash meter to determine if a slash packet can be handled immediately. If not, the provider will return an ack message to the consumer communicating that the slash packet could not be handled, and needs to be sent again in the future (retried).

VSCMatured packets will always be handled immediately upon being received by the provider.
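
Below is a sketch of the proposed provider-side handling. The ack format, the meter interface, and the non-negative-meter policy are illustrative assumptions; only the overall shape (handle-or-bounce, no queuing) is what this ADR proposes.

```go
package provider

// ackResult is what the provider writes into the IBC acknowledgement
// (an illustrative stand-in, not the actual wire format).
type ackResult byte

const (
	slashHandled ackResult = iota // the validator was jailed
	slashBounced                  // throttled: the consumer must retry later
)

// slashMeter stands in for the provider's existing throttling meter; only
// this minimal interface is assumed here, not its internals.
type slashMeter struct{ gas int64 }

// One possible meter policy: handle while the meter is non-negative, then
// charge the jailed validator's power against it.
func (m *slashMeter) allowsHandling() bool { return m.gas >= 0 }
func (m *slashMeter) charge(power int64)   { m.gas -= power }

// onRecvSlashPacket sketches the proposed handling: no queuing; either
// handle the packet now or bounce it back to the consumer via the ack.
func onRecvSlashPacket(meter *slashMeter, valPower int64, jail func()) ackResult {
	if !meter.allowsHandling() {
		return slashBounced
	}
	jail()
	meter.charge(valPower)
	return slashHandled
}

// onRecvVSCMaturedPacket: always handled immediately on receipt; never
// queued and never bounced.
func onRecvVSCMaturedPacket(handle func()) {
	handle()
}
```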

See the [spec](https://github.com/cosmos/ibc/blob/main/spec/app/ics-028-cross-chain-validation/system_model_and_properties.md#consumer-initiated-slashing), specifically the section on _VSC Maturity and Slashing Order_. Previously, the onus was on the provider to maintain this property by queuing packets and handling them FIFO.

Now this property will be maintained by the consumer sending packets in the correct order, and blocking the sending of VSCMatured packets as needed. Then, the ordered IBC channel will ensure that Slash/VSCMatured packets are received in the correct order on the provider.

The provider's main responsibility regarding throttling will now be to determine whether a received slash packet can be handled (via the slash meter, etc.), and to ack appropriately to the sending consumer.

### Why the provider can handle VSCMatured packets immediately

First we answer: what does a VSCMatured packet communicate to the provider? A VSCMatured packet communicates that a VSC has been applied to a consumer for long enough that infractions committed on the consumer could have already been submitted to the provider.

If the consumer follows the queuing/blocking protocol described above, no bad behavior occurs and the `VSC Maturity and Slashing Order` property is maintained.

If a consumer sends VSCMatured packets too leniently: the consumer is malicious, either sending duplicate vsc matured packets or sending them sooner than the ccv protocol specifies. In this scenario, the provider needs to handle vsc matured packets immediately to prevent DoS, state bloat, or other issues. The only possible negative outcome is that the malicious consumer may not be able to jail a validator who should have been jailed. The malicious behavior only creates a negative outcome for the chain that is being malicious.

If a consumer blocks the sending of VSCMatured packets: the consumer is malicious, blocking vsc matured packets that should have been sent. This delays unbonding only until the VSC timeout period has elapsed, at which point the consumer is removed. Again, the malicious behavior only creates a negative outcome for the chain that is being malicious.

### Splitting of PRs

We could split this feature into two PRs, one affecting the consumer and one affecting the provider, along with a third PR which could set up a clever way to upgrade the provider in multiple steps, ensuring that slash packets queued at upgrade time are handled properly.

## Consequences

* Consumers will now have to manage their own queues and retry logic.
* Consumers still aren't trustless, but the provider is now less susceptible to mismanaged or malicious consumers.
* Recovering from the "jailing attack" is more elegant.
* Some issues like [#1001](https://github.com/cosmos/interchain-security/issues/1001) will now be handled implicitly by the improved throttling mechanism.
* Slash and vsc matured packets can be handled immediately once received by the provider, if the slash meter allows.
* In general, we reduce the amount of computation that happens in the provider end-blocker.

### Positive

* We no longer have to reason about a "global queue" and a "chain specific queue" and keep them in sync. Slash and vsc matured packet queuing is now handled on each consumer individually.
* Due to the above, the throttling protocol becomes less complex overall.
* We no longer have to worry about throttling-related DoS attacks on the provider, since no queuing exists on the provider.

### Negative

* An increased number of IBC packets will be relayed whenever the throttling logic is triggered.
* Consumer complexity increases, since consumers now have to manage queuing themselves and implement packet retry logic.

### Neutral

* Core throttling logic on the provider remains unchanged, i.e., the slash meter, replenishment cycles, etc.

## References

* [EPIC](https://github.com/cosmos/interchain-security/issues/713) tracking the changes proposed by this ADR
* [ADR 002: Jail Throttling](./adr-002-throttle.md)
* [#594](https://github.com/cosmos/interchain-security/issues/594)
48 changes: 48 additions & 0 deletions docs/docs/adrs/adr-009-soft-opt-out.md
@@ -0,0 +1,48 @@
---
sidebar_position: 10
title: Soft Opt-Out
---
## ADR 009: Soft Opt-Out

## Changelog

* 6/13/23: Initial draft of ADR. Feature already implemented and in production.

## Status

Accepted

## Context

Some small validators may not have the resources needed to validate all consumer chains. Therefore, a need exists to allow the bottom `x%` of validators to opt out of validating a consumer chain, meaning downtime infractions for these validators are dropped without ever reaching the provider.

This document specifies a modification to the ccv protocol which allows the bottom x% of the validator set by power to opt out of validating consumer chains without being jailed or otherwise punished for it. The feature is implemented with entirely consumer-side code.

## Decision

A consumer param exists, known as `SoftOptOutThreshold`, which is a string decimal in the range [0, 0.2] that determines the portion of the validator set, by power, that is allowed to opt out of validating that specific consumer.

In every consumer beginblocker, a function is run which determines the so-called _smallest non opt-out voting power_. Validators with voting power greater than or equal to this value must validate the consumer chain, while validators below this value may opt out of validating it.

The smallest non opt-out voting power is recomputed every beginblocker in `UpdateSmallestNonOptOutPower()`. In a nutshell, the method obtains the total voting power of the consumer, iterates through the full valset (ordered by power, ascending) while keeping a running power sum, and once `powerSum / totalPower > SoftOptOutThreshold`, the `SmallestNonOptOutPower` is found and persisted. A sketch of this computation follows.
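
Here is a self-contained sketch of the computation. The names, tie-breaking, and toy valset are illustrative; the real method reads the valset from the store and persists the result.

```go
package main

import (
	"fmt"
	"sort"
)

// updateSmallestNonOptOutPower returns the smallest voting power that must
// still validate, given the full valset's powers and the opt-out threshold.
func updateSmallestNonOptOutPower(powers []int64, softOptOutThreshold float64) int64 {
	sorted := append([]int64(nil), powers...) // valset ordered by power, ascending
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var totalPower int64
	for _, p := range sorted {
		totalPower += p
	}

	var powerSum int64
	for _, p := range sorted {
		powerSum += p
		// Once the ascending cumulative share exceeds the threshold, this
		// validator's power is the smallest that must still validate.
		if float64(powerSum)/float64(totalPower) > softOptOutThreshold {
			return p
		}
	}
	return 0 // unreachable for thresholds below 1, which the param range guarantees
}

func main() {
	powers := []int64{1, 2, 3, 10, 20, 64} // total power = 100
	// With a 5% threshold, the validators with powers 1 and 2 (3% cumulative)
	// may opt out; the validator with power 3 tips the running sum past 5%.
	fmt.Println(updateSmallestNonOptOutPower(powers, 0.05)) // prints 3
}
```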

Then, whenever the `Slash()` interface is executed on the consumer, if the voting power of the validator being slashed is less than `SmallestNonOptOutPower` for that block, the slash request is dropped and never sent to the provider, as in the sketch below.
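
In sketch form (the gate below is illustrative; the real check lives inside the consumer's `Slash()` implementation):

```go
package consumer

// shouldDropSlashRequest reports whether a slash request should be dropped
// on the consumer: validators below the smallest non opt-out power may opt
// out, so their downtime slash requests never reach the provider.
func shouldDropSlashRequest(valPower, smallestNonOptOutPower int64) bool {
	return valPower < smallestNonOptOutPower
}
```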

## Consequences

### Positive

* Small validators can opt out of validating specific consumers without being punished for it.

### Negative

* The bottom `x%` is still part of the total voting power of the consumer chain. This means that if the soft opt-out threshold is set to `10%` for example, and every validator in the bottom `10%` opts out from validating the consumer, then an additional `24%` downtime among the remaining voting power would halt the chain, since a chain halts once more than 1/3 of its total voting power is offline. This may be especially problematic during consumer upgrades.
* In nominal scenarios, consumers with soft opt-out enabled will still construct slash packets for small validators, which are then dropped. This is wasted computation, but necessary to keep the implementation simple. Note that the sdk's [full downtime logic](https://github.com/cosmos/cosmos-sdk/blob/d3f09c222243bb3da3464969f0366330dcb977a8/x/slashing/keeper/infractions.go#L75) is always executed on the consumer, which can be computationally expensive and slow down certain blocks.

### Neutral

* Validators at the bottom of the valset who do not have to validate may receive large delegations that suddenly boost them into the subset that must validate. This may catch such a validator off guard.

## References

* Original issue with some napkin math [#784](https://github.com/cosmos/interchain-security/issues/784)
4 changes: 3 additions & 1 deletion docs/docs/adrs/intro.md
@@ -36,4 +36,6 @@ To suggest an ADR, please make use of the [ADR template](./adr-template.md) prov
| [002](./adr-002-throttle.md) | Jail Throttling | Accepted, Implemented |
| [003](./adr-003-equivocation-gov-proposal.md) | Equivocation governance proposal | Accepted, Implemented |
| [004](./adr-004-denom-dos-fixes) | Denom DOS fixes | Accepted, Implemented |
| [007](./adr-007-pause-unbonding-on-eqv-prop.md) | Pause validator unbonding during equivocation proposal | Proposed |
| [008](./adr-008-throttle-retries.md) | Throttle with retries | Accepted, In-progress |
| [009](./adr-009-soft-opt-out.md) | Soft Opt-out | Accepted, Implemented |
2 changes: 1 addition & 1 deletion sonar-project.properties
@@ -14,7 +14,7 @@ sonar.organization=cosmos
# All golang artifacts
sonar.sources=.
# Do not calculate coverage metrics for statements in these files
sonar.exclusions=**/vendor/**,**/*.pb.go,**/*.pb.gw.go,proto,**/*_test.go,tests/**,testutil/**,legacy_ibc_testing/**,app/consumer/app.go
sonar.exclusions=**/vendor/**,**/*.pb.go,**/*.pb.gw.go,proto,**/*_test.go,tests/**,testutil/**,legacy_ibc_testing/**,docs/**,app/**,cmd/**,
sonar.tests=.
# Run unit and integration tests, but not E2E tests
sonar.test.inclusions=**/*_test.go
103 changes: 96 additions & 7 deletions tests/e2e/actions.go
@@ -5,6 +5,7 @@ import (
"encoding/json"
"fmt"
"log"
"math"
"os/exec"
"strconv"
"strings"
@@ -72,7 +73,7 @@ type StartChainValidator struct {
stake uint
}

func (tr TestRun) startChain(
func (tr *TestRun) startChain(
action StartChainAction,
verbose bool,
) {
@@ -171,6 +172,14 @@ func (tr TestRun) startChain(
chain: action.chain,
validator: action.validators[0].id,
}, verbose)

// store the fact that we started the chain
tr.runningChains[action.chain] = true
fmt.Println("Started chain", action.chain)
if tr.timeOffset != 0 {
// advance time for this chain so that it is in sync with the rest of the network
tr.AdvanceTimeForChain(action.chain, tr.timeOffset)
}
}

type submitTextProposalAction struct {
@@ -489,7 +498,7 @@ type voteGovProposalAction struct {
propNumber uint
}

func (tr TestRun) voteGovProposal(
func (tr *TestRun) voteGovProposal(
action voteGovProposalAction,
verbose bool,
) {
@@ -521,7 +530,7 @@
}

wg.Wait()
time.Sleep(time.Duration(tr.chainConfigs[action.chain].votingWaitTime) * time.Second)
tr.WaitTime(time.Duration(tr.chainConfigs[action.chain].votingWaitTime) * time.Second)
}

type startConsumerChainAction struct {
@@ -531,7 +540,7 @@ type startConsumerChainAction struct {
genesisChanges string
}

func (tr TestRun) startConsumerChain(
func (tr *TestRun) startConsumerChain(
action startConsumerChainAction,
verbose bool,
) {
@@ -1219,8 +1228,8 @@ func (tr TestRun) transferChannelComplete(
executeCommand(chanOpenConfirmCmd, "transferChanOpenConfirm")
}

func executeCommand(cmd *exec.Cmd, cmdName string) {
if verbose != nil && *verbose {
func executeCommandWithVerbosity(cmd *exec.Cmd, cmdName string, verbose bool) {
if verbose {
fmt.Println(cmdName+" cmd:", cmd.String())
}

@@ -1238,7 +1247,7 @@

for scanner.Scan() {
out := scanner.Text()
if verbose != nil && *verbose {
if verbose {
fmt.Println(cmdName + ": " + out)
}
}
@@ -1247,6 +1256,11 @@ func executeCommand(cmd *exec.Cmd, cmdName string) {
}
}

// Executes a command with verbosity specified by CLI flag
func executeCommand(cmd *exec.Cmd, cmdName string) {
executeCommandWithVerbosity(cmd, cmdName, *verbose)
}

type relayPacketsAction struct {
chainA chainID
chainB chainID
@@ -1284,6 +1298,8 @@ func (tr TestRun) relayPacketsGorelayer(
if err != nil {
log.Fatal(err, "\n", string(bz))
}

tr.waitBlocks(action.chainA, 1, 30*time.Second)
}

func (tr TestRun) relayPacketsHermes(
@@ -1466,13 +1482,29 @@ func (tr TestRun) redelegateTokens(action redelegateTokensAction, verbose bool)
if err != nil {
log.Fatal(err, "\n", string(bz))
}

tr.waitBlocks(action.chain, 1, 10*time.Second)
}

type downtimeSlashAction struct {
chain chainID
validator validatorID
}

// takes a string representation of the private key like
// `{"address":"DF090A4880B54CD57B2A79E64D9E969BD7514B09","pub_key":{"type":"tendermint/PubKeyEd25519","value":"ujY14AgopV907IYgPAk/5x8c9267S4fQf89nyeCPTes="},"priv_key":{"type":"tendermint/PrivKeyEd25519","value":"TRJgf7lkTjs/sj43pyweEOanyV7H7fhnVivOi0A4yjW6NjXgCCilX3TshiA8CT/nHxz3brtLh9B/z2fJ4I9N6w=="}}`
// and returns the value of the "address" field
func (tr TestRun) getValidatorKeyAddressFromString(keystring string) string {
var key struct {
Address string `json:"address"`
}
err := json.Unmarshal([]byte(keystring), &key)
if err != nil {
log.Fatal(err)
}
return key.Address
}

func (tr TestRun) invokeDowntimeSlash(action downtimeSlashAction, verbose bool) {
// Bring validator down
tr.setValidatorDowntime(action.chain, action.validator, true, verbose)
@@ -1491,6 +1523,30 @@ func (tr TestRun) setValidatorDowntime(chain chainID, validator validatorID, dow
lastArg = "up"
}

if tr.useCometmock {
// send set_signing_status either to down or up for validator
var validatorAddress string
if chain == chainID("provi") {
validatorAddress = tr.getValidatorKeyAddressFromString(tr.validatorConfigs[validator].privValidatorKey)
} else {
var valAddressString string
if tr.validatorConfigs[validator].useConsumerKey {
valAddressString = tr.validatorConfigs[validator].consumerPrivValidatorKey
} else {
valAddressString = tr.validatorConfigs[validator].privValidatorKey
}
validatorAddress = tr.getValidatorKeyAddressFromString(valAddressString)
}

method := "set_signing_status"
params := fmt.Sprintf(`{"private_key_address":"%s","status":"%s"}`, validatorAddress, lastArg)
address := tr.getQueryNodeRPCAddress(chain)

tr.curlJsonRPCRequest(method, params, address)
tr.waitBlocks(chain, 1, 10*time.Second)
return
}

//#nosec G204 -- Bypass linter warning for spawning subprocess with cmd arguments.
cmd := exec.Command(
"docker",
@@ -1764,6 +1820,8 @@ func (tr TestRun) assignConsumerPubKey(action assignConsumerPubKeyAction, verbos

// TODO: @MSalopek refactor this so test config is not changed at runtime
// make the validator use consumer key
// @POfftermatt I am currently using this for downtime slashing with cometmock
// (I need to find the currently used validator key address)
valCfg.useConsumerKey = true
tr.validatorConfigs[action.validator] = valCfg
}
@@ -1828,3 +1886,34 @@ func (tr TestRun) GetPathNameForGorelayer(chainA, chainB chainID) string {

return pathName
}

// WaitTime waits for the given duration.
// The CometMock version takes a pointer to the TestRun because it must update
// the stored time offset that keeps each chain's time in sync.
// Note that all functions calling WaitTime must therefore also take a pointer to the TestRun.
func (tr *TestRun) WaitTime(duration time.Duration) {
if !tr.useCometmock {
time.Sleep(duration)
} else {
tr.timeOffset += duration
for chain, running := range tr.runningChains {
if !running {
continue
}
tr.AdvanceTimeForChain(chain, duration)
}
}
}

func (tr TestRun) AdvanceTimeForChain(chain chainID, duration time.Duration) {
// cometmock avoids sleeping; instead, time is advanced explicitly for the given chain
method := "advance_time"
params := fmt.Sprintf(`{"duration_in_seconds": "%d"}`, int(math.Ceil(duration.Seconds())))

address := tr.getQueryNodeRPCAddress(chain)

tr.curlJsonRPCRequest(method, params, address)

// wait for 1 block of the chain to get a block with the advanced timestamp
tr.waitBlocks(chain, 1, time.Minute)
}
