- 2022-07-15
Accepted
Vaults containing all network funds are composed of keyshares generated by the member nodes of an Asgard at each churn interval, and stored on Bifrost's persistent disk. A number of factors could result in the complete loss of the keyshares file, to name a few:
- Compromised (not necessarily malicious) infrastructure, tooling, operator machines
- Forced provider shutdown due to censorship, unpaid accounts, etc
- Human error during operation
In order to ensure there is no period of time in which loss of keyshares would incur loss of network funds, operators must immediately back up their keyshares after each churn. Currently the official mechanism for this backup is the utility command `make backup` in the `node-launcher` repo, which copies the keyshares to the operator's local machine. This approach requires responsive and proactive node operators to continuously back up their keyshares to protect the network, and there is no way for external parties to verify the existence of node backups.
The move away from Yggdrasil vaults in favor of a greater number of Asgards reduces some risk, since loss of funds now requires losing a supermajority of members, but risk remains. In the ideal scenario, a node operator should be able to securely back up only their mnemonic once and leverage it to recover their node and any corresponding funds.
TBD - there have been many discussions around this, and the options listed in alternatives are still relevant.
The proposed design extends the `TssPool` message sent after vault creation to include a `keyshares_backup` field, which contains the bytes for the newly created keyshares after churn, compressed with `lzma` (to reduce chain bloat) and symmetrically encrypted using the node's mnemonic as the passphrase (the same mnemonic generated at node creation and used for the `thornode` private key). The initial pass of this implementation began before the introduction of the ADR process and is currently under review at https://gitlab.com/thorchain/thornode/-/merge_requests/2235. These keyshares will intentionally skip storage in a KV store in the `thornode` application state to avoid further bloat. Instead, a CLI utility will be provided via `tci` to pull and decrypt the latest keyshare backup for the node from an RPC endpoint: `tci nodes recover-keyshares --address <node-address>`.
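The compress-then-encrypt ordering described above can be sketched roughly as follows. This is an illustration only, not the code under review: the function names are hypothetical, and the cipher is a toy SHA-256 keystream with a PBKDF2-derived key, used purely so the example runs on the Python standard library. It is unauthenticated and must not be used for real keyshares. The point it shows is that compression has to happen before encryption, since ciphertext is effectively incompressible.

```python
import hashlib
import lzma
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (assumption, NOT a vetted cipher): SHA-256 in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_backup(keyshares: bytes, mnemonic: str) -> bytes:
    """Compress with lzma first, then encrypt with a mnemonic-derived key."""
    compressed = lzma.compress(keyshares)
    salt, nonce = os.urandom(16), os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", mnemonic.encode("utf-8"), salt, 100_000)
    stream = _keystream(key, nonce, len(compressed))
    body = bytes(a ^ b for a, b in zip(compressed, stream))
    # Salt and nonce are not secret; they travel with the payload.
    return salt + nonce + body


def decrypt_backup(payload: bytes, mnemonic: str) -> bytes:
    """Reverse of encrypt_backup: decrypt, then decompress."""
    salt, nonce, body = payload[:16], payload[16:32], payload[32:]
    key = hashlib.pbkdf2_hmac("sha256", mnemonic.encode("utf-8"), salt, 100_000)
    stream = _keystream(key, nonce, len(body))
    return lzma.decompress(bytes(a ^ b for a, b in zip(body, stream)))
```

Because the toy cipher carries no authentication tag, a wrong mnemonic is only detected indirectly, when decompression fails. A real implementation would rely on an authenticated scheme instead.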
Sanity checks against mnemonics before encryption:
- Validate BIP39 mnemonic.
- Validate the entropy of the byte-wise probability distribution of the mnemonic (greater than the minimum of 1e8 randomly generated mnemonics).
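As a rough sketch of the entropy check, the following computes the Shannon entropy of the byte distribution of a mnemonic string and compares it to a threshold. The function names and the `min_entropy` parameter are hypothetical; the actual threshold (the minimum observed over 1e8 random mnemonics) is not stated here, so the caller must supply it.

```python
import math
from collections import Counter


def bytewise_entropy(mnemonic: str) -> float:
    """Shannon entropy, in bits per byte, of the mnemonic's byte distribution."""
    data = mnemonic.encode("utf-8")
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_entropy_check(mnemonic: str, min_entropy: float) -> bool:
    """True if the entropy is strictly greater than the supplied minimum.

    min_entropy is an assumption of this sketch: it stands in for the
    empirically derived minimum over randomly generated mnemonics.
    """
    return bytewise_entropy(mnemonic) > min_entropy
```

A degenerate mnemonic such as one word repeated many times has a much narrower byte distribution than a randomly generated one, which is the kind of input this check is meant to reject.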
Sanity checks against encrypted payload before send:
- Check that encrypted output is not equal to input.
- Check that decrypted output equals the input.
- Check that the output does NOT contain the input.
- Check that the output does NOT contain the passphrase.
- Check that the output does NOT contain any word of the passphrase.
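The payload checks listed above could be expressed along these lines. This is a minimal, cipher-agnostic sketch: the function name, its signature, and the failure labels are hypothetical, and `decrypt` is whatever decryption routine the caller pairs with the encryption step.

```python
from typing import Callable, List


def payload_check_failures(
    plaintext: bytes,
    passphrase: str,
    ciphertext: bytes,
    decrypt: Callable[[bytes, str], bytes],
) -> List[str]:
    """Return the names of the failed checks; an empty list means all passed."""
    failures = []
    if ciphertext == plaintext:
        failures.append("output equals input")
    if decrypt(ciphertext, passphrase) != plaintext:
        failures.append("round trip mismatch")
    if plaintext in ciphertext:
        failures.append("output contains input")
    if passphrase.encode("utf-8") in ciphertext:
        failures.append("output contains passphrase")
    # Check each whitespace-separated mnemonic word individually.
    if any(word.encode("utf-8") in ciphertext for word in passphrase.split()):
        failures.append("output contains a passphrase word")
    return failures
```

Under this sketch, a node would refuse to send the message unless the returned list is empty. Note that an empty plaintext always trips the "output contains input" check, which is arguably the desired behavior for an empty keyshare file.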
- Publishing the encrypted keyshares to the chain allows anyone to verify that a sufficient number of keyshares have been preserved such that loss of funds is not possible, so long as NOs have backed up their mnemonic.
- Embedding the shares in the `TssPool` messages ensures that the shares are preserved immediately at the time of creation.
- Although we compress the shares before encrypting to reduce size, this results in some bloat in chain state. This size is dependent on the number of members in an Asgard, but is on the order of 100Kb in current conditions. Breaking the same set of nodes into more Asgards reduces the aggregate size of this bloat.
- Although we add a significant number of checks to prevent it, there is some risk in publishing these keyshares to a location that is publicly visible. Note that most of the vectors we consider a malicious actor could take (infra, supply chain) would result in them having access to the keyshares before they are encrypted and published anyway.
- Only back up a sample (such as 50%) of the keyshares in this form. This mitigates some of the unease in negative #2 and still provides a safety net that reduces the likelihood of losing funds if a large percentage of the network were lost.
The main tradeoff is whether or not to publish the encrypted payload somewhere publicly visible - this is a positive since any person can verify and backup the encrypted keyshares of nodes, and a negative since publishing this data could potentially carry some security risk and also adds to bloat. We will outline the alternatives under consideration below in 2 categories to represent this tradeoff and ignore it in the positives and negatives - in all cases the backup is encrypted.
We could deploy a Postfix instance in the cluster to send an email with the encrypted shares to an address the NO configures, or have the NO pass in something like an S3 endpoint and auth token that would be used to push them to the target service.
- Additional setup and reliance on external services (the provider for the mail server, S3 API, etc).
This would keep the current approach to backup creation and extend `make backup` to also send a transaction with a "heartbeat" message. After a certain buffer of blocks following the churn, nodes which have not sent the heartbeat would begin receiving slash points.
- Requires active participation from node operators to secure backups, could still lose funds if nodes were lost before a supermajority of all vaults engage.
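For illustration, the heartbeat rule in this alternative might look like the sketch below. Every name and parameter is hypothetical, and the size of the block buffer is left to the caller since the text does not specify it.

```python
from typing import Iterable, List, Set


def nodes_to_slash(
    active_nodes: Iterable[str],
    heartbeats: Set[str],
    churn_height: int,
    current_height: int,
    buffer_blocks: int,
) -> List[str]:
    """Nodes that have not sent a backup heartbeat once the buffer has elapsed."""
    if current_height <= churn_height + buffer_blocks:
        # Still inside the grace period after the churn: nobody is slashed.
        return []
    return sorted(n for n in active_nodes if n not in heartbeats)
```

For example, with a churn at height 100 and a 100-block buffer, no node is returned at height 150, while at height 201 every active node without a recorded heartbeat is.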
This would basically require node operators to manage a machine that has persistent authorization to their Kubernetes cluster, and to add `TC_NO_CONFIRM=true NAME=thornode make backup` to a crontab.
- Node operator must maintain, monitor, and secure (it has all the keys) the backup machine separately, since it cannot be on the same infrastructure provider as the node, and must have persistent authorization to the cluster in order to create the backups, which creates additional security risk.
This would be similar to the proposed design, but Bifrost would be extended to handle distribution of the encrypted keyshares to other active nodes instead of posting them on chain. Recovery would require cooperation from the nodes that held the backup. There could be a variant of this approach to only send keyshares to a subset of other nodes - these nodes could be randomly selected or perhaps the other members of the same vault. An additional variant could extend this pattern with a verification message posted on chain, so that one node could signal to the network that it has persisted the encrypted keyshares of another node.
- Additional complexity to add more P2P logic into Bifrost.
Same as the proposed design, but we push backups to IPFS and record the key in the `TssPool` message.
- Additional dependency, complexity, backup point of failure for IPFS integration.
The following questions are generally relevant for any approach taken.
- Symmetric encryption with mnemonic or asymmetric with key (generated from mnemonic)?
Update: It seems devs are mostly satisfied with currently proposed symmetric approach.
- In either case for #1, which encryption library to prefer (stdlib vs something like `age`)?
Update: It seems devs are mostly satisfied with currently proposed usage of `age`.
- ...