Merge pull request #1086 from input-output-hk/jpraynaud/add-productio…
…n-runbook

Add network production runbooks for Aggregator
jpraynaud authored Aug 10, 2023
2 parents f02a536 + d166b86 commit 09dc5e6
Showing 9 changed files with 350 additions and 1 deletion.
19 changes: 19 additions & 0 deletions docs/runbook/README.md
@@ -0,0 +1,19 @@
# Mithril network runbook :shield:

This page gathers the available guides to operate a Mithril network.

:fire: These guides are intended for expert users. The operations they describe can cause irreversible damage or loss for a network.

# Guides

| Operation | Location | Description
|------------|------------|------------
| **Genesis manually** | [manual-genesis](./genesis-manually/README.md) | Proceed to manual (re)genesis of the aggregator certificate chain.
| **Era markers** | [era-markers](./era-markers/README.md) | Create and update era markers on the Cardano chain.
| **Signer registrations monitoring** | [registrations-monitoring](./registrations-monitoring/README.md) | Gather aggregated data about signer registrations (versions, stake, ...).
| **Update protocol parameters** | [protocol-parameters](./protocol-parameters/README.md) | Update the protocol parameters of a Mithril network.
| **Recompute certificates hash** | [recompute-certificates-hash](./recompute-certificates-hash/README.md) | Recompute the certificates hash of an aggregator.
| **Fix terraform lock** | [terraform-lock](./terraform-lock/README.md) | Fix a terraform lock in CD workflows.
| **Manage SSH access to infrastructure** | [ssh-access](./ssh-access/README.md) | Manage SSH access on the VM of the infrastructure for a user.


File renamed without changes.
91 changes: 91 additions & 0 deletions docs/runbook/genesis-manually/README.md
@@ -0,0 +1,91 @@
# Manual genesis of production Mithril network

## Configure environment variables
Export the environment variables:
```bash
export MITHRIL_VM=**MITHRIL_VM**
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Here is an example for the `release-mainnet` network:
```bash
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
```
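
Every later command interpolates these two variables, so it may be worth failing fast when one of them is missing. The following is a minimal sketch; the helper name `require_env` is hypothetical and not part of the Mithril tooling:
```bash
# Hypothetical helper (not part of Mithril): returns non-zero and names
# the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect variable lookup that also works in POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage: stop before running any command that depends on the variables.
# require_env MITHRIL_VM CARDANO_NETWORK || exit 1
```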

## Export the genesis payload to sign

Connect to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Once connected to the aggregator VM, export the environment variables:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

And create genesis dir:
```bash
mkdir -p /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis
```
And connect to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, export the genesis payload to sign:
```bash
/app/bin/mithril-aggregator -vvv genesis export --target-path /mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt
```

Then disconnect from the aggregator container:
```bash
exit
```

Then disconnect from the aggregator VM:
```bash
exit
```

## Sign the genesis payload

Once on your local machine, copy the genesis payload to sign from the aggregator VM:
```bash
scp curry@$MITHRIL_VM:/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt .
```
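
Since the payload is about to be signed with the genesis secret key, it can be reassuring to confirm that the local copy is identical to the file exported on the VM. A possible sketch, assuming GNU `sha256sum` is available on both machines (the helper names `file_digest` and `same_digest` are hypothetical):
```bash
# Print only the SHA-256 digest of a file (relies on GNU coreutils `sha256sum`).
file_digest() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed only if the file's digest equals the digest given as second argument.
same_digest() {
  [ "$(file_digest "$1")" = "$2" ]
}

# Usage (assumption: sha256sum is installed on the aggregator VM as well):
# remote=$(ssh curry@$MITHRIL_VM "sha256sum /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-to-sign.txt" | cut -d ' ' -f 1)
# same_digest genesis-payload-to-sign.txt "$remote" || echo "local payload differs from the VM copy" >&2
```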

Download or build the aggregator on your local machine as explained in this [documentation](https://mithril.network/doc/manual/developer-docs/nodes/mithril-aggregator#download-source).

Then, sign the payload with the genesis secret key:
```bash
./mithril-aggregator -vvv genesis sign --to-sign-payload-path genesis-payload-to-sign.txt --target-signed-payload-path genesis-payload-signed.txt --genesis-secret-key-path genesis.sk
```

## Import the signed genesis payload

Then, copy the signed genesis payload back to the aggregator VM:
```bash
scp ./genesis-payload-signed.txt curry@$MITHRIL_VM:/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/genesis/genesis-payload-signed.txt
```

Then, connect back to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Export the environment variable:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

And connect back to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, import the signed genesis payload:
```bash
/app/bin/mithril-aggregator -vvv genesis import --signed-payload-path /mithril-aggregator/mithril/genesis/genesis-payload-signed.txt
```
71 changes: 71 additions & 0 deletions docs/runbook/protocol-parameters/README.md
@@ -0,0 +1,71 @@
# Update the protocol parameters of a Mithril network

## Introduction

The protocol parameters of a network are currently defined when starting the aggregator of the network.
During startup, the aggregator stores the parameters in its stores and starts using them **3** epochs later. The aggregator broadcasts the protocol parameters to the signers of the network through the `/epoch-settings` route.
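
As a small worked example of the delay, the epoch at which new parameters take effect can be computed from the epoch in which the aggregator was restarted. This is only an illustration of the arithmetic; the helper name `effective_epoch` is hypothetical:
```bash
# Parameters stored at startup are used 3 epochs later.
PARAMETERS_DELAY_EPOCHS=3

# Print the epoch at which parameters stored during epoch $1 take effect.
effective_epoch() {
  echo $(( $1 + PARAMETERS_DELAY_EPOCHS ))
}

# Usage: a restart during epoch 420 makes the new parameters effective at epoch 423.
# effective_epoch 420
```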

## Update parameters of a Mithril network
The aggregator sets the protocol parameters through the `protocol_parameters` configuration parameter, which is a JSON representation of the `ProtocolParameters` type:
```rust
pub struct ProtocolParameters {
    /// Quorum parameter
    pub k: u64,

    /// Security parameter (number of lotteries)
    pub m: u64,

    /// f in phi(w) = 1 - (1 - f)^w, where w is the stake of a participant
    pub phi_f: f64,
}
```

Each parameter can also be set via an environment variable:
- `PROTOCOL_PARAMETERS__K` for `k`
- `PROTOCOL_PARAMETERS__M` for `m`
- `PROTOCOL_PARAMETERS__PHI_F` for `phi_f`
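
To make the two forms concrete, the sketch below sets the three environment variables and prints the matching JSON value. The helper `protocol_parameters_json` is hypothetical, and the values are example values only:
```bash
# Hypothetical helper: build the JSON value of `protocol_parameters`
# from k, m and phi_f (in that order).
protocol_parameters_json() {
  printf '{"k": %s, "m": %s, "phi_f": %s}\n' "$1" "$2" "$3"
}

# The same values expressed through the environment variables listed above.
export PROTOCOL_PARAMETERS__K=5
export PROTOCOL_PARAMETERS__M=100
export PROTOCOL_PARAMETERS__PHI_F=0.6

protocol_parameters_json "$PROTOCOL_PARAMETERS__K" "$PROTOCOL_PARAMETERS__M" "$PROTOCOL_PARAMETERS__PHI_F"
# prints {"k": 5, "m": 100, "phi_f": 0.6}
```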

When setting up a Mithril network with a `terraform` deployment, the protocol parameters are set with a JSON definition.

## Find the workflow used to deploy a Mithril network

Currently, the following [Mithril networks](https://mithril.network/doc/manual/developer-docs/references#mithril-networks) are generally available, and deployed with `terraform`:
- `testing-preview`: with the workflow [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml)
- `pre-release-preview`: with the workflow [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml)
- `release-preprod`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
- `release-mainnet`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
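
The mapping above can be captured in a small lookup, which may help when scripting around several networks. This is a sketch only; the function name `workflow_for_network` is hypothetical:
```bash
# Hypothetical lookup: print the workflow file that deploys a given network.
workflow_for_network() {
  case "$1" in
    testing-preview)                 echo ".github/workflows/ci.yml" ;;
    pre-release-preview)             echo ".github/workflows/pre-release.yml" ;;
    release-preprod|release-mainnet) echo ".github/workflows/release.yml" ;;
    *) echo "unknown network: $1" >&2; return 1 ;;
  esac
}

# Usage:
# workflow_for_network release-mainnet   # prints .github/workflows/release.yml
```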

## Update the protocol parameters

In the deployment matrix of the workflow, update the following value for the targeted network:
```yaml
mithril_protocol_parameters: |
  {
    k     = 5
    m     = 100
    phi_f = 0.6
  }
```

For example, it could be replaced with:
```yaml
mithril_protocol_parameters: |
  {
    k     = 2422
    m     = 20973
    phi_f = 0.2
  }
```

The modifications should be made in a dedicated PR, and the output of the **Plan** job of the terraform deployment should be reviewed carefully to make sure that the change has been taken into account.

## Deployment of the new protocol parameters

The new protocol parameters are deployed and become effective as detailed in the following table:

| Workflow | Deployed at | Effective at
|------------|------------|------------
| [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml) | Merge on `main` branch | **3** epochs later
| [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml) | Pre-release of a distribution | **3** epochs later
| [`.github/workflows/release.yml`](../../../.github/workflows/release.yml) | Release of a distribution | **3** epochs later

For more information about the CD, please refer to [Release process and versioning](https://mithril.network/doc/adr/3).
92 changes: 92 additions & 0 deletions docs/runbook/recompute-certificates-hash/README.md
@@ -0,0 +1,92 @@
# Recompute the certificates hashes of Mithril aggregator

## Configure environment variables
Export the environment variables:
```bash
export MITHRIL_VM=**MITHRIL_VM**
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

Here is an example for the `release-mainnet` network:
```bash
export MITHRIL_VM=aggregator.release-mainnet.api.mithril.network
export CARDANO_NETWORK=mainnet
```

## Make a backup of the aggregator database

Connect to the aggregator VM:
```bash
ssh curry@$MITHRIL_VM
```

Once connected to the aggregator VM, export the environment variables:
```bash
export CARDANO_NETWORK=**CARDANO_NETWORK**
```

And copy the SQLite database file `aggregator.sqlite3`:
```bash
cp /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3 /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3.bak.$(date +%Y-%m-%d)
```
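
Because the rollback depends entirely on this copy, it can be worth verifying it and keeping track of its exact path. A minimal sketch, with the hypothetical helper `backup_file`:
```bash
# Hypothetical helper: copy a file to <file>.bak.<YYYY-MM-DD>, check that the
# copy is byte-identical, and print the backup path on success.
backup_file() {
  src=$1
  dst="$src.bak.$(date +%Y-%m-%d)"
  cp -p "$src" "$dst" && cmp -s "$src" "$dst" && echo "$dst"
}

# Usage: remember the backup path for a possible rollback.
# BACKUP=$(backup_file /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3)
```

If the aggregator writes to the database between the copy and the comparison, `cmp` reports a difference and no path is printed.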

And connect to the aggregator container:
```bash
docker exec -it mithril-aggregator bash
```

Once connected to the aggregator container, recompute the certificates hashes:
```bash
/app/bin/mithril-aggregator -vvv tools recompute-certificates-hash
```

Then disconnect from the aggregator container:
```bash
exit
```

## Restart the aggregator

Restart the aggregator to make sure that the certificate chain is valid:
```bash
docker restart mithril-aggregator
```

Make sure that the certificate chain is valid (wait for the state machine to go into the state `READY`):
```bash
docker logs -f --tail 1000 mithril-aggregator
```
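
Rather than watching the log stream by eye, the output can be filtered for the expected state. The sketch below assumes that the log line announcing the state contains the literal text `READY`; the exact log format is not specified here, so treat the pattern as an assumption to adjust:
```bash
# Read log lines from stdin and stop at the first one that mentions READY.
# Assumption: the aggregator logs the state name as the literal text "READY".
wait_for_ready() {
  grep -m 1 'READY'
}

# Usage (docker logs may write to stderr, hence 2>&1):
# docker logs -f --tail 1000 mithril-aggregator 2>&1 | wait_for_ready
```

With `-f`, the pipeline may only return once `docker logs` attempts its next write after the match.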

Then disconnect from the aggregator VM:
```bash
exit
```

## Rollback procedure

If the recomputation fails, you can roll back the database.

First, stop the aggregator:
```bash
docker stop mithril-aggregator
```

Then, restore the backed up database:
```bash
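# $(date +%Y-%m-%d) only matches a backup made today; otherwise use the date suffix of the backup file.
cp /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3.bak.$(date +%Y-%m-%d) /home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3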
```
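
When the rollback does not happen on the day of the backup, the date suffix has to be looked up. One possible way, sketched with the hypothetical helper `latest_backup`, relies on the fact that `YYYY-MM-DD` suffixes sort chronologically:
```bash
# Hypothetical helper: print the most recent <db>.bak.<YYYY-MM-DD> file that
# sits next to the database, or nothing if there is none.
latest_backup() {
  ls -1 "$1".bak.* 2>/dev/null | sort | tail -n 1
}

# Usage:
# DB=/home/curry/data/$CARDANO_NETWORK/mithril-aggregator/mithril/stores/aggregator.sqlite3
# cp "$(latest_backup "$DB")" "$DB"
```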

Then, start the aggregator:
```bash
docker start mithril-aggregator
```

Make sure that the certificate chain is valid (wait for the state machine to go into the state `READY`):
```bash
docker logs -f --tail 1000 mithril-aggregator
```

Then disconnect from the aggregator VM:
```bash
exit
```
@@ -9,7 +9,7 @@ query for that.
```sh
$> sqlite3 -table -batch \
$DATA_STORES_DIRECTORY/monitoring.sqlite3 \
-< mithril-aggregator/utils/monitoring/stake_signer_version.sql
+< stake_signer_version.sql
```

The variable `$DATA_STORES_DIRECTORY` should point to the directory where the
51 changes: 51 additions & 0 deletions docs/runbook/ssh-access/README.md
@@ -0,0 +1,51 @@
# Manage SSH access to infrastructure

## Add access to a user

### Create a SSH keypair for a user (if needed)

Create a new SSH keypair, with `ed25519` cryptography for maximum security:
```bash
ssh-keygen -t ed25519 -C "**USER_EMAIL**"
```

Then, add your keypair to the ssh-agent:
```bash
ssh-add ~/.ssh/id_ed25519
```

### Retrieve the public key of your SSH keypair

Run the following command to retrieve your public key:
```bash
cat ~/.ssh/id_ed25519.pub
```

### Declare the public key

Add a line with the format `**REMOTE_USER**:**PUBLIC_KEY**` in the `mithril-infra/assets/ssh_keys` file for each user:
```bash
echo "curry:ssh-ed25519 AAAE53AC3NzQ2vlZDI1aC1O4CpX+S2y1X9NTB4rv4k3pAAAAIF3b7L9sPV5ZiGgogmko **USER_EMAIL**" >> **REPOSITORY_PATH**/mithril-infra/assets/ssh_keys
```
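
To avoid copy and paste mistakes, the entry can be built directly from the public key file and checked before it is appended. This is a sketch under assumptions: the helper names are hypothetical, and the check only accepts `ssh-ed25519` keys, in line with the key type recommended above:
```bash
# Hypothetical helper: prefix the content of a public key file with the
# remote user name, in the REMOTE_USER:PUBLIC_KEY format.
ssh_keys_entry() {
  printf '%s:%s\n' "$1" "$(cat "$2")"
}

# Hypothetical check: user name, colon, key type, base64 key, optional comment.
is_valid_entry() {
  printf '%s\n' "$1" | grep -Eq '^[a-z_][a-z0-9_-]*:ssh-ed25519 [A-Za-z0-9+/=]+( .*)?$'
}

# Usage:
# entry=$(ssh_keys_entry curry ~/.ssh/id_ed25519.pub)
# is_valid_entry "$entry" && echo "$entry" >> **REPOSITORY_PATH**/mithril-infra/assets/ssh_keys
```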

Then, create a PR with the updated `ssh_keys` file.

## Remove access to a user

To remove access, delete the line(s) related to this user from the `mithril-infra/assets/ssh_keys` file.

Then, create a PR with the updated `ssh_keys` file.

## When are the modifications applied?

The modifications will be applied the next time the terraform deployment is done:
- next **merge** in `main` branch for `testing-preview`
- next **pre-release** created for `pre-release-preview`
- next **release** created for `release-preprod`
- next **release** created for `release-mainnet`

When the modifications are applied, the VM is updated in place by terraform.

:warning: In case of emergency, the SSH keys can be modified by an administrator:
- In GCP [**Compute Engine**](https://console.cloud.google.com/compute/instances)
- The SSH keys can be edited in the targeted VM(s)
25 changes: 25 additions & 0 deletions docs/runbook/terraform-lock/README.md
@@ -0,0 +1,25 @@
# Fix terraform deployment lock

## Introduction

When the CI cancels a job in the middle of a terraform deployment, the lock file that terraform uses under the hood to avoid concurrent deployments may not be removed. In that case, the next CI job that tries to deploy receives an error stating that a lock prevents the deployment from running.

## Find the workflow used to deploy a Mithril network

Currently, the following [Mithril networks](https://mithril.network/doc/manual/developer-docs/references#mithril-networks) are generally available, and deployed with `terraform`:
- `testing-preview`: with the workflow [`.github/workflows/ci.yml`](../../../.github/workflows/ci.yml)
- `pre-release-preview`: with the workflow [`.github/workflows/pre-release.yml`](../../../.github/workflows/pre-release.yml)
- `release-preprod`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)
- `release-mainnet`: with the workflow [`.github/workflows/release.yml`](../../../.github/workflows/release.yml)


## Identify the terraform backend bucket
In the workflow file, the `terraform_backend_bucket` value identifies the GCP bucket that terraform uses to store the state of the deployment.

## Reset the terraform lock

A user with administrator rights can simply remove the lock file:
- In GCP [**Cloud Storage**](https://console.cloud.google.com/storage/browser)
- In the terraform administration bucket that you have identified earlier, the file that needs to be removed is at path `**TERRAFORM_BACKEND_BUCKET**/terraform/mithril-**MITHRIL_NETWORK_IDENTIFIER**/.terraform.lock.hcl` (e.g. `mithril-terraform-prod/terraform/mithril-release-mainnet/.terraform.lock.hcl`)

:warning: Never delete or modify the `**TERRAFORM_BACKEND_BUCKET**/terraform/mithril-**MITHRIL_NETWORK_IDENTIFIER**/default.tfstate` file.
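
As an illustration, the lock file path can be assembled from the bucket name and the network identifier, which reduces the risk of targeting the state file by mistake. The helper `terraform_lock_path` is hypothetical, and the `gsutil` command in the usage comment is only a possible command-line equivalent of the console step, assuming `gsutil` is installed and authenticated with sufficient rights:
```bash
# Hypothetical helper: print the lock file path for a bucket ($1) and a
# Mithril network identifier ($2), following the path pattern given above.
terraform_lock_path() {
  echo "$1/terraform/mithril-$2/.terraform.lock.hcl"
}

# Usage:
# terraform_lock_path mithril-terraform-prod release-mainnet
# prints mithril-terraform-prod/terraform/mithril-release-mainnet/.terraform.lock.hcl
#
# Possible removal from the command line (double check the path first,
# and never target default.tfstate):
# gsutil rm "gs://$(terraform_lock_path mithril-terraform-prod release-mainnet)"
```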
