docs: GDE Argo disaster recovery runbook TDE-1281 (#814)
#### Motivation

Create a disaster recovery Runbook for restoring Argo Workflows with EKS
and RDS so that the cluster can be rebuilt easily.

Moved from linz/topo-aws-infrastructure#366 -
should address all the issues brought up there.

#### Checklist

- [ ] Tests updated
- [x] Docs updated
- [x] Issue linked in Title

Co-authored-by: Alice Fage <[email protected]>
l0b0 and amfage authored Oct 15, 2024
1 parent b851ea5 commit 1b8b0ab
Showing 1 changed file with 150 additions and 0 deletions.
150 changes: 150 additions & 0 deletions docs/infrastructure/gde-argo-runbook.md
@@ -0,0 +1,150 @@
# Disaster recovery runbook

**Warning:** If going through this process when the cluster is still functional, make sure to agree on a time for the deployment with the end users. Teardown and restore should not take more than 4 hours.

## Purpose

Rebuild the Argo Workflows cluster from scratch, restoring existing database contents.

## Prerequisites

1. [`node`](https://nodejs.org/)
2. [`helm`](https://helm.sh/docs/intro/install/)
3. [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/) - should be the same version as the EKS version of the original cluster. At time of writing, this is only available by looking for `KubernetesVersion.of('VERSION')` in the code (for example, `KubernetesVersion.of('1.30')`).
4. [`argo`](https://github.com/argoproj/argo-workflows/releases/) - should be the same version as the Argo Workflows Server version of the original cluster. At time of writing, this is only available by looking for `appVersion = 'vVERSION'` in the code (for example, `appVersion = 'v3.5.5'`).
5. You need to be able to log in using the following AWS accounts and roles to restore production:
   - LI Topo production account as admin
   - ODR access account as admin using the admin profile
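Since the pinned versions are only discoverable by reading the code, a small helper can pull them out for comparison with the locally installed tools. The following is a minimal sketch, not part of the original runbook: the function names are hypothetical, and it assumes the `KubernetesVersion.of('VERSION')` and `appVersion = 'vVERSION'` patterns appear on a single line each.

```bash
# Sketch: print the versions pinned in the infrastructure code.
# Both functions read source text on stdin and print one version per match.

# Extracts VERSION from KubernetesVersion.of('VERSION').
kubernetes_version_from_source() {
    sed -n "s/.*KubernetesVersion\.of('\([^']*\)').*/\1/p"
}

# Extracts VERSION from appVersion = 'vVERSION'.
argo_version_from_source() {
    sed -n "s/.*appVersion = 'v\([^']*\)'.*/\1/p"
}

# Hypothetical usage, from the root of whichever repository defines them:
#   git grep -h "KubernetesVersion.of(" | kubernetes_version_from_source
#   git grep -h "appVersion = " | argo_version_from_source
```

The results can then be compared by eye with the output of `kubectl version --client` and `argo version`.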

## Setup

We need to make sure we're starting from a sane repository state. Skip any steps you're _sure_ you don't need to do:

1. Clone the [Open Data Registry repo](https://github.com/linz/open-data-registry-cdk/): `git clone [email protected]:linz/open-data-registry-cdk.git`
2. Go into the Open Data Registry repo: `cd open-data-registry-cdk`.
3. Install dependencies: `npm install`.
4. Exit the Open Data Registry repo: `cd ..`.
5. Clone the [Topo AWS infrastructure repo](https://github.com/linz/topo-aws-infrastructure/): `git clone [email protected]:linz/topo-aws-infrastructure.git`
6. Clone [this repo](https://github.com/linz/topo-workflows/): `git clone [email protected]:linz/topo-workflows.git`
7. Go into the Topo workflows repo: `cd topo-workflows`
8. Clean the repository of any generated files: `git clean -d --force -x`
9. Reset any changes to files: `git reset --hard HEAD`
10. Check out the relevant commit: `git checkout ID`. `ID` could be `origin/master` or the commit used to deploy the old production cluster.
11. Install dependencies: `npm install`
12. Log into the LI Topo production account as admin

## [Teardown existing cluster](./destroy.md)

If any of the cluster infrastructure exists but is not functional, see the above link for how to tear it down completely.

## Update database version if necessary

1. Get the details of the most recent production database snapshot: `aws rds describe-db-snapshots --output json --query="sort_by(DBSnapshots[?contains(DBSnapshotIdentifier,'workflows-argodb')], &SnapshotCreateTime)[-1]"`
2. Compare `EngineVersion` from the above output to `PostgresEngineVersion.VER_` in the code.
3. If they differ, update `PostgresEngineVersion.VER_` in the code to match the snapshot `EngineVersion`.
4. If you changed the version, commit and push the change.
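The comparison in step 2 is easier if the snapshot version is written the same way as the constant in the code. The sketch below is an illustration only: the function name is hypothetical, it assumes the CDK constants follow the `VER_MAJOR_MINOR` naming (for example `VER_15_4`), and it does not check that such a constant exists in the installed CDK version.

```bash
# Sketch: turn an RDS EngineVersion such as "15.4" into the suffix style used
# by PostgresEngineVersion constants ("VER_15_4").
engine_version_to_cdk_constant() {
    printf 'VER_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

# Hypothetical usage, reusing the snapshot query from step 1 with
# ".EngineVersion" appended to select just that field:
#   snapshot_version="$(aws rds describe-db-snapshots --output=text \
#       --query="sort_by(DBSnapshots[?contains(DBSnapshotIdentifier,'workflows-argodb')], &SnapshotCreateTime)[-1].EngineVersion")"
#   engine_version_to_cdk_constant "$snapshot_version"
```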

## Deployment of new cluster

1. Set AWS Account ID for CDK: `export CDK_DEFAULT_ACCOUNT="$(aws sts get-caller-identity --query Account --output text)"`.
2. Deploy prod cluster using all the relevant roles as maintainers:

```
ci_role="$(aws iam list-roles --output=text --query="Roles[?starts_with(RoleName, 'CiTopoProd-CiRole')].Arn")"
admin_role="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AccountAdminRole"
workflow_maintainer_role="$(aws cloudformation describe-stacks --output=text --query="Stacks[].Outputs[].OutputValue" --stack-name=TopographicSharedResourcesProd)"
npx cdk deploy --context=maintainer-arns="${ci_role},${admin_role},${workflow_maintainer_role}" Workflows
```

3. Deploy Argo Workflows without archiving:

1. Connect AWS CLI to the new cluster: `aws eks update-kubeconfig --name=Workflows`.
2. Create the Argo Workflows configuration files: `npx cdk8s synth`.
3. Remove the `persistence` section of `dist/0005-argo-workflows.k8s.yaml` to disable workflow archiving to database. For example:

```patch
--- dist/0005-argo-workflows.k8s.yaml.orig
+++ dist/0005-argo-workflows.k8s.yaml
@@ -88,26 +88,6 @@
keyFormat: "{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}-{{workflow.name}}/{{pod.name}}"
region: ap-southeast-2
useSDKCreds: true
- persistence:
- [redacted]
- tableName: argo_workflows
workflowDefaults:
spec:
parallelism: 3
```

4. Apply the configuration files twice (may fail the first time due to [CRD async behaviour](initial.deployment.md#custom-resource-definitions)): `kubectl apply --filename=dist/`.

4. Create a temporary RDS database from the snapshot identified when finding the engine version above:
   1. Get details of the new cluster database: `aws rds describe-db-instances --query="DBInstances[?DBName=='argo'].{EndpointAddress: Endpoint.Address, DBSubnetGroupName: DBSubnetGroup.DBSubnetGroupName, VpcSecurityGroupIds: VpcSecurityGroups[].VpcSecurityGroupId}"`.
   2. Go to https://ap-southeast-2.console.aws.amazon.com/rds/home?region=ap-southeast-2#db-snapshot:engine=postgres;id=ID, replacing "ID" with the `DBSnapshotIdentifier`.
   3. Click on _Actions_ → _Restore snapshot_.
   4. Under _Availability and durability_: select _Single DB Instance_.
   5. Under _Settings_: set _DB instance identifier_ to "temp-argo-db".
   6. Under _Instance configuration_: select _Burstable classes_ and _db.t3.micro_.
   7. Under _Connectivity_ → _DB subnet group_: select the DB subnet group of the new cluster.
   8. Under _Connectivity_ → _Existing VPC security groups_: select the VPC security group of the new cluster.
   9. Click _Restore DB instance_.
   10. Wait for the temporary DB to get to the "Available" state.
5. Dump the temporary database to the new Argo database:

1. Submit a ["sleep" workflow](../../workflows/test/sleep.yml) to the new Argo Workflows installation to spin up a pod:
`argo submit --namespace=argo workflows/test/sleep.yml`. This will be used to connect to RDS to dump the database to a file.
2. Connect to the sleep pod (it can take a while for the pod to spin up, so you might have to retry the second command):

```
pod_name="$(kubectl --namespace=argo get pods --output=name | grep --only-matching 'test-sleep-.*')"
kubectl --namespace=argo exec --stdin --tty "$pod_name" -- /bin/bash
```

3. Install the PostgreSQL client:

```
apt update
apt install -y postgresql-client
```

4. Get the temporary database endpoint address: `aws rds describe-db-instances --query="DBInstances[?DBInstanceIdentifier=='temp-argo-db'].Endpoint.Address"`.
5. Dump the database from the temporary database, replacing ENDPOINT with the temp-argo-db endpoint address: `pg_dump --host=ENDPOINT --username=argo_user --dbname=argo > argodbdump`.
You will be prompted for a password; get it from the [AWS Systems Manager Parameter Store](https://ap-southeast-2.console.aws.amazon.com/systems-manager/parameters/%252Feks%252Fargo%252Fpostgres%252Fpassword/description?region=ap-southeast-2&tab=Table).
6. Load the database into the new Argo database, replacing ENDPOINT with the new cluster endpoint address:
`psql --host=ENDPOINT --username=argo_user --dbname=argo < argodbdump`.
You will be prompted for a password; get it from the [AWS Systems Manager Parameter Store](https://ap-southeast-2.console.aws.amazon.com/systems-manager/parameters/%252Feks%252Fargo%252Fpostgres%252Fpassword/description?region=ap-southeast-2&tab=Table).

6. Redeploy the cluster configuration files to enable the connection to the database and turn on workflow archiving:

1. Run `npx cdk8s synth` to recreate the `persistence` section in `dist/0005-argo-workflows.k8s.yaml`.
2. Redeploy the Argo config file: `kubectl replace --filename=dist/0005-argo-workflows.k8s.yaml`.
3. Restart the workflow controller and the server:

```
kubectl --namespace=argo rollout restart deployment argo-workflows-workflow-controller
kubectl --namespace=argo rollout restart deployment argo-workflows-server
```

7. Trigger deployment of Argo workflows. If you created a pull request [above](#update-database-version-if-necessary), merging it will trigger the job. Otherwise you have to trigger the [main workflow](https://github.com/linz/topo-workflows/actions/workflows/main.yml) manually.
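Two of the steps above may need a second attempt: applying the configuration files (the first `kubectl apply` can fail while the CRDs are still registering) and connecting to the sleep pod (which can take a while to start). A generic retry wrapper is one way to avoid rerunning them by hand. This is a minimal sketch; the `retry` function, the attempt counts and the delays are illustrative assumptions, not part of the original procedure.

```bash
# Sketch: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, waiting DELAY_SECONDS between attempts.
# Returns 1 if COMMAND still fails on the last allowed attempt.
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Hypothetical usage:
#   retry 2 5 kubectl apply --filename=dist/
#   retry 10 15 kubectl --namespace=argo exec --stdin --tty "$pod_name" -- /bin/bash
```

Note that a retried interactive `kubectl exec` also reruns if the shell inside the pod exits with a non-zero status, so for that step it may be simpler to retry manually.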

## Open Data Registry

**Warning:** This section is for _production only._ When developing in non-prod environments, skip to the next section.

1. Go to the Open Data Registry repo, where the ODR stack is deployed from: `cd ../open-data-registry-cdk`.
2. Copy the [LINZ Open Data Registry account](https://github.com/linz/topo-aws-infrastructure/blob/master/src/accounts/odr/README.md) CDK context declaration here: `cp ../topo-aws-infrastructure/src/accounts/odr/cdk.context.json .`.
3. Update the ARN of the role with a name starting with "Workflows-EksWorkflowsArgoRunnerServiceAccountRole" in `cdk.context.json` to the output of `aws iam list-roles --output=text --query="Roles[?contains(RoleName, 'Workflows-EksWorkflowsArgoRunnerServiceAccountRole')].RoleName"`.
4. Log into ODR access account as admin using the admin profile.
5. Deploy the ODR datasets stack: `npx cdk deploy Datasets`.
6. Commit the updated CDK context:
   1. `cp cdk.context.json ../topo-aws-infrastructure/src/accounts/odr/cdk.context.json`
   2. `cd ../topo-aws-infrastructure`
   3. Commit, push, and create a pull request for this branch.

## Finalise

1. Let the users know that Argo is once again available.
2. Tidy up:
   1. Delete the _temporary_ database in the AWS web console → RDS or with `aws rds delete-db-instance --db-instance-identifier=temp-argo-db --skip-final-snapshot`.
   2. Terminate the sleep workflow: `argo --namespace=argo stop "$(argo --namespace=argo list --output=name)"`.
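The stop command above passes every listed workflow name as a single argument, so it only works while the sleep workflow is the only one in the namespace. If other workflows have already been submitted, filtering the list first is safer. The sketch below assumes that sleep workflow names start with `test-sleep-` (the prefix used when locating the pod earlier) and that `argo list --output=name` prints one name per line; the function name is hypothetical.

```bash
# Sketch: keep only sleep workflow names from a list read on stdin.
# "|| true" keeps the exit status zero when nothing matches.
sleep_workflow_names() {
    grep '^test-sleep-' || true
}

# Hypothetical usage (xargs -r skips the stop command when the list is empty;
# -r is a GNU extension that some other xargs implementations also support):
#   argo --namespace=argo list --output=name \
#       | sleep_workflow_names \
#       | xargs -r argo --namespace=argo stop
```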
