This document covers the Disaster Recovery testing procedure for applications hosted on the Teacher Services AKS clusters, based on the scenarios detailed in the Disaster recovery document.
- Identified environment for the test, e.g. qa, staging or test
- Identified scenario(s) that are to be tested
- Repository workflows that should utilise existing DFE github-actions
- Deploy selected env
- Backup postgres database to Azure storage [required for scenario 1 above]
- Restore database from Azure storage [required for scenario 1 above]
- Restore database from point in time to new database server [required for scenario 2 above]
- Repo workflows to enable and disable the maintenance page.
- see https://github.com/DFE-Digital/teacher-services-cloud/blob/main/documentation/maintenance-page.md
- confirm workflows exist for the selected environment to be tested.
- An app URL that identifies the current Docker image SHA. This can be part of the healthcheck, e.g. https://github.com/sdglhm/okcomputer/blob/master/lib/ok_computer/built_in_checks/app_version_check.rb
- Identify the technical and non-technical stakeholders who will participate in the test, based on the Teacher services list
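As a rough sketch of how the app URL that reports the Docker image SHA could be used during a test, the snippet below compares the SHA found in a healthcheck response body with the SHA that was deployed. The response format (text containing a 7 to 40 character hex SHA) and the function names `extract_sha` and `sha_matches` are assumptions for illustration only; adapt them to whatever the service's healthcheck actually returns.

```python
import re
from typing import Optional

# Assumption: the healthcheck body contains the image SHA as a
# 7 to 40 character hexadecimal string (short or full git SHA).
SHA_PATTERN = re.compile(r"\b[0-9a-f]{7,40}\b")


def extract_sha(healthcheck_body: str) -> Optional[str]:
    """Return the first hex SHA found in the healthcheck body, or None."""
    match = SHA_PATTERN.search(healthcheck_body.lower())
    return match.group(0) if match else None


def sha_matches(healthcheck_body: str, expected_sha: str) -> bool:
    """True if the reported and expected SHAs agree on their common prefix."""
    reported = extract_sha(healthcheck_body)
    if reported is None:
        return False
    expected = expected_sha.lower()
    # Compare only the shared length so a short SHA can match a full one.
    length = min(len(reported), len(expected))
    return reported[:length] == expected[:length]
```

The response body could be fetched with `urllib.request.urlopen` against the service's healthcheck URL before and after the restore, to confirm the expected image is running.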
Copy the template DR testing document, which will be a record of the scenarios run, the time taken, and any issues.
Participants must have access to GitHub and the repositories.
Schedule a virtual meeting for the test to take place
- Teams or Slack
- invite the relevant stakeholders
Regularly provide updates on the service Slack channel to keep product owners abreast of developments.
See DR scenario 1.
Note that you must have a previously created backup on Azure storage before starting this step. If not, create one before continuing.
- Delete the existing postgres database
- manually delete via UI https://portal.azure.com/#browse/Microsoft.DBforPostgreSQL%2FflexibleServers
- Confirm it's deleted
- Check for and delete any postgres diagnostic settings remaining for the deleted instance in https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/diagnosticsLogs, as the later deploy to rebuild postgres will fail if any remain. For example, search using subscription s189-teacher-services-cloud-test and resource group s189t01-ittms-stg-pg and look for enabled Diagnostic settings.
Follow the disaster recovery instructions.
See DR scenario 2.
Make a note of the time this step is started, as the restore point must be earlier than the moment you delete any data.
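One way to note the start time unambiguously is to record it as an ISO 8601 timestamp. The sketch below is illustrative only: the helper names `utc_timestamp` and `is_valid_restore_point` are hypothetical, and recording in UTC is an assumption, so check which timezone the restore workflow or portal expects before using the value.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current time as a UTC ISO 8601 string, to the second."""
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat()


def is_valid_restore_point(restore_point: str, deletion_started: str) -> bool:
    """True if the restore point is strictly earlier than the start of deletion."""
    return datetime.fromisoformat(restore_point) < datetime.fromisoformat(deletion_started)


if __name__ == "__main__":
    # Record this value in the DR testing document before deleting any data.
    print("Restore point candidate:", utc_timestamp())
```

Calling `is_valid_restore_point` with the noted time and the time the table was deleted gives a quick sanity check that the chosen restore point precedes the data loss.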
- Delete a table manually
- connect via konduit and delete the table
- it must be possible to confirm the data has been deleted, either within the app, by error messages being logged, by the app crashing, or by users observing inconsistent content.
Follow the disaster recovery instructions.
- Complete the DR testing document and save in the DR test Reports folder
- Update the service on the infra team SharePoint service list with the DR date and status (success/fail)
- Review the completed DR test, and raise Trello cards for any process improvements.
- Review the contact list in the Teacher services list