Technical Specification

Build a backup service to handle compression, encryption, monitoring, offsite storage, and rotation.

This central service will be monitored like any other Dropwizard service, and it will also provide health checks that track the time elapsed since the last backup of each service. It will provide an easy-to-use API for receiving backups over HTTP.

Backups will be generated using simple scripts, scheduled with crontab. Each script starts by calling the start backup API and receiving a job ID. This allows us to mark a backup as in progress while the script creates the tar (or whatever it is creating). After sending the file(s) to the backup API, the script calls the backup finish API with any output generated during the run and a success or failure code.
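
The start, upload, finish sequence described above can be sketched as follows. This is a minimal illustration, not the service's client: the BackupApi interface, its method names, and run_backup are all hypothetical, since this document does not give the exact endpoints for the start and finish calls. The point it shows is that the finish call is made on both success and failure, so a job is never left marked as in progress.

```python
from typing import Callable, List, Protocol


class BackupApi(Protocol):
    """Hypothetical client interface; the real endpoints are not specified here."""

    def start(self, service: str) -> str: ...

    def upload(self, service: str, job_id: str, path: str) -> None: ...

    def finish(self, service: str, job_id: str, success: bool, log: str) -> None: ...


def run_backup(api: BackupApi, service: str, create_archive: Callable[[], str]) -> bool:
    """Start a job, create and upload the archive, then always report the outcome."""
    job_id = api.start(service)  # the backup is now marked as in progress
    log_lines: List[str] = []
    success = False
    try:
        path = create_archive()  # e.g. build a tar; this step is service specific
        log_lines.append(f"created {path}")
        api.upload(service, job_id, path)
        log_lines.append("upload complete")
        success = True
    except Exception as exc:  # report the failure instead of leaving the job open
        log_lines.append(f"error: {exc}")
    api.finish(service, job_id, success, "\n".join(log_lines))
    return success
```

A crontab entry would then invoke a small script that builds a concrete client (for example one that wraps HTTP calls) and passes it to run_backup together with the function that creates the archive.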

Backups will be verified on a schedule (perhaps weekly) using a separate set of scripts, also scheduled with crontab. These scripts will call the latest backup API to receive the backed-up file, and then test a restore. How a restore is tested is domain-specific and depends on the service being backed up.

Goals, Metrics for success

Backups now meet our retention policy.
We have backups for all services we should have.
Backups are all verified as restorable.
Backups are monitored, and an alert is raised if a backup has not run or has failed.
Backups are geo-redundant.
Setting up backup of a new service is trivial, and self-serve.

Requirements

Compression: Gzip, Snappy
- Both Gzip and Snappy compression are supported, and the codec can be set in the configuration. The current compression codec implementation uses Snappy.
Encryption: AES
- The current encryption codec implementation uses 128-bit AES with the AES/CBC/PKCS5Padding transformation.
Retention:
- 7 daily, 4 weekly, 6 monthly.
- This is achieved using a scheduled background cleaner task that removes out of policy backups.
Offsite Storage: Local and offsite.
- Azure
Monitoring: Dropwizard style with Health checks.
Verification: Via external scripts.
Authentication:
- Services authenticate with an Authorization header using the Token scheme, for example Authorization: Token abcd1234. The token used in the first POST /api/backup/{service} for a given service is registered for that service. Every later request to any authenticated API endpoint for that service will be unauthorised if it presents a different token.
- Users authenticate through the dropwizard-auth-ldap authenticator. This authentication is only used when attempting to download a backup through the UI (at /download/{service}/{id}/{filename}). When an authenticated user attempts to download a backup, we generate a temporary token and redirect to the /api/backup/{service}/{id}/{filename} endpoint. This allows the user to generate a download URL for a backup that is valid for 1 hour, without exposing their LDAP credentials.
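
The retention rule listed under Requirements (7 daily, 4 weekly, 6 monthly) can be sketched as a selection over backup timestamps. This is one possible interpretation, not the cleaner task's actual logic: it assumes that the newest backup is kept for each of the 7 most recent days, 4 most recent ISO weeks, and 6 most recent months that contain a backup. The function names are hypothetical.

```python
from datetime import datetime
from typing import Callable, Hashable, Iterable, Set


def _newest_per_bucket(
    times: Iterable[datetime], key: Callable[[datetime], Hashable], limit: int
) -> Set[datetime]:
    """Keep the newest timestamp in each of the `limit` most recent buckets."""
    kept = {}
    for t in sorted(times, reverse=True):  # newest first
        bucket = key(t)
        if bucket not in kept and len(kept) < limit:
            kept[bucket] = t
    return set(kept.values())


def backups_to_keep(
    times: Iterable[datetime], daily: int = 7, weekly: int = 4, monthly: int = 6
) -> Set[datetime]:
    """Return the timestamps that an assumed 7/4/6 policy would retain."""
    times = list(times)
    keep = _newest_per_bucket(times, lambda t: t.date(), daily)
    keep |= _newest_per_bucket(times, lambda t: t.isocalendar()[:2], weekly)
    keep |= _newest_per_bucket(times, lambda t: (t.year, t.month), monthly)
    return keep


def backups_to_delete(times: Iterable[datetime]) -> Set[datetime]:
    """Everything outside the policy; this is what a cleaner task would remove."""
    times = set(times)
    return times - backups_to_keep(times)
```

Because the buckets are counted from the newest existing backup rather than from the current date, a service that stops producing backups does not have its last good backups cleaned away under this interpretation.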

Issues

Redeploy will kill any running backups.
Nodes are stateful because they use local storage. Requests for backups must be directed to the correct node manually.

Security

Downloading backups requires LDAP authentication.
Services require an API key to perform/access backups.
Transport encryption - SSL termination is performed within Dropwizard itself.
File integrity - Support Content-MD5 header for verification of received and stored files.
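
As an illustration of the file integrity check, the sketch below computes a Content-MD5 value for a stream and compares it with a received header. The Content-MD5 header carries the base64 encoding of the binary MD5 digest (RFC 1864). The function names are hypothetical, and reading in chunks is an assumption made here because backup files can be large.

```python
import base64
import hashlib
import io
from typing import BinaryIO


def content_md5(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the base64-encoded MD5 digest of a stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_content_md5(stream: BinaryIO, header_value: str) -> bool:
    """Check a stream against the value of a Content-MD5 header."""
    return content_md5(stream) == header_value.strip()


if __name__ == "__main__":
    # Compute the header value a client would send for a small payload.
    payload = io.BytesIO(b"example backup bytes")
    print(content_md5(payload))
```

The same function can be run on the client before upload and on the server after the file is stored, so both the received and the stored copies are checked against one value.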

Future work

If a backup fails, the entire script must restart; we have no ability to resume from a specific chunk. The same is true for verifications.
Bandwidth throttling to Azure
Let the service decide when a backup or verification is needed by having the client backup scripts run periodically (e.g. once every 5 minutes) and poll the server at /api/backup/{service}/needed or similar.
Let the service decide which backup node the client should push data to, based on available resources.
