# Sync

Moves data to sync storage and optionally merges it with the encryption header.

## Configuration

There are a number of options that can be set for the sync service.
These settings can be set by mounting a yaml file at `/config.yaml`.

Example:

```yaml
log:
  level: "debug"
  format: "json"
```

They may also be set using environment variables like:

```bash
export LOG_LEVEL="debug"
export LOG_FORMAT="json"
```

### Keyfile settings

These settings control which crypt4gh keyfile is loaded.

- `C4GH_FILEPATH`: path to the crypt4gh keyfile
- `C4GH_PASSPHRASE`: pass phrase to unlock the keyfile
- `C4GH_SYNCPUBKEY`: path to the crypt4gh public key to use for reencrypting file headers.
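
A minimal sketch of the corresponding `config.yaml` section, assuming the yaml keys mirror the environment variable names the same way `LOG_LEVEL` maps to `log.level` above (paths and passphrase are placeholders):

```yaml
c4gh:
  # private key used to decrypt the stored file headers
  filepath: "/keys/sync.sec.pem"
  passphrase: "change-me"
  # public key the headers are reencrypted for
  syncpubkey: "/keys/recipient.pub.pem"
```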

### RabbitMQ broker settings

These settings control how sync connects to the RabbitMQ message broker.

- `BROKER_HOST`: hostname of the rabbitmq server
- `BROKER_PORT`: rabbitmq broker port (commonly `5671` with TLS and `5672` without)
- `BROKER_QUEUE`: message queue or stream to read messages from (commonly `completed_stream`)
- `BROKER_USER`: username to connect to rabbitmq
- `BROKER_PASSWORD`: password to connect to rabbitmq
- `BROKER_PREFETCHCOUNT`: number of messages to pull from the message server at a time (defaults to `2`)
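
Under the same assumed environment-variable-to-yaml mapping, a sketch of the broker settings (host and credentials are placeholders):

```yaml
broker:
  host: "rabbitmq.example.com"
  port: 5671  # common TLS port
  queue: "completed_stream"
  user: "sync"
  password: "change-me"
  prefetchcount: 2
```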

### PostgreSQL Database settings

- `DB_HOST`: hostname for the postgresql database
- `DB_PORT`: database port (commonly 5432)
- `DB_USER`: username for the database
- `DB_PASSWORD`: password for the database
- `DB_DATABASE`: database name
- `DB_SSLMODE`: The TLS encryption policy to use for database connections. Valid options are:
  - `disable`
  - `allow`
  - `prefer`
  - `require`
  - `verify-ca`
  - `verify-full`

More information is available [in the postgresql documentation](https://www.postgresql.org/docs/current/libpq-ssl.html#LIBPQ-SSL-PROTECTION).

Note that if `DB_SSLMODE` is set to anything but `disable`, then `DB_CACERT` needs to be set, and if set to `verify-full`, then `DB_CLIENTCERT` and `DB_CLIENTKEY` must also be set.

- `DB_CLIENTKEY`: key-file for the database client certificate
- `DB_CLIENTCERT`: database client certificate file
- `DB_CACERT`: Certificate Authority (CA) certificate for the database to use
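
A sketch of the database settings under the same assumed mapping, here with full certificate verification (host, credentials, and certificate paths are placeholders):

```yaml
db:
  host: "postgres.example.com"
  port: 5432
  user: "sync"
  password: "change-me"
  database: "sda"
  sslmode: "verify-full"
  # required because sslmode is not "disable"
  cacert: "/certs/ca.pem"
  # required because sslmode is "verify-full"
  clientcert: "/certs/client.pem"
  clientkey: "/certs/client-key.pem"
```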

### Storage settings

Storage backend is defined by the `ARCHIVE_TYPE` and `SYNC_DESTINATION_TYPE` variables.
Valid values for these options are `S3` or `POSIX` for `ARCHIVE_TYPE` and `POSIX`, `S3` or `SFTP` for `SYNC_DESTINATION_TYPE`.

The value of these variables defines what other variables are read.
The same variables are available for all storage types, differing only by prefix (`ARCHIVE_` or `SYNC_DESTINATION_`); a combined example is sketched after the lists below.

if `*_TYPE` is `S3` then the following variables are available:

- `*_URL`: URL to the S3 system
- `*_ACCESSKEY`: The S3 access and secret key are used to authenticate to S3, [more info at AWS](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys)
- `*_SECRETKEY`: The S3 access and secret key are used to authenticate to S3, [more info at AWS](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys)
- `*_BUCKET`: The S3 bucket to use as the storage root
- `*_PORT`: S3 connection port (default: `443`)
- `*_REGION`: S3 region (default: `us-east-1`)
- `*_CHUNKSIZE`: S3 chunk size for multipart uploads.
- `*_CACERT`: Certificate Authority (CA) certificate for the storage system; only needed if the S3 server has a certificate signed by a private entity

if `*_TYPE` is `POSIX`:

- `*_LOCATION`: POSIX path to use as storage root

and if `*_TYPE` is `SFTP`:

- `*_HOST`: URL to the SFTP server
- `*_PORT`: Port of the SFTP server to connect to
- `*_USERNAME`: username used when connecting to the SFTP server
- `*_HOSTKEY`: The SFTP server's public key
- `*_PEMKEYPATH`: Path to the ssh private key used to connect to the SFTP server
- `*_PEMKEYPASS`: Passphrase for the ssh private key
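
The sketch referenced above: a combined storage configuration with an S3 archive and an SFTP sync destination, still assuming the environment variables map onto yaml sections (here `archive` and, as a further assumption, `sync_destination`). Every endpoint, credential, and path is a placeholder:

```yaml
archive:
  type: "S3"
  url: "https://s3.example.com"
  port: 443
  accesskey: "change-me"
  secretkey: "change-me"
  bucket: "archive"
  region: "us-east-1"

sync_destination:
  type: "SFTP"
  host: "sftp.example.com"
  port: 22
  username: "sync"
  # the server's public host key, used to verify the connection
  hostkey: "ssh-ed25519 AAAA..."
  pemkeypath: "/keys/sync-sftp.pem"
  pemkeypass: "change-me"
```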

### Logging settings

- `LOG_FORMAT` can be set to `json` to get logs in json format. All other values result in text logging.
- `LOG_LEVEL` can be set to one of the following, in increasing order of severity:
  - `trace`
  - `debug`
  - `info`
  - `warn` (or `warning`)
  - `error`
  - `fatal`
  - `panic`

## Service Description

The sync service copies files from the archive storage to sync storage.

When running, sync reads messages from the RabbitMQ queue or stream configured via `BROKER_QUEUE` (commonly `completed_stream`).
For each message, these steps are taken (if not otherwise noted, errors halt progress, the message is Nack'ed, and the service moves on to the next message):

1. The message is validated as valid JSON that matches the "ingestion-completion" schema. If the message can’t be validated it is sent to the error queue for later analysis. (A sketch of such a message follows this list.)
2. The archive file path and file size are fetched from the database.
3. The file size on disk is requested from the storage system.
4. The archive file size from the database is compared against the disk file size.
5. A file reader is created for the archive storage file, and a file writer is created for the sync storage file.
   1. The header is read from the database.
   2. The header is decrypted.
   3. The header is reencrypted.
   4. The header is written to the sync file writer.
6. The file data is copied from the archive file reader to the sync file writer.
7. The message is Ack'ed.
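
For orientation, a plausible shape of an "ingestion-completion" message, shown as yaml for readability (the broker delivers JSON). The authoritative field set is defined by the schema in the repository, so treat the field names and example values below as assumptions:

```yaml
# hypothetical "ingestion-completion" payload; field names are assumptions,
# the schema in the repository is authoritative
user: "user@example.com"
filepath: "user/file.c4gh"
accession_id: "EGAF00000000001"
decrypted_checksums:
  - type: "sha256"
    value: "82e4..."  # truncated placeholder checksum
  - type: "md5"
    value: "7ac6..."  # truncated placeholder checksum
```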

## Communication

- Sync reads messages from one rabbitmq stream (`completed_stream`)
- Sync reads file information and headers from the database and cannot be started without a database connection. This is done using the `GetArchived` and `GetHeaderForStableID` functions.
- Sync reads data from archive storage and writes data to sync destination storage.
