Design doc for AWS S3 sync to MIT Engaging HPC #2070
File: `doc/design/s3-engaging-backup.md` (52 additions)
# S3 Backup to MIT Engaging

## Proposed Solutions For Initial Backup from S3 to Server
Use s5cmd or Globus to perform a full sync from S3 to the storage server.
### s5cmd
[WIP]
### Globus
[WIP]

## Proposed Solutions for Incremental Sync from S3 to Server

### s5cmd sync
`s5cmd sync [source] [destination]`

**How the `sync` Command Works:**
The `sync` command synchronizes directories, files, and S3 buckets or prefixes, and also works between two S3 prefixes. It compares files between source and destination, treating the source as the source of truth, and performs a one-way synchronization from source to destination: it never modifies source files and never deletes destination files unless the `--delete` flag is passed.

It *only* copies files that:
- Do not exist in the destination, or
- Differ according to the sync strategy:
  - Default: `s5cmd` compares both file size *and* modification time, treating the source as the source of truth. Any difference in size or modification time causes the source object to be copied to the destination.
  - Size only: with the `--size-only` flag, only file sizes are compared; a size difference causes the source object to be copied to the destination.
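The two comparison strategies above can be sketched in Python. This is a simplified local-filesystem model of the decision logic, not s5cmd's actual implementation (which compares local files against S3 object metadata):

```python
import os

def needs_copy(src_path, dst_path, size_only=False):
    """Decide whether an s5cmd-style sync would copy src over dst.

    Simplified sketch: default strategy compares size *and* mtime;
    --size-only compares size alone. Source is the source of truth.
    """
    if not os.path.exists(dst_path):
        return True  # missing at destination -> always copy
    src, dst = os.stat(src_path), os.stat(dst_path)
    if src.st_size != dst.st_size:
        return True  # size differs -> copy under either strategy
    if size_only:
        return False  # sizes match; --size-only stops here
    # default strategy: any mtime difference also triggers a copy
    return int(src.st_mtime) != int(dst.st_mtime)
```

Note that under `--size-only`, a file whose content changed without changing its size would be skipped, which is the trade-off the flag makes for fewer metadata comparisons.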


**Scheduling the Sync**
The S3-to-Engaging sync can be automated with a cron job on the server that runs the `s5cmd sync` command at regular intervals (e.g., daily or hourly).
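As a sketch, a crontab entry for a nightly sync might look like the following. The binary path, bucket name, destination directory, and log path are all illustrative assumptions, not values fixed by this design:

```shell
# m h dom mon dow  command
# Run the S3 -> Engaging sync at 02:00 daily and append output to a log.
# Bucket, destination, and log paths below are hypothetical examples.
0 2 * * * /usr/local/bin/s5cmd sync 's3://dandi-bucket/*' /engaging/storage/dandi/ >> /var/log/s3-engaging-sync.log 2>&1
```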


### Globus
[WIP]

## Proposed Solution for Incremental Sync from Server to S3
### `rsync` with Mirror Directory Before `s5cmd sync`
```
rsync -av --delete /data/ /backup/s3mirror/
s5cmd sync /backup/s3mirror/ s3://dandi-bucket/
```
**Pros and Cons of `rsync`**
- Pros:
  - Efficient local comparison: `rsync` is highly optimized for local file comparisons, so it can quickly determine which files in `/data` are new or modified and update only those in `/backup/s3mirror`. This makes the upload step more lightweight, since the mirror already contains only files that need to be checked by `s5cmd`.
  - Reduced S3 API calls: since `s5cmd` only syncs the files in the mirror, it makes fewer API calls and checks against S3, which saves cost and bandwidth when backups run frequently.
  - Works well for large datasets: for extremely large directories, this approach minimizes the set of files `s5cmd` needs to process, because `rsync` has already filtered out unchanged files locally.

- Cons:
  - Additional complexity: an extra directory (`/backup/s3mirror`) and an extra `rsync` step must be managed in the workflow.
  - Extra storage required: the server needs enough storage to hold the mirror directory, which duplicates the contents of `/data`.
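The two-step command sequence above could be wrapped in a small script so cron only needs to invoke one entry point. A minimal sketch, with the paths and bucket name as illustrative defaults rather than decided values:

```shell
#!/bin/sh
# Sketch of a wrapper for the mirror-then-sync backup.
# Default paths and bucket name are hypothetical examples.
set -eu

run_backup() {
    src=${1:-/data/}
    mirror=${2:-/backup/s3mirror/}
    bucket=${3:-s3://dandi-bucket/}

    # Step 1: bring the local mirror up to date; --delete removes
    # files from the mirror that no longer exist in the source.
    rsync -av --delete "$src" "$mirror"

    # Step 2: push only new/changed files from the mirror to S3.
    s5cmd sync "$mirror" "$bucket"
}
```

Running the steps under `set -eu` means a failed `rsync` aborts before `s5cmd` pushes a partially updated mirror to S3.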

### Globus
[WIP]

## Version Tracking on S3
[WIP]
> **Review comment (project member):** A tool providing an efficient solution is being implemented by @jwodder in https://github.com/dandi/s3invsync. IMHO it should be the first candidate to try.