From 307eef88b683bf4ccb56567b99c2e12ac6a28047 Mon Sep 17 00:00:00 2001
From: Puja Trivedi
Date: Mon, 4 Nov 2024 09:38:29 -0800
Subject: [PATCH 1/3] initial commit

---
 doc/design/s3-engaging-backup.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 doc/design/s3-engaging-backup.md

diff --git a/doc/design/s3-engaging-backup.md b/doc/design/s3-engaging-backup.md
new file mode 100644
index 000000000..d7ba1707d
--- /dev/null
+++ b/doc/design/s3-engaging-backup.md
@@ -0,0 +1,28 @@
+# S3 Backup to MIT Engaging
+
+## Proposed Solution For Initial Backup from S3 to Server
+Use s5cmd or Globus to perform a full sync from S3 to the storage server.
+### s5cmd
+
+### Globus
+
+
+
+## Proposed Solution for Incremental Sync from S3 to Server
+
+### s5cmd sync
+**How the `sync` Command Works:**
+The `sync` command synchronizes buckets, prefixes, directories, and files between S3 and the local filesystem, or between S3 buckets and prefixes. It compares files between source and destination, treating the source as the source of truth. The synchronization is one-way, from source to destination: it never modifies source files and never deletes destination files unless the `--delete` flag is passed.
+
+It *only* copies files that:
+- Do not exist in the destination
+- Exist in the destination but differ from the source under the chosen sync strategy:
+  - Default: By default, `s5cmd` compares both the size *and* the modification time of each file, treating source files as the source of truth. Any difference in size or modification time causes `s5cmd` to copy the source object to the destination.
+  - Size only: With the `--size-only` flag, only file sizes are compared. The source is still treated as the source of truth, and any difference in size causes `s5cmd` to copy the source object to the destination.
+
+
+**Scheduling the Sync**
+The S3-to-Engaging sync can be automated with a cron job on the server that runs the `s5cmd sync` command at regular intervals (e.g., daily or hourly).
+
+**Limitations**
+`s5cmd` does not natively support bidirectional sync. However, it is very efficient for one-way sync from S3 to the server.
\ No newline at end of file

From a1205660a7a48c5ff6a148626d30ede8138cd625 Mon Sep 17 00:00:00 2001
From: Puja Trivedi
Date: Mon, 4 Nov 2024 10:13:52 -0800
Subject: [PATCH 2/3] created outline for doc, need to fill in remaining information

---
 doc/design/s3-engaging-backup.md | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/doc/design/s3-engaging-backup.md b/doc/design/s3-engaging-backup.md
index d7ba1707d..960805b01 100644
--- a/doc/design/s3-engaging-backup.md
+++ b/doc/design/s3-engaging-backup.md
@@ -1,16 +1,15 @@
 # S3 Backup to MIT Engaging
 
-## Proposed Solution For Initial Backup from S3 to Server
+## Proposed Solutions For Initial Backup from S3 to Server
 Use s5cmd or Globus to perform a full sync from S3 to the storage server.
 ### s5cmd
 
 ### Globus
+[WIP]
 
+## Proposed Solutions for Incremental Sync from S3 to Server
 
-
-## Proposed Solution for Incremental Sync from S3 to Server
-
-### s5cmd sync
+### s5cmd sync `s5cmd sync [source] [destination]
 **How the `sync` Command Works:**
 The `sync` command synchronizes buckets, prefixes, directories, and files between S3 and the local filesystem, or between S3 buckets and prefixes. It compares files between source and destination, treating the source as the source of truth. The synchronization is one-way, from source to destination: it never modifies source files and never deletes destination files unless the `--delete` flag is passed.
 
@@ -24,5 +23,15 @@ It *only* copies files that:
 **Scheduling the Sync**
 The S3-to-Engaging sync can be automated with a cron job on the server that runs the `s5cmd sync` command at regular intervals (e.g., daily or hourly).
 
-**Limitations**
-`s5cmd` does not natively support bidirectional sync. However, it is very efficient for one-way sync from S3 to the server.
\ No newline at end of file
+
+## Globus
+[WIP]
+
+## Proposed Solution for Tracking Changes on the Server
+[WIP]
+
+## Proposed Solution for Incremental Sync from Server to S3
+[WIP]
+
+## Version Tracking on S3
+[WIP]
\ No newline at end of file

From 78acf4ce8a851447344cdf48eda5755eea02af3a Mon Sep 17 00:00:00 2001
From: Puja Trivedi
Date: Mon, 4 Nov 2024 10:54:54 -0800
Subject: [PATCH 3/3] added info for s5cmd. next step: globus

---
 doc/design/s3-engaging-backup.md | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/doc/design/s3-engaging-backup.md b/doc/design/s3-engaging-backup.md
index 960805b01..73c53ce9c 100644
--- a/doc/design/s3-engaging-backup.md
+++ b/doc/design/s3-engaging-backup.md
@@ -9,7 +9,8 @@ Use s5cmd or Globus to perform a full sync from S3 to the storage server.
 
 ## Proposed Solutions for Incremental Sync from S3 to Server
 
-### s5cmd sync `s5cmd sync [source] [destination]
+### s5cmd sync
+`s5cmd sync [source] [destination]`
 **How the `sync` Command Works:**
 The `sync` command synchronizes buckets, prefixes, directories, and files between S3 and the local filesystem, or between S3 buckets and prefixes. It compares files between source and destination, treating the source as the source of truth. The synchronization is one-way, from source to destination: it never modifies source files and never deletes destination files unless the `--delete` flag is passed.
 
@@ -24,13 +25,27 @@ It *only* copies files that:
 The S3-to-Engaging sync can be automated with a cron job on the server that runs the `s5cmd sync` command at regular intervals (e.g., daily or hourly).
 
 
-## Globus
-[WIP]
-
-## Proposed Solution for Tracking Changes on the Server
+### Globus
 [WIP]
 
 ## Proposed Solution for Incremental Sync from Server to S3
+### `rsync` with Mirror Directory Before s5cmd sync
+```
+rsync -av --delete /data/ /backup/s3mirror/
+s5cmd sync /backup/s3mirror/ s3://dandi-bucket/
+```
+**Pros and Cons of `rsync`**
+- Pros:
+
+  - Efficient Local Comparison with rsync: `rsync` is highly optimized for local file comparisons, so it can quickly check which files in `/data/` are new or modified and update only those in `/backup/s3mirror/`. This makes the sync step more lightweight, as `/backup/s3mirror/` will already contain only the files that need to be checked by `s5cmd`.
+  - Reduced S3 API Calls: Since `s5cmd` syncs only the files in `/backup/s3mirror/`, it makes fewer API calls and checks against S3, which can save on cost and bandwidth when backups run frequently.
+  - Works Well for Large Datasets: For extremely large directories, this approach minimizes the set of files `s5cmd` needs to process, since `rsync` has already filtered out unchanged files locally.
+
+- Cons:
+  - Additional Complexity: The workflow gains an extra directory (`/backup/s3mirror/`) to manage and an extra `rsync` step.
+  - Extra Storage Required: The server needs enough storage to hold `/backup/s3mirror/`, which mirrors the contents of `/data/`.
+
+### Globus
 [WIP]
 
 ## Version Tracking on S3