Design doc for AWS S3 sync to MIT Engaging HPC #2070
Draft: puja-trivedi wants to merge 3 commits into dandi:master from puja-trivedi:data_sync_doc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
# S3 Backup to MIT Engaging | ||
|
||
## Proposed Solutions For Initial Backup from S3 to Server | ||
Use s5cmd or Globus to perform a full sync from S3 to the storage server. | ||
### s5cmd | ||
|
||
### Globus | ||
[WIP] | ||
|
||
## Proposed Solutions for Incremental Sync from S3 to Server | ||
|
||
### s5cmd sync <br> | ||
`s5cmd sync [source] [destination]` | ||
**How the `sync` Command Works:** <br> | ||
The `sync` command synchronizes S3 buckets, prefixes, directories, and files, including between two S3 buckets or prefixes. It compares files between source and destination, treating the source as the source of truth. The synchronization is one-way, from source to destination: it never modifies source files, and it does not delete destination files unless the `--delete` flag is passed. | ||
|
||
It *only* copies files that: | ||
- Do not exist in the destination | ||
- The sync strategy allows: | ||
- Default: `s5cmd` compares both the size *and* the modification time of each file, treating the source as the source of truth. Any difference in size or modification time causes `s5cmd` to copy the source object to the destination. | ||
- Size only: With the `--size-only` flag, `s5cmd` compares file sizes only. The source is still treated as the source of truth, and any difference in size causes `s5cmd` to copy the source object to the destination. | ||
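The copy rule described above can be illustrated on two local files. This is a minimal sketch of the rule as stated in this document, not `s5cmd`'s actual implementation; the function name `needs_copy` and its `size-only` argument are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of s5cmd): returns 0 when, under the rule
# described above, the source file would be copied to the destination.
# Usage: needs_copy SRC DST [default|size-only]
needs_copy() {
  local src="$1" dst="$2" mode="${3:-default}"

  # Files missing from the destination are always copied.
  [ -e "$dst" ] || return 0

  # Both strategies copy when the sizes differ.
  local src_size dst_size
  src_size=$(wc -c < "$src" | tr -d ' ')
  dst_size=$(wc -c < "$dst" | tr -d ' ')
  [ "$src_size" -eq "$dst_size" ] || return 0

  # The size-only strategy stops here: same size means no copy.
  [ "$mode" = "size-only" ] && return 1

  # Default strategy as described: any modification-time difference
  # (source newer or older than destination) also triggers a copy.
  if [ "$src" -nt "$dst" ] || [ "$src" -ot "$dst" ]; then
    return 0
  fi
  return 1
}
```

For example, `needs_copy /data/a.nwb /backup/a.nwb size-only && echo copy` prints `copy` only when the destination file is missing or has a different size (both paths are placeholders).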
|
||
|
||
**Scheduling the Sync** <br> | ||
The S3-to-Engaging sync can be automated with a cron job on the server that runs the `s5cmd sync` command at regular intervals (e.g., daily or hourly). | ||
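A cron-driven wrapper might look like the sketch below. Everything specific in it is an assumption made for illustration: the destination path, the lock directory, the script and log locations in the crontab comment, and the bucket name (borrowed from the `s3://dandi-bucket/` example used elsewhere in this document). The lock guards against a slow sync overlapping with the next scheduled run.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper for a scheduled sync. All values are placeholders.
S5CMD="${S5CMD:-s5cmd}"                            # binary to invoke
SRC="${SRC:-s3://dandi-bucket/*}"                  # assumed bucket name
DEST="${DEST:-/data/}"                             # assumed destination on the server
LOCK_DIR="${LOCK_DIR:-/tmp/s3-engaging-sync.lock}" # assumed lock location

run_sync() {
  # mkdir is atomic, so it serves as a simple lock: if the directory already
  # exists, a previous run is assumed to be in progress and this run is skipped.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "previous sync still running, skipping" >&2
    return 0
  fi

  local status=0
  "$S5CMD" sync "$SRC" "$DEST" || status=$?

  # Release the lock and report the exit status of the sync.
  rmdir "$LOCK_DIR"
  return "$status"
}

# Example crontab entry (daily at 02:00), assuming this file is saved as
# /usr/local/bin/s3_sync.sh and ends with a call to run_sync:
#   0 2 * * * /usr/local/bin/s3_sync.sh >> /var/log/s3_sync.log 2>&1
```

Changing the schedule field to `0 * * * *` would give the hourly variant mentioned above.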
|
||
|
||
### Globus | ||
[WIP] | ||
|
||
## Proposed Solution for Incremental Sync from Server to S3 | ||
### `rsync` with Mirror Directory Before s5cmd sync <br> | ||
``` | ||
rsync -av --delete /data/ /backup/s3mirror/ | ||
s5cmd sync /backup/s3mirror/ s3://dandi-bucket/ | ||
``` | ||
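The two commands above could be chained so that the upload runs only when the local mirror step succeeds. The following is a sketch under that assumption; the function name `mirror_then_upload` and the overridable variables are hypothetical, while the paths and bucket name are the ones from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: refresh the local mirror, then upload it, stopping on failure.
RSYNC="${RSYNC:-rsync}"
S5CMD="${S5CMD:-s5cmd}"
DATA_DIR="${DATA_DIR:-/data/}"
MIRROR_DIR="${MIRROR_DIR:-/backup/s3mirror/}"
BUCKET="${BUCKET:-s3://dandi-bucket/}"

mirror_then_upload() {
  # Step 1: bring the mirror in line with the data directory.
  # --delete removes files from the mirror that no longer exist in the source.
  if ! "$RSYNC" -av --delete "$DATA_DIR" "$MIRROR_DIR"; then
    echo "rsync failed; skipping upload" >&2
    return 1
  fi

  # Step 2: push the mirror to S3. The function returns s5cmd's exit status.
  "$S5CMD" sync "$MIRROR_DIR" "$BUCKET"
}
```

Skipping the upload after a failed `rsync` avoids pushing a half-updated mirror to the bucket.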
**Pros and Cons of `rsync`** | ||
- Pros: | ||
|
||
- Efficient Local Comparison with rsync: `rsync` is highly optimized for local file comparisons, so it can quickly determine which files in `/data/` are new or modified and update only those in `/backup/s3mirror/`. This makes the sync step more lightweight, as `/backup/s3mirror/` will already contain only the files that need to be checked by `s5cmd`. | ||
- Reduced S3 API Calls: Since `s5cmd` syncs only the files in `/backup/s3mirror/`, it makes fewer API calls and checks against S3, which can save on costs and bandwidth when backups run frequently. | ||
- Works Well for Large Datasets: For extremely large directories, this approach minimizes the set of files `s5cmd` needs to process, since `rsync` has already filtered out unchanged files locally. | ||
|
||
- Cons: | ||
- Additional Complexity: You have to manage an extra directory (s3mirror) and an extra rsync step in your workflow. | ||
- Extra Storage Required: You need enough storage on your server to hold the s3mirror directory, which mirrors the contents of /data. | ||
|
||
### Globus | ||
[WIP] | ||
|
||
## Version Tracking on S3 | ||
[WIP] | ||
A tool that provides an efficient solution is being implemented by @jwodder in https://github.com/dandi/s3invsync. IMHO it should be the first candidate to try.