imaging-backup-scripts

Scripts to back up and archive data

Old approach

We used to create tarballs and move them to a separate bucket (s3://imaging-platform-cold) that is set to move all contents to Glacier after 7 days. We no longer do this. See instructions for restoring data from Glacier for tarballs that were archived using this method.

New approach

Our primary bucket is now in "Intelligent Tiering" mode. See notes here and read more about intelligent tiering here.

This means that there's no explicit processing for archiving files. They automatically and gradually move into a Glacier'd state after 6 months.

There are two approaches for restoring such files.

  1. To restore an individual file, run this
aws s3api \
  restore-object \
  --bucket BUCKET-NAME \
  --key KEY \
  --restore-request 'GlacierJobParameters={Tier=Standard}'

Run this to check on status

aws s3api \
  head-object \
  --bucket BUCKET-NAME \
  --key KEY

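In the output, the Restore field will read ongoing-request="false" when the data is ready to be retrieved. While the restore is still in progress, the output looks like this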

{
    "AcceptRanges": "bytes",
    "Restore": "ongoing-request=\"true\"",
    "ArchiveStatus": "ARCHIVE_ACCESS",
    "LastModified": "Thu, 01 Apr 2021 11:24:32 GMT",
    "ContentLength": 4438197,
    "ETag": "\"473ca4d8ad6889a90544f6acff916d31\"",
    "ContentType": "text/csv",
    "Metadata": {},
    "StorageClass": "INTELLIGENT_TIERING"
}
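The status check above can also be scripted. The following is a minimal sketch, assuming the JSON printed by `aws s3api head-object` has already been captured as a string; the function name `restore_in_progress` is hypothetical and not part of this repository.

```python
import json
import re


def restore_in_progress(head_object_json):
    """Interpret the Restore field of `aws s3api head-object` output.

    Returns True while a restore is ongoing, False once it has finished,
    and None when the output has no Restore field at all (no restore
    has been requested for the object).
    """
    info = json.loads(head_object_json)
    restore = info.get("Restore")
    if restore is None:
        return None
    # The field looks like: ongoing-request="true" (possibly followed by
    # other attributes), so search rather than match the whole string.
    match = re.search(r'ongoing-request="(true|false)"', restore)
    if match is None:
        raise ValueError("Unrecognized Restore field: %r" % restore)
    return match.group(1) == "true"
```

For the example output shown above, this returns `True`, meaning the file is not yet ready.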

KEY is the unique object identifier for the file you would like to restore. It can be found by clicking on the file in the console so that it shows the Object Overview. It resembles a path with folders even though S3 uses object storage, e.g. projects/PROJECT_NAME/workspace/metadata/BATCH_NAME/metadata.json
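If you have the file's location as an `s3://` URI rather than a separate bucket and key, the two values can be separated as in the sketch below. The helper name `split_s3_uri` is hypothetical; it only assumes that everything after the bucket name is the key.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError("Not an s3:// URI: %r" % uri)
    # The bucket is everything up to the first slash; the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError("URI must contain both a bucket and a key: %r" % uri)
    return bucket, key
```

The returned values are what `--bucket` and `--key` expect in the commands above. Note that the key has no leading slash.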

  2. For restoring a whole folder, use the restore_intelligent.py script. See the comments in the script for notes on retrieval cost.
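Because S3 "folders" are only shared key prefixes, restoring one amounts to issuing one restore request per object under that prefix. The sketch below illustrates that idea only; it is not the logic of restore_intelligent.py, which is not reproduced here. It assumes you already have a list of keys, and it builds the same `aws s3api restore-object` command used for individual files. Both function names are hypothetical.

```python
# Retrieval tiers accepted by the S3 restore API.
VALID_TIERS = ("Standard", "Bulk", "Expedited")


def build_restore_command(bucket, key, tier="Standard"):
    """Return the argv list for restoring one object with the AWS CLI."""
    if tier not in VALID_TIERS:
        raise ValueError("tier must be one of %s" % ", ".join(VALID_TIERS))
    return [
        "aws", "s3api", "restore-object",
        "--bucket", bucket,
        "--key", key,
        "--restore-request", "GlacierJobParameters={Tier=%s}" % tier,
    ]


def restore_commands_for_prefix(bucket, keys, prefix, tier="Standard"):
    """Build one restore command per key that falls under the given prefix."""
    return [
        build_restore_command(bucket, key, tier)
        for key in keys
        if key.startswith(prefix)
    ]
```

Each returned list can be passed to `subprocess.run` as-is. Passing arguments as a list avoids shell quoting problems with keys that contain spaces or braces.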

Slack discussions

https://broadinstitute.slack.com/archives/C3QFX04P7/p1627496601111300

We've now had some data in Intelligent Tiering long enough that it needs to be restored! Unfortunately, because of the way object stores work, while you can go into the AWS console and restore one file with a point and click, you can't do that with 'folders', because 'folders' in AWS aren't real. Our previous Glacier restore scripts a) were configured [for] only the file types we typically Glaciered (md5 or .tgz), b) wouldn't work directly out of the box even so because Intelligent Tiering doesn't restore for just X days, and c) weren't super full-featured in terms of only letting us grab subsets of things anyway. I've added a new restoration script to our imaging-backup-scripts repo; right now it only supports Intelligent Tiering restoration but if we think we want to in the future [we] could pretty easily add more features and/or get it to work on Glacier too. https://github.com/broadinstitute/imaging-backup-scripts/blob/master/restore_intelligent.py

https://broadinstitute.slack.com/archives/C3QFX04P7/p1627496974114300

You'll want to do this the night before you need to access any files that have not been used by anyone on AWS in >3 months, since restoration can "on average" take 3-5 hours; if you only need a handful of files you can use the expedited retrieval option, which takes only a few minutes, but in that case you likely want to just use the console anyway. Expedited retrieval is $300/plate for CellPainting data (vs 0 in standard), so please do use Standard for large data sets unless there is a very good reason not to!

https://broadinstitute.slack.com/archives/C3QFX04P7/p1627498625115200?thread_ts=1627496601.111300&cid=C3QFX04P7

[A]s a performance test [I restored] a plate that is 2300 files (aka about 10% the size of a "normal" plate) the restoration request [script] took about 8 minutes in total to run. So if you need to restore many plates, build in time for that [for the restoration request script to run BEFORE the 3-5 hours the restores will take], OR consider using something like parallel to execute the script for each of your many plates.
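The message above suggests GNU parallel for this. As an alternative illustration in Python, the sketch below runs one restore request job per plate concurrently with a thread pool. It is only a sketch: `request_restores` and its `request_one` argument are hypothetical names, and `request_one` stands in for whatever you use to request the restore of a single plate (for example, a function that invokes restore_intelligent.py, whose command-line arguments are not shown here).

```python
from concurrent.futures import ThreadPoolExecutor


def request_restores(plate_prefixes, request_one, max_workers=4):
    """Call request_one(prefix) for every plate prefix concurrently.

    Returns a dict mapping each prefix to the value request_one returned,
    or to the exception it raised, so one failing plate does not hide
    the outcome of the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every plate up front so the jobs overlap in time.
        futures = {
            prefix: pool.submit(request_one, prefix)
            for prefix in plate_prefixes
        }
        for prefix, future in futures.items():
            try:
                results[prefix] = future.result()
            except Exception as exc:
                results[prefix] = exc
    return results
```

Threads are a reasonable fit here because each job spends most of its time waiting on network requests. This only shortens the time spent submitting requests; the restores themselves still take the hours described above.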
