Inventory-based backup tool #197

Open
yarikoptic opened this issue Nov 14, 2024 · 16 comments

@yarikoptic
Member

A solution (or part of it) for

We have an inventory which gives us a transactional log of what has happened to the bucket. It is also dumped into the bucket itself:

dandi@drogon:/tmp$ aws s3 ls s3://dandiarchive/dandiarchive/dandiarchive/ | headtail
                           PRE 2019-10-02T04-00Z/
                           PRE 2019-10-03T04-00Z/
                           PRE 2019-10-04T04-00Z/
....
                           PRE 2024-11-12T01-00Z/
                           PRE 2024-11-13T01-00Z/
                           PRE data/
                           PRE hive/

(Note: I do not think that the earliest date is the ultimate beginning of the bucket, unfortunately, but that is probably OK.) For each dated folder we have

dandi@drogon:/tmp$ aws s3 ls s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/
2024-11-06 14:31:05         33 manifest.checksum
2024-11-06 14:31:05      72797 manifest.json

dandi@drogon:/tmp$ aws s3 cp s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.checksum .
download: s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.checksum to ./manifest.checksum
dandi@drogon:/tmp$ aws s3 cp s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.json .
download: s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.json to ./manifest.json
dandi@drogon:/tmp$ cat manifest.checksum 
03260719490f9dc0665b9d281fd4180a
dandi@drogon:/tmp$ md5sum manifest.json
03260719490f9dc0665b9d281fd4180a  manifest.json

and manifest.json actually contains pointers to the listings of items in the bucket

dandi@drogon:/tmp$ head -n 15 manifest.json
{
  "sourceBucket" : "dandiarchive",
  "destinationBucket" : "arn:aws:s3:::dandiarchive",
  "version" : "2016-11-30",
  "creationTimestamp" : "1730854800000",
  "fileFormat" : "CSV",
  "fileSchema" : "Bucket, Key, VersionId, IsLatest, IsDeleteMarker, Size, LastModifiedDate, ETag, IsMultipartUploaded",
  "files" : [ {
    "key" : "dandiarchive/dandiarchive/data/15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz",
    "size" : 169006021,
    "MD5checksum" : "bf6060d5dbf25c6443712b9846e812c2"
  }, {
    "key" : "dandiarchive/dandiarchive/data/0f25f9ec-cfc4-4e5a-932c-9a5ca5c2c8aa.csv.gz",
    "size" : 49364618,
    "MD5checksum" : "6d57525d105ef36e8228a408e4de29cc"

and those .csv.gz files are the compressed listings

dandi@drogon:/tmp$ aws s3 cp s3://dandiarchive/dandiarchive/dandiarchive/data/15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz .
download: s3://dandiarchive/dandiarchive/dandiarchive/data/15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz to ./15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz
dandi@drogon:/tmp$ zcat 15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz  | head -n 3
"dandiarchive","zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/76","J_lZKtM.5fliCCeIGTmufk3R9lc_FmZg","true","false","904","2022-04-09T21:52:14.000Z","fbebe2d44529d0a25247e8b5bca5956b","false"
"dandiarchive","zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/77","xy7HOh0ZZKXczghVqPCdijvxKQMAG_04","true","false","1548","2022-04-09T21:52:02.000Z","e0c48b2230b51636e3df32f3a15b65b8","false"
"dandiarchive","zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/78","aVAyFLM0rDCMyaIlIWmn9j3PgANjUPuc","true","false","1578","2022-04-09T21:52:02.000Z","e451a3b1afd3cff713a2a1d7120d68ba","false"

We need a tool which would efficiently download and then incrementally update a local backup of the bucket based on those logs. Some features to target/keep in mind:

  • should be efficient, so it should process multiple paths (keys) at once, and be smart enough not to process two transactions for the same key in parallel, etc.
  • ideally we should make the backup immediately usable as a "mirror" of the bucket at the current point in time, but also be a true backup going back, so if something was deleted -- we still have it on the local drive

One possible "approach" could be (see the sketch after the lists below):

  • reflect mtime for the key from inventory in the filename mtime when generating
  • if key is the latest version -- store under original path
    • store ETags and versionIds for the keys in that directory under .versions.json or similar
  • when a key is deleted or is about to be replaced with another one:
    • read the record from .versions.json for {etag} and {versionid}, mv that {path} to {path}.old.{versionid}.{etag} (which should preserve the original mtime), and remove the entry from .versions.json

this way we

  • have immediate access to all the keys as they are on the bucket
  • could always get to the desired versionId of any key (either via .versions.json or by looking through *.old.* files for the path)
  • do have mtime (stored within the filesystem), so we could prune "timed out" *.old.* files with a simple find command
  • avoid using symlinks etc. -- only 1 extra inode for that .versions.json in each folder
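
To make this concrete, here is a minimal sketch (helper names and the exact JSON layout are just illustrative, not a final design) of the "retire old version / record new version" steps described above:

import json
import os
from pathlib import Path

VERSIONS_FILE = ".versions.json"  # per-directory metadata file (name as proposed above)

def retire_current(path: Path) -> None:
    """Rename the current copy of `path` to {path}.old.{versionid}.{etag}
    and drop its entry from the directory's .versions.json."""
    vfile = path.parent / VERSIONS_FILE
    versions = json.loads(vfile.read_text()) if vfile.exists() else {}
    meta = versions.pop(path.name, None)
    if meta is not None and path.exists():
        # rename() keeps the mtime, so the retired version keeps its original timestamp
        path.rename(path.with_name(f"{path.name}.old.{meta['versionId']}.{meta['etag']}"))
    vfile.write_text(json.dumps(versions, indent=2))

def record_latest(path: Path, version_id: str, etag: str, mtime: float) -> None:
    """Register a freshly downloaded latest version in .versions.json
    and reflect the inventory mtime on the file itself."""
    vfile = path.parent / VERSIONS_FILE
    versions = json.loads(vfile.read_text()) if vfile.exists() else {}
    versions[path.name] = {"versionId": version_id, "etag": etag}
    vfile.write_text(json.dumps(versions, indent=2))
    os.utime(path, (mtime, mtime))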

There is also that hive/ prefix:

dandi@drogon:/tmp$ aws s3 ls s3://dandiarchive/dandiarchive/dandiarchive/hive/ | head
                           PRE dt=2019-10-02-04-00/
                           PRE dt=2019-10-03-04-00/
                           PRE dt=2019-10-04-04-00/
                           PRE dt=2019-10-05-04-00/
...
dandi@drogon:/tmp$ aws s3 ls s3://dandiarchive/dandiarchive/dandiarchive/hive/dt=2019-10-02-04-00/
2019-10-02 18:36:10         92 symlink.txt
dandi@drogon:/tmp$ aws s3 cp  s3://dandiarchive/dandiarchive/dandiarchive/hive/dt=2019-10-02-04-00/symlink.txt .
download: s3://dandiarchive/dandiarchive/dandiarchive/hive/dt=2019-10-02-04-00/symlink.txt to ./symlink.txt
dandi@drogon:/tmp$ cat symlink.txt 
s3://dandiarchive/dandiarchive/dandiarchive/data/d8dd3e2b-8f74-494b-9370-9e3a6c69e2b0.csv.gz

dandi@drogon:/tmp$ zcat ./d8dd3e2b-8f74-494b-9370-9e3a6c69e2b0.csv.gz
"dandiarchive","dandiarchive/dandiarchive/data/","XhLH9OkJH9aPWihWcbMKDzL5JGtBxgkt","true","false","0","2019-10-02T00:08:02.000Z","d41d8cd98f00b204e9800998ecf8427e","false"

which I do not yet know what it is about.

NB ATM I am running a sync of the inventory under drogon:/mnt/backup/dandi/dandiarchive-inventory

@jwodder
Member

jwodder commented Nov 14, 2024

@yarikoptic

  • What's the reason for a single manifest.json listing multiple CSV files? Is each CSV simply the events during a given slice of time for the day (hour?) in question?
  • Does the fileSchema for the CSVs ever differ from that shown above?
  • Do we even have space anywhere to back up the entire bucket at once?
  • Do you want the backup program to verify the checksums of the manifest.json and/or CSV files? What about ETags for keys (assuming they're always MD5 digests)?
  • Do you have any ideas or preferences for how the backup program should handle download failures? What about crashes or a Ctrl-C in the middle of a backup?

ideally we should make the backup immediately usable as a "mirror" of the bucket at the current point in time, but also be a true backup going back, so if something was deleted -- we still have it on the local drive

I assume you mean by the last part that keys deleted from the bucket should not be deleted from the backup. What about if a key is modified — then what should happen to the old version? What does "also be a true backup going back" mean?

reflect mtime for the key from inventory in the filename mtime when generating

I don't know what you're trying to say here (First problem: filename for what?), and as a result I can't make sense of the rest of the source paragraph.

  • I'm guessing that you mean that each synced file should have its mtime set to the mtime of the source key in S3. Furthermore, it seems that this should only be done for the purposes of pruning with find. Is that correct?
  • Regarding the .versions.json and {path}.old.{versionid}.{etag} files: What if an Archive user uploads a file whose name matches one of those patterns? For Zarr entries, at least, the resulting key in S3 would match the uploaded filename, causing problems with the backup.

@yarikoptic
Member Author

yarikoptic commented Nov 14, 2024

  • on the first two: something to check with the AWS docs, e.g. https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-location.html . I have not dug in yet to provide a trustworthy reply here
  • space is available at MIT, where such a tool would need to operate. We should first test the tool outside, on another sample bucket with an inventory. @satra - do we have any other one like that? We might need to populate one if not
  • I think verification of checksums would indeed be a good practice to implement
    • ideally we should even verify the ETag, at least for blobs and zarrs, where AFAIK it must match because our client enforces consistent chunking so that it matches the locally computed value. I do not think the ETag is generally guaranteed to match otherwise.
    • edit: I see IsMultipartUploaded, so we could verify the ETag for sure for all non-multipart uploads as well
  • Ideally it should be robust to download failures and be able to "track back"; e.g. if it were under git, as with backups2datalad, we could git reset --hard; git clean -dfx. But here that is not feasible, hence we might want to explicitly "model" a "rollback" regime to get back to the prior state -- remove fresh (complete or incomplete) downloads, undo mvs already done -- it might be worth keeping a journal of operations, or just being able to take the prior state and "match" it.

I assume you mean by the last part that keys deleted from the bucket should not be deleted from the backup.

correct!

What about if a key is modified — then what should happen to the old version?

  1. the same as if it was deleted:
    1. mv {path} {path}.old.{versionid}.{etag}
    2. remove {path} entry from .versions.json
  2. download a new version into {path} and add corresponding metadata to .versions.json

NB. This whole design is just an idea ATM. So if you see shortcomings or have recommendations -- we can improve!

What does "also be a true backup going back" mean?

Good point. Let me state it as a requirement: we should be able to identify/recover any version of any file ever uploaded to the archive, not just the current version. (We might later prune the .old.s, though, thus relaxing this requirement.)

reflect mtime for the key from inventory in the filename mtime when generating

I don't know what you're trying to say here (First problem: filename for what?)

I was trying to say that we should adjust the mtime of the downloaded key to match the mtime present in the inventory for that version, like wget etc. do. E.g., given the record

dandi@drogon:/tmp$ zcat 15a3a67a-6dec-44b6-9800-f8ddf5a44870.csv.gz  | head -n 3
"dandiarchive","zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/76","J_lZKtM.5fliCCeIGTmufk3R9lc_FmZg","true","false","904","2022-04-09T21:52:14.000Z","fbebe2d44529d0a25247e8b5bca5956b",

we would download zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/76 into the matching path locally, from versionId=J_lZKtM.5fliCCeIGTmufk3R9lc_FmZg, and then adjust that file's mtime to be 2022-04-09T21:52:14.000Z. This way we do not need to store the mtime anywhere (e.g. in .versions.json) but would have it available as the mtime of that file even after we mv it, AFAIK.
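
For illustration, a rough sketch of that download-and-set-mtime step (using boto3; the helper name and argument handling are mine, and the timestamp format follows the inventory rows above):

import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def download_version(bucket: str, key: str, version_id: str, last_modified: str, dest: str) -> None:
    """Download one specific version of a key and set the local mtime from the inventory row."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    s3.download_file(bucket, key, dest, ExtraArgs={"VersionId": version_id})
    # e.g. "2022-04-09T21:52:14.000Z" -> POSIX timestamp
    dt = datetime.strptime(last_modified, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    os.utime(dest, (dt.timestamp(), dt.timestamp()))

# download_version("dandiarchive", "zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/76",
#                  "J_lZKtM.5fliCCeIGTmufk3R9lc_FmZg", "2022-04-09T21:52:14.000Z",
#                  "backup/zarr/73107a2a-9eb2-47ed-be62-1feef97b8026/1/0/0/5/5/76")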

@yarikoptic
Member Author

  • space is available at MIT, where such a tool would need to operate. We should first test the tool outside, on another sample bucket with an inventory. @satra - do we have any other one like that? We might need to populate one if not

to test etc., it would be worth adding a few options:

  • --path-filter REGEX to operate only on the paths matching the regex (while going through the entire inventory)
  • --start-date DATE and --end-date DATE (dates matching the folder names in the inventory) to limit processing to some range of dates.

overall -- I think the tool should not be DANDI-specific at all, and thus could be of general interest to the public (which is why it is worth checking even more whether something like that already exists -- I have failed to find anything so far).
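
As a quick sketch of the command-line surface (the program name and positional arguments are placeholders; only the three options above come from this discussion):

import argparse
import re

parser = argparse.ArgumentParser(description="Inventory-based S3 bucket backup")
parser.add_argument("--path-filter", type=re.compile, metavar="REGEX",
                    help="only operate on keys matching this regex (the whole inventory is still read)")
parser.add_argument("--start-date", metavar="DATE",
                    help="earliest inventory date to process (matching the inventory folder names)")
parser.add_argument("--end-date", metavar="DATE",
                    help="latest inventory date to process")
parser.add_argument("inventory_url", help="s3:// URL of the inventory location")
parser.add_argument("backup_root", help="local directory holding the backup")
args = parser.parse_args()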

@jwodder
Member

jwodder commented Nov 14, 2024

@yarikoptic

  • It seems that the value of fileSchema depends on how the Archive configured its inventory generation. @dandi/archive-admin Can I rely on this value always being the same as shown above?

  • (Repeating from a recent edit to my previous comment) Regarding the .versions.json and {path}.old.{versionid}.{etag} files: What if an Archive user uploads a file whose name matches one of those patterns? For Zarr entries, at least, the resulting key in S3 would match the uploaded filename, causing problems with the backup.

Ideally it should be robust to download failures and be able to "track back"; e.g. if it were under git, as with backups2datalad, we could git reset --hard; git clean -dfx. But here that is not feasible, hence we might want to explicitly "model" a "rollback" regime to get back to the prior state -- remove fresh (complete or incomplete) downloads, undo mvs already done -- it might be worth keeping a journal of operations, or just being able to take the prior state and "match" it.

Please elaborate on exactly what behavior you want.

@jwodder
Member

jwodder commented Nov 14, 2024

@dandi/archive-admin Question: I'm looking at s3://dandiarchive/dandiarchive/dandiarchive/2024-11-06T01-00Z/manifest.json, which was apparently generated on 2024 Nov 6, yet the first CSV file listed in it contains only entries with mtimes on 2022 April 09. Why the discrepancy?

@satra
Member

satra commented Nov 14, 2024

those files are generated automatically by AWS Cloud Inventory. the CSVs together should be an inventory of all objects in that bucket. it's not a diff, but a total reflection of the bucket generated nightly.

@jwodder
Member

jwodder commented Nov 14, 2024

@yarikoptic Given Satra's comment above, should the backup program just process the latest manifest.json file (or the manifest for a single date, if one is given on the command line)?

@yarikoptic
Member Author

  • It seems that the value of fileSchema depends on how the Archive configured its inventory generation. @dandi/archive-admin Can I rely on this value always being the same as shown above?

Aiming for a generic tool, why not make it flexible -- read fileSchema from each manifest.json file, map the .csv columns correspondingly, and error out if some expected column is missing? That would make it easier for others to adopt. Related -- I do not think it is worth implementing support for any fileFormat other than CSV at this point.
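
A sketch of what that could look like, assuming the manifest structure shown at the top of this issue (the column names from fileSchema become the row keys):

import csv
import gzip
import json

EXPECTED = {"Key", "VersionId", "IsLatest", "IsDeleteMarker", "Size", "LastModifiedDate", "ETag"}

def read_inventory_csv(manifest_path: str, csv_path: str):
    """Yield one dict per inventory row, keyed by the column names from fileSchema."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    if manifest["fileFormat"] != "CSV":
        raise NotImplementedError("only CSV inventories are supported")
    columns = [c.strip() for c in manifest["fileSchema"].split(",")]
    missing = EXPECTED - set(columns)
    if missing:
        raise ValueError(f"inventory is missing expected columns: {sorted(missing)}")
    with gzip.open(csv_path, "rt", newline="") as f:
        for row in csv.reader(f):
            yield dict(zip(columns, row))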

(Repeating from a recent edit to my previous comment) Regarding the .versions.json and {path}.old.{versionid}.{etag} files: What if an Archive user uploads a file whose name matches one of those patterns? For Zarr entries, at least, the resulting key in S3 would match the uploaded filename, causing problems with the backup.

Let's

  • minimize the possibility of collision: make .versions.json become .dandi-s3-backup-versions.json instead
  • detect collisions: if a path in the received data points to .dandi-s3-backup-versions.json or matches the regex .*\.old\.{versionid_regex}\.{etag_regex} -- error out due to the collision. If we ever run into one, we will decide what to do, e.g. skip, obfuscate, or something else.
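
Sketch of that collision check (the exact regexes for version IDs and ETags are guesses and would need to be confirmed):

import re

VERSIONS_FILE = ".dandi-s3-backup-versions.json"
# Assumed shapes: S3 version IDs are opaque URL-safe strings; ETags are 32-hex-digit
# MD5 digests, possibly with a "-N" suffix for multipart uploads.
OLD_FILE_RE = re.compile(r".*\.old\.[A-Za-z0-9._-]+\.[0-9a-f]{32}(-[0-9]+)?$")

def check_collision(key: str) -> None:
    """Error out if a bucket key would collide with the backup's internal file names."""
    if key.rsplit("/", 1)[-1] == VERSIONS_FILE or OLD_FILE_RE.fullmatch(key):
        raise RuntimeError(f"bucket key {key!r} collides with backup-internal naming")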

Please elaborate on exactly what behavior you want.

rollback or match the prior state: add a function which would ensure that the current tree matches a specific inventory (sketched after the list below):

  • for a folder, go through the union of paths found in inventory, .dandi-s3-backup-versions.json, and on the drive (excluding .old. ones)
    • if a file is on the drive and/or in .dandi-s3-backup-versions.json but not in the inventory -- remove the file at {path} and its entry from .dandi-s3-backup-versions.json
    • if a file's record in the inventory does not match the one in .dandi-s3-backup-versions.json -- remove it from the drive and from .dandi-s3-backup-versions.json
    • if a file is not present on the drive but is in the inventory -- if there is a corresponding {path}.old.{versionid}.{etag}, rename it to {path} and adjust .dandi-s3-backup-versions.json accordingly
      • if there is no .old. file -- fail, shouldn't happen
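
A rough per-folder sketch of that reconciliation, under the assumptions above (inventory and versions are name-to-record mappings for one folder; all names are illustrative):

from pathlib import Path

def reconcile_folder(folder: Path, inventory: dict, versions: dict) -> None:
    """Make `folder` match `inventory`, using `versions` (the parsed
    .dandi-s3-backup-versions.json) and any {path}.old.{versionid}.{etag} files."""
    on_drive = {p.name for p in folder.iterdir()
                if p.is_file() and ".old." not in p.name
                and p.name != ".dandi-s3-backup-versions.json"}
    for name in sorted(on_drive | set(versions) | set(inventory)):
        path = folder / name
        expected = inventory.get(name)
        if expected is None:
            # on the drive and/or recorded, but not in the inventory -> remove
            path.unlink(missing_ok=True)
            versions.pop(name, None)
            continue
        if versions.get(name) != expected:
            # the recorded version does not match the inventory -> remove and forget it
            path.unlink(missing_ok=True)
            versions.pop(name, None)
        if not path.exists():
            # bring back the matching .old. copy; its absence should not happen
            old = folder / f"{name}.old.{expected['versionId']}.{expected['etag']}"
            if not old.exists():
                raise RuntimeError(f"cannot restore {path}: no matching .old. file")
            old.rename(path)
            versions[name] = expected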

But while thinking about it, I realized that the overall approach does not cover the case of a key switching between being a file and a directory.

  • When a file becomes a directory -- all is easy: the prior version gets renamed into {path}.old.{versionid}.{etag}, and a new {path}/ folder is created.
  • When a directory becomes a file -- just rename the directory to {path}.old.dandi-s3-backup if such does not exist yet. If it exists already -- nothing to be done.
    • need to add a check that a path does not end with .old.dandi-s3-backup to the conflict detection above
    • to reconstruct some prior key for a versionId we would need to inspect all parents, which might now carry the .old.dandi-s3-backup suffix

@yarikoptic Given Satra's comment above, should the backup program just process the latest manifest.json file (or the manifest for a single date, if one is given on the command line)?

In principle - yes, it could be coded first so it just processes the given (e.g. latest) manifest.json file.
But then we need a wrapper which would

  • keep track of which manifest was the latest processed "ok" and whether some other one was already being processed (when interrupted),
  • if found to have been interrupted -- first do the "match the prior state" step against "the latest processed" one
  • then, in a clean state -- process the next one and, upon completion, mark that next one as "the last processed"
  • do that in a loop until no new state is found (or until --end-date).
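
A sketch of that wrapper (the state file and the two callables are placeholders for whatever the real tool provides):

import json
from pathlib import Path

STATE_FILE = Path("backup-state.json")  # e.g. {"last_ok": "2024-11-12T01-00Z", "in_progress": null}

def run_wrapper(dates, process_manifest, match_prior_state, end_date=None):
    """Process inventory dates in order, remembering the last one completed OK."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    if state.get("in_progress"):
        # a previous run was interrupted: first make the tree match the last good state
        match_prior_state(state["last_ok"])
        state["in_progress"] = None
    for date in dates:
        if state.get("last_ok") and date <= state["last_ok"]:
            continue
        if end_date and date > end_date:
            break
        state["in_progress"] = date
        STATE_FILE.write_text(json.dumps(state))
        process_manifest(date)
        state.update(last_ok=date, in_progress=None)
        STATE_FILE.write_text(json.dumps(state))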

@jwodder
Member

jwodder commented Nov 14, 2024

@yarikoptic

Aiming for a generic tool, why not make it flexible -- read fileSchema from each manifest.json file, map the .csv columns correspondingly, and error out if some expected column is missing?

That would be rather tedious to implement. Do you need this for the first working version?

@jwodder
Member

jwodder commented Nov 14, 2024

@yarikoptic

for a folder, go through the union of paths found in inventory,

Because each set of inventories lists every single item in the bucket, this won't scale well. Just a single CSV file from the manifest you showed in the original comment contains three million entries.

@jwodder
Member

jwodder commented Nov 14, 2024

@yarikoptic

In principle - yes, it could be coded first so it just processes the given (e.g. latest) manifest.json file.

Are you actually saying "no" here?

But then we need a wrapper which would

  • keep track of which manifest was the latest processed "ok" and whether some other one was already being processed (when interrupted),
  • if found to have been interrupted -- first do the "match the prior state" step against "the latest processed" one
  • then, in a clean state -- process the next one and, upon completion, mark that next one as "the last processed"
  • do that in a loop until no new state is found (or until --end-date).

If there are multiple inventories between the last processed and the most recent, why would we process each one? As Satra said, they're not diffs; each set of inventory files is a complete listing of the bucket.

@yarikoptic
Member Author

In principle - yes, it could be coded first so it just processes the given (e.g. latest) manifest.json file.

Are you actually saying "no" here?

I am saying "no, it is not enough".

If there are multiple inventories between the last processed and the most recent, why would we process each one? As Satra said, they're not diffs; each set of inventory files is a complete listing of the bucket.

then you could potentially miss some versions of the files, which would be renamed into .old. if replaced with newer ones in the latest set.

@yarikoptic
Member Author

@yarikoptic

for a folder, go through the union of paths found in inventory,

Because each set of inventories lists every single item in the bucket, this won't scale well. Just a single CSV file from the manifest you showed in the original comment contains three million entries.

yes, there is a scalability concern, since we are expecting hundreds of millions of entries (e.g. https://github.com/dandisets/000108 alone accounts for 300 million files across its zarrs). If those lists are sorted, though, it might be quite easy, since then all files in a folder would form a sequential batch, and we would process that sequential batch from the inventory plus the files on the drive and in .dandi-s3-backup-versions.json only for that folder -- which would be either tiny or some thousands of entries, not more, at once.
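
Roughly, with key-sorted inventory rows (row dicts as in the earlier fileSchema sketch), the per-folder batching could be as simple as:

import posixpath
from itertools import groupby

def folder_batches(rows):
    """Group key-sorted inventory rows into per-folder batches.

    Because the rows arrive in key order, each run of identical dirname() values is
    contiguous; note that a folder can still appear more than once when a subfolder's
    keys sort in between its own files, so the per-folder step must tolerate being
    revisited for the same folder.
    """
    for folder, batch in groupby(rows, key=lambda r: posixpath.dirname(r["Key"])):
        yield folder, list(batch)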

@jwodder
Member

jwodder commented Nov 15, 2024

If there are multiple inventories between the last processed and the most recent, why would we process each one? As Satra said, they're not diffs; each set of inventory files is a complete listing of the bucket.

then you could potentially miss some versions of the files, which would be renamed into .old. if replaced with newer ones in the latest set.

It appears that the inventories include old/non-latest versions of keys along with deleted keys. What's left to miss?

@jwodder
Member

jwodder commented Nov 15, 2024

yes, there is a scalability concern, since we are expecting hundreds of millions of entries (e.g. https://github.com/dandisets/000108 alone accounts for 300 million files across its zarrs). If those lists are sorted, though, it might be quite easy, since then all files in a folder would form a sequential batch, and we would process that sequential batch from the inventory plus the files on the drive and in .dandi-s3-backup-versions.json only for that folder -- which would be either tiny or some thousands of entries, not more, at once.

But wouldn't the rollback have to be run against the entire backup tree, which would be huge? Doing that in response to an error or Ctrl-C seems absurdly time-consuming.

@yarikoptic
Member Author

then you could potentially miss some versions of the files, which would be renamed into .old. if replaced with newer ones in the latest set.

It appears that the inventories include old/non-latest versions of keys along with deleted keys. What's left to miss?

Oh, I was not aware of that! If I understand correctly, every next date contains information about all versions of all keys and, unless GC picked some up, would be a superset of the prior ones. Since we now have "trailing delete" enabled for S3, for the initial run we do not even want to go through prior snapshots -- we'd better start with the latest one, since it gives the best chance of having access to all versions of the keys. Great.

But it might be useful to start from a few dates back, so we could test correct operation while going to the "next" inventory version, as we would need to do daily (that is the motivation here -- efficient daily backup of new changes, without explicitly listing the entire bucket every day).

But wouldn't the rollback have to be run against the entire backup tree, which would be huge? Doing that in response to an error or Ctrl-C seems absurdly time-consuming.

Indeed. FWIW, aiming for incremental changes, I think we can minimize the window during which an interruption would require such a full-blown rollback. E.g. if we

  1. do a full analysis of which keys need to be downloaded, renamed, and removed, without any changes to the data on the drive. An interruption here would require no rollback or other cleanup actions
  2. do all necessary downloads into a temporary space, e.g. .dandi-s3-backup-downloads/{versionid}.{etag} at the top folder. If interrupted at this stage -- we would just need to rm -rf .dandi-s3-backup-downloads/.
  3. final stage: "expensive" to recover from if interrupted, hence interruption should be guarded (e.g. require at least 3 Ctrl-C's within 5 seconds, otherwise do not react): perform all planned rms and mvs, and remove the then-empty .dandi-s3-backup-downloads/ at the end (all files should be gone)
    • actually, if we keep a journal of those rms and mvs, we could probably play it back as well, giving a "lighter" way to recover, but I am not sure I would 100% trust it, so running a full fsck would still be desired for paranoid me.

WDYT? Maybe there are better ways?
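
To make stages 2 and 3 concrete, a rough sketch (the action tuples, the journal format, and the fetch callable are all illustrative):

import json
import shutil
from pathlib import Path

DOWNLOADS = Path(".dandi-s3-backup-downloads")
JOURNAL = Path(".dandi-s3-backup-journal.jsonl")

def apply_plan(plan, fetch):
    """plan: a list of ("get", staging_name, dst), ("mv", src, dst), ("rm", path) actions
    produced by stage 1; fetch(staging_name, tmp_path) performs the actual download."""
    # Stage 2: all downloads go into the staging area first; an interruption here
    # only requires `rm -rf .dandi-s3-backup-downloads/`.
    DOWNLOADS.mkdir(exist_ok=True)
    for action in plan:
        if action[0] == "get":
            fetch(action[1], DOWNLOADS / action[1])
    # Stage 3: the guarded part -- journal every rm/mv/move before doing it, so an
    # interrupted run can be replayed (or at least fsck'ed) afterwards.
    with JOURNAL.open("a") as journal:
        for action in plan:
            journal.write(json.dumps(list(action)) + "\n")
            journal.flush()
            if action[0] == "rm":
                Path(action[1]).unlink()
            elif action[0] == "mv":
                Path(action[1]).rename(action[2])
            elif action[0] == "get":
                shutil.move(str(DOWNLOADS / action[1]), action[2])
    shutil.rmtree(DOWNLOADS)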
