Add data mover service #721
Also cc'ing @andrewpatto here.
The Steps S3 copy service copies objects listed in a source CSV, fanning out jobs to Fargate with an adjustable fan size. It uses rclone under the hood. Its main advantage is that you can specify the full list of files in a CSV of arbitrary size, which it consumes using a Steps Distributed Map, so it is good for selective copying. If you just want to do a …
Because you specify each file in the source CSV, it can also do selective tasks on each individual file. So it can thaw files from Glacier, or could be taught to decompress ORAs, etc.
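For illustration, a per-object thaw is roughly the following boto3 call; this is a minimal sketch, not the service's actual code, and the bucket, key, and restore tier are placeholders.

```python
# Minimal sketch: request a Glacier restore ("thaw") for a single object.
# Bucket, key, and tier are placeholders; the real service would drive this per CSV line.
import boto3

s3 = boto3.client("s3")

def thaw_object(bucket: str, key: str, days: int = 7) -> None:
    head = s3.head_object(Bucket=bucket, Key=key)
    # Only objects in an archive storage class need restoring.
    if head.get("StorageClass") in ("GLACIER", "DEEP_ARCHIVE"):
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": "Bulk"}},
        )

thaw_object("example-cache-bucket", "some/portalRunId/file.bam")
```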
It is also possible that rclone is not quite right for the task; I am toying with a true S3 clone CLI tool as part of GUARDIANs (building on the cloud checksum tool). But that would be way in the future. For now rclone is fine.
Happy to use steps-s3-copy for the first part - it seems like it does this and more?
Does it also support copying all objects via a key prefix, e.g. copying everything under one prefix to another? Also, it's just doing the copy, right? If I wanted to delete objects at the source (to make it a move), that's separate?
I can definitely see this as an extension of cloud-checksum, especially using a similar timed Lambda approach, which could copy objects in parts.
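A minimal sketch of what copying an object in parts looks like with the S3 multipart APIs, which is the kind of unit of work a timed Lambda could checkpoint between invocations; the bucket names and part size are illustrative only, not anything cloud-checksum actually does.

```python
# Sketch: server-side copy of one object done in parts via the S3 multipart APIs.
import boto3

s3 = boto3.client("s3")
PART_SIZE = 64 * 1024 * 1024  # 64 MiB per part (illustrative)

def copy_in_parts(src_bucket: str, src_key: str, dst_bucket: str, dst_key: str) -> None:
    size = s3.head_object(Bucket=src_bucket, Key=src_key)["ContentLength"]
    upload = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)
    parts = []
    for part_number, start in enumerate(range(0, size, PART_SIZE), start=1):
        end = min(start + PART_SIZE, size) - 1
        resp = s3.upload_part_copy(
            Bucket=dst_bucket,
            Key=dst_key,
            UploadId=upload["UploadId"],
            PartNumber=part_number,
            CopySource={"Bucket": src_bucket, "Key": src_key},
            CopySourceRange=f"bytes={start}-{end}",
        )
        parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": part_number})
    s3.complete_multipart_upload(
        Bucket=dst_bucket,
        Key=dst_key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```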
It passes the args (each line of the CSV) through to rclone, so because rclone supports wildcards it can too. But that then has to skip the thawing stage (because the thawing doesn't understand wildcards). So basically yes for prefix copies, but it could do with some refactoring to support it better.
Yeah, it just does copies, but it could easily have a delete-on-successful-copy Lambda at the end.
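As a rough illustration, such a final step could look like the sketch below; the event shape and field names are assumptions, not the actual steps-s3-copy contract.

```python
# Hypothetical delete-on-successful-copy Lambda: removes source objects once an
# upstream step has reported them as successfully copied. Event fields are assumed.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["sourceBucket"]
    keys = event["copiedKeys"]  # keys reported as successfully copied upstream
    # delete_objects accepts at most 1000 keys per call
    for i in range(0, len(keys), 1000):
        batch = [{"Key": k} for k in keys[i:i + 1000]]
        s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})
    return {"deleted": len(keys)}
```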
I'd argue that a simple Fargate task that takes in a source and destination and simply runs …
Also, counting the objects yields 363 in one case and 1595 in another. I'm not sure we need to fan out 363 Fargate jobs when the median file size is less than 2 KB.
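A minimal sketch of that simpler mover, assuming the container entrypoint just wraps `aws s3 sync` (the exact command referred to above is not preserved in the thread, so this is an assumption):

```python
# Hypothetical entrypoint for a "simple Fargate" mover: take a source and destination
# URI and shell out to `aws s3 sync`. rclone could be substituted the same way.
import subprocess
import sys

def main() -> int:
    # e.g. s3://cache-bucket/<portalRunId>/ s3://archive-bucket/<portalRunId>/
    src, dst = sys.argv[1], sys.argv[2]
    return subprocess.run(["aws", "s3", "sync", src, dst]).returncode

if __name__ == "__main__":
    sys.exit(main())
```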
Create a new service which can move data from one place to another. For now, this is meant to move data based on a `portalRunId` in the cache bucket, to a destination directory with the same `portalRunId` in the archive bucket.

There should be two parts to this service:

- A mover task that performs the copy, e.g. running `sync` in Bash/Python. This is probably the simplest.
- A handler that receives an `ArchiveData` (name pending) event and starts the mover task (a rough sketch follows the list). There should also be an event that triggers when the archiving is complete.
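For illustration only, the handler half could look roughly like this; the event shape, environment variable names, and container name are placeholders rather than decided interfaces.

```python
# Hypothetical handler: receives the (name pending) ArchiveData event and launches
# the Fargate mover task. Cluster, task definition, subnets and bucket names are
# placeholders taken from environment variables.
import os
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    portal_run_id = event["detail"]["portalRunId"]  # assumed EventBridge event shape
    ecs.run_task(
        cluster=os.environ["MOVER_CLUSTER"],
        taskDefinition=os.environ["MOVER_TASK_DEFINITION"],
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": os.environ["SUBNETS"].split(","),
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "mover",
                    "command": [
                        f"s3://{os.environ['CACHE_BUCKET']}/{portal_run_id}/",
                        f"s3://{os.environ['ARCHIVE_BUCKET']}/{portal_run_id}/",
                    ],
                }
            ]
        },
    )
    # The "archiving is complete" event would be emitted separately, e.g. from an
    # EventBridge rule on the ECS task state-change when the mover task stops.
```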
Tasks
@reisingerf @alexiswl @victorskl let me know if this sounds right or if anything should be added/changed.