
Add data mover service #721

Closed
2 tasks
mmalenic opened this issue Nov 26, 2024 · 9 comments · Fixed by #726, #735, #749 or #754
@mmalenic
Member

mmalenic commented Nov 26, 2024

Create a new service that can move data from one place to another. For now, this is meant to move data under a portalRunId prefix in the cache bucket to a destination directory with the same portalRunId in the archive bucket.

There should be two parts to this service:

  • The actual mover task/job. This could be a Fargate task that does the move. Some options for how to implement it:
    • Script the AWS CLI to sync in Bash/Python. This is probably the simplest (see the sketch after this list).
    • Use rclone?
  • The event trigger/wrapper, which consumes an ArchiveData (name pending) event and starts the mover task. There should also be an event emitted when the archiving is complete.
    • This could be implemented as Step Functions, which trigger the task and wait until it completes.
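
For the AWS CLI option, here is a minimal sketch of what the mover task could run. This is only an illustration: the bucket names are hypothetical placeholders and error handling is omitted.

#!/usr/bin/env bash
# Minimal sketch of the mover task using the AWS CLI (bucket names are placeholders).
set -euo pipefail

portal_run_id="$1"   # e.g. 20241105b58605ff
src="s3://example-cache-bucket/${portal_run_id}/"
dst="s3://example-archive-bucket/${portal_run_id}/"

# Copy everything under the portalRunId prefix into the archive bucket.
aws s3 sync "$src" "$dst"

# To make it a move rather than a copy, the source could be deleted afterwards:
# aws s3 rm "$src" --recursive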

Tasks

  • Implement mover task/job.
  • Implement event trigger component.

@reisingerf @alexiswl @victorskl let me know if this sounds right or if anything should be added/changed.

@mmalenic mmalenic self-assigned this Nov 26, 2024
@mmalenic mmalenic added the feature New feature label Nov 26, 2024
@ohofmann
Member

Also cc'ing @andrewpatto here

@andrewpatto
Member

The Steps S3 copy service does copies of objects listed in a source CSV - fanning out jobs to Fargate with an adjustable fan size.

It uses rclone under the hood.

Its main advantage is that you can specify the full list of files in a CSV of arbitrary size, which it can consume using a Steps Distributed Map. So it is good for selective copying. If you just want to do a rclone src/* dest/, then other options are possibly more straightforward.
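
As a rough sketch of the two shapes mentioned above (assuming an rclone S3 remote named s3 is already configured, and using hypothetical bucket and prefix names):

# Whole-prefix copy, roughly the "rclone src/* dest/" case:
rclone copy s3:example-src-bucket/some/prefix/ s3:example-dest-bucket/some/prefix/

# Per-object copy, closer to driving one CSV line per file:
rclone copyto s3:example-src-bucket/some/prefix/file.bam s3:example-dest-bucket/some/prefix/file.bam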

@andrewpatto
Member

Because you specify each file in the source CSV, it can also do selective tasks on each individual file. So it can thaw files from Glacier, or could be taught to decompress ORA files, etc.

@andrewpatto
Member

It is also possible that rclone is not quite right for the task. I am toying with a true S3 clone CLI tool as part of GUARDIANs (building on the cloud checksum tool), but that would be way in the future. For now rclone is fine.

@mmalenic
Member Author

Happy to use steps-s3-copy for the first part - it seems like it does this and more?

> The Steps S3 copy service does copies of objects listed in a source CSV - fanning out jobs to Fargate with an adjustable fan size.

Does it also support copying all objects via a key prefix, e.g. copy from <bucket1>/<prefix>/* to <bucket2>/<prefix>/?

Also, it's just doing the copy, right? If I wanted to delete objects at the source (to make it a move), that's separate?

@mmalenic
Member Author

mmalenic commented Nov 26, 2024

> I am toying with a true S3 clone CLI tool as part of GUARDIANs (building on the cloud checksum tool).

I can definitely see this as an extension of cloud-checksum, especially using a similar timed Lambda approach, which could copy objects in parts.
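
As a rough illustration of the copy-in-parts idea (not the actual tool), a server-side multipart copy can be driven from the plain AWS CLI. Everything here is hypothetical: the bucket names, the key, and the single 100 MiB part.

# Sketch only: copy one byte range of the source object server-side as a multipart upload part.
src_bucket=example-src-bucket
dst_bucket=example-dest-bucket
key=path/to/object

upload_id=$(aws s3api create-multipart-upload \
  --bucket "$dst_bucket" --key "$key" \
  --query UploadId --output text)

# Bytes 0-104857599 of the source become part 1; no data passes through the client.
etag=$(aws s3api upload-part-copy \
  --bucket "$dst_bucket" --key "$key" \
  --upload-id "$upload_id" --part-number 1 \
  --copy-source "$src_bucket/$key" \
  --copy-source-range bytes=0-104857599 \
  --query CopyPartResult.ETag --output text)

# Repeat upload-part-copy per range, then complete. The returned ETag value already
# contains its surrounding quotes, so it can be spliced into the JSON as-is.
aws s3api complete-multipart-upload \
  --bucket "$dst_bucket" --key "$key" --upload-id "$upload_id" \
  --multipart-upload "{\"Parts\":[{\"PartNumber\":1,\"ETag\":$etag}]}"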

@andrewpatto
Member

> The Steps S3 copy service does copies of objects listed in a source CSV - fanning out jobs to Fargate with an adjustable fan size.

> Does it also support copying all objects via a key prefix, e.g. copy from <bucket1>/<prefix>/* to <bucket2>/<prefix>/?

It passes the args (each line of the CSV) through to rclone, so because rclone supports wildcards, it can too. But that, for instance, has to skip the thawing stage (because the thawing doesn't understand wildcards). So basically yes for prefix copies, but it could do with some refactoring to support them better.

> Also, it's just doing the copy, right? If I wanted to delete objects at the source (to make it a move), that's separate?

Yeah, it just does copies, but it could easily have a delete-on-successful-copy Lambda at the end.

@alexiswl
Member

I'd argue that a simple Fargate task that takes in a source and destination and simply runs aws s3 mv is probably more appropriate here given the timeframe.
We already have ORA data on our BYOB that we want to move into ARCHIVE, and we still need to move analysis data out of 'data-migration' (ICAv1 legacy data) into the BYOB so it can then be moved into archive.
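
As a sketch, the task's entrypoint could be little more than the following (the archive bucket name is a placeholder):

# aws s3 mv copies each object under the prefix and then deletes the source copy.
aws s3 mv \
  s3://pipeline-prod-cache-503977275616-ap-southeast-2/byob-icav2/production/analysis/cttsov2/20241105b58605ff/ \
  s3://example-archive-bucket/byob-icav2/production/analysis/cttsov2/20241105b58605ff/ \
  --recursive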

@alexiswl
Member

Also

aws s3api list-objects-v2 \
  --output=json \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --prefix byob-icav2/production/analysis/cttsov2/20241105b58605ff/ | \
jq '.Contents | length'

Yields

363

I'm not sure we need to fan out 363 Fargate jobs when the median file size is less than 2 KB.

aws s3api list-objects-v2 \
  --output=json \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --prefix byob-icav2/production/analysis/cttsov2/20241105b58605ff/ | \
jq \
  '
    .Contents |
    sort_by(.Size) |
    .[length/2|floor].Size
  '

Yields

1595
