
Add data mover service #721

Closed
2 tasks
mmalenic opened this issue Nov 26, 2024 · 9 comments · Fixed by #726, #735, #749 or #754
@mmalenic
Member

mmalenic commented Nov 26, 2024

Create a new service that can move data from one place to another. For now, this is meant to move data under a portalRunId prefix in the cache bucket to a destination directory with the same portalRunId in the archive bucket.

There should be two parts to this service:

  • The actual mover task/job. This could be a Fargate task that does the move. Some options for how to implement it:
    • Script the AWS CLI to sync in Bash/Python. This is probably the simplest (see the sketch after this list).
    • Use rclone?
  • The event trigger/wrapper, which consumes an ArchiveData (name pending) event and starts the mover task. There should also be an event emitted when the archiving is complete.
    • This could be implemented as Step Functions, which trigger the task and wait until it completes.
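
For the AWS CLI option, here is a minimal sketch of what the mover task could run. This is only an illustration: the bucket names are hypothetical placeholders and error handling is omitted.

#!/usr/bin/env bash
# Minimal sketch of the mover task using the AWS CLI (bucket names are placeholders).
set -euo pipefail

portal_run_id="$1"   # e.g. 20241105b58605ff
src="s3://example-cache-bucket/${portal_run_id}/"
dst="s3://example-archive-bucket/${portal_run_id}/"

# Copy everything under the portalRunId prefix into the archive bucket.
aws s3 sync "$src" "$dst"

# To make it a move rather than a copy, the source could be deleted afterwards:
# aws s3 rm "$src" --recursive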

Tasks

  • Implement mover task/job.
  • Implement event trigger component.

@reisingerf @alexiswl @victorskl let me know if this sounds right or if anything should be added/changed.

@mmalenic mmalenic self-assigned this Nov 26, 2024
@mmalenic mmalenic added the feature New feature label Nov 26, 2024
@ohofmann
Member

Also cc'ing @andrewpatto here

@andrewpatto
Member

The Steps S3 copy service does copies of objects listed in a source CSV - fanning out jobs to Fargate with an adjustable fan size.

It uses rclone under the hood.

Its main advantage is that you can specify the full list of files in a CSV of arbitrary size, which it can consume using a Steps Distributed Map. So it is good for selective copying. If you just want to do a rclone src/* dest/, then other options are possibly more straightforward.
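
As a rough sketch of the two shapes mentioned above (assuming an rclone S3 remote named s3 is already configured, and using hypothetical bucket and prefix names):

# Whole-prefix copy, roughly the "rclone src/* dest/" case:
rclone copy s3:example-src-bucket/some/prefix/ s3:example-dest-bucket/some/prefix/

# Per-object copy, closer to driving one CSV line per file:
rclone copyto s3:example-src-bucket/some/prefix/file.bam s3:example-dest-bucket/some/prefix/file.bam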

@andrewpatto
Member

Because you specify each file in the source CSV, it can also do selective tasks on each individual file. So it can thaw files from Glacier, or could be taught to decompress ORA files, etc.

@andrewpatto
Member

It is also possible that rclone is not quite right for the task. I am toying with a true S3 clone CLI tool as part of GUARDIANs (building on the cloud checksum tool), but that would be way in the future. For now rclone is fine.

@mmalenic
Member Author

Happy to use steps-s3-copy for the first part - it seems like it does this and more?

> The Steps S3 copy service does copies of objects listed in a source CSV - fanning out jobs to Fargate with an adjustable fan size.

Does it also support copying all objects via a key prefix, e.g. copy from <bucket1>/<prefix>/* to <bucket2>/<prefix>/?

Also, it's just doing the copy, right? If I wanted to delete objects at the source (to make it a move), that's separate?

@mmalenic
Member Author

mmalenic commented Nov 26, 2024

> I am toying with a true S3 clone CLI tool as part of GUARDIANs (building on the cloud checksum tool).

I can definitely see this as an extension of cloud-checksum, especially using a similar timed Lambda approach, which could copy objects in parts.
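
As a rough illustration of the copy-in-parts idea (not the actual tool), a server-side multipart copy can be driven from the plain AWS CLI. Everything here is hypothetical: the bucket names, the key, and the single 100 MiB part.

# Sketch only: copy one byte range of the source object server-side as a multipart upload part.
src_bucket=example-src-bucket
dst_bucket=example-dest-bucket
key=path/to/object

upload_id=$(aws s3api create-multipart-upload \
  --bucket "$dst_bucket" --key "$key" \
  --query UploadId --output text)

# Bytes 0-104857599 of the source become part 1; no data passes through the client.
etag=$(aws s3api upload-part-copy \
  --bucket "$dst_bucket" --key "$key" \
  --upload-id "$upload_id" --part-number 1 \
  --copy-source "$src_bucket/$key" \
  --copy-source-range bytes=0-104857599 \
  --query CopyPartResult.ETag --output text)

# Repeat upload-part-copy per range, then complete. The returned ETag value already
# contains its surrounding quotes, so it can be spliced into the JSON as-is.
aws s3api complete-multipart-upload \
  --bucket "$dst_bucket" --key "$key" --upload-id "$upload_id" \
  --multipart-upload "{\"Parts\":[{\"PartNumber\":1,\"ETag\":$etag}]}"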

@andrewpatto
Member

> The Steps S3 copy service does copies of objects listed in a source CSV - fanning out jobs to Fargate with an adjustable fan size.

> Does it also support copying all objects via a key prefix, e.g. copy from <bucket1>/<prefix>/* to <bucket2>/<prefix>/?

It passes the args (each line of the CSV) through to rclone, so because rclone supports wildcards, it can too. But that, for instance, has to skip the thawing stage (because the thawing doesn't understand wildcards). So basically yes for prefix copies, but it could do with some refactoring to support them better.

> Also, it's just doing the copy, right? If I wanted to delete objects at the source (to make it a move), that's separate?

Yeah, it just does copies, but it could easily have a delete-on-successful-copy Lambda at the end.

@alexiswl
Member

I'd argue that a simple Fargate task that takes in a source and destination and simply runs aws s3 mv is probably more appropriate here given the timeframe.
We already have ORA data on our BYOB that we want to move into ARCHIVE, and we still need to move analysis data out of 'data-migration' (ICAv1 legacy data) into the BYOB so it can then be moved into archive.
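
As a sketch, the task's entrypoint could be little more than the following (the archive bucket name is a placeholder):

# aws s3 mv copies each object under the prefix and then deletes the source copy.
aws s3 mv \
  s3://pipeline-prod-cache-503977275616-ap-southeast-2/byob-icav2/production/analysis/cttsov2/20241105b58605ff/ \
  s3://example-archive-bucket/byob-icav2/production/analysis/cttsov2/20241105b58605ff/ \
  --recursive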

@alexiswl
Member

Also

aws s3api list-objects-v2 \
  --output=json \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --prefix byob-icav2/production/analysis/cttsov2/20241105b58605ff/ | \
jq '.Contents | length'

Yields

363

I'm not sure we need to fan out 363 Fargate jobs when the median file size is less than 2 KB.

aws s3api list-objects-v2 \
  --output=json \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --prefix byob-icav2/production/analysis/cttsov2/20241105b58605ff/ | \
jq \
  '
    .Contents |
    sort_by(.Size) |
    .[length/2|floor].Size
  '

Yields

1595
