
All large files in cohort folders should be placed on Archive Instant tiering #719

Open
alexiswl opened this issue Nov 25, 2024 · 9 comments
Labels
investigation (Look into how best to approach the problem), question (Further information is requested)

Comments

@alexiswl
Member

We have a set of cohort data split into projects. Listing the bucket shows the following:

aws s3 ls s3://pipeline-prod-cache-503977275616-ap-southeast-2/byob-icav2/
                           PRE cohort-apgi-prod/
                           PRE cohort-brca-atlas-prod/
                           PRE cohort-column-pi-prod/
                           PRE cohort-hmf-pdac-prod/
                           PRE cohort-pdac-prod/
                           PRE cohort-super-prod/
                           PRE ctdna-tso-v2-6-validation-prod/
                           PRE external-agrf-prod/
                           PRE production/
                           PRE reference-data/
                           PRE validation-data/
                           PRE wgs-accreditation-prod/

Any bam files in the cohort-* directories should be sent to Intelligent Tiering 'Archive Instant'.

@alexiswl alexiswl added the feature New feature label Nov 25, 2024
@alexiswl
Member Author

@skanwal

@mmalenic
Member

Do these come in as Standard by default, or are they placed into Intelligent Tiering with the default frequent access, and then need to be moved into Archive Instant? Seems like a lot of bams are already in Intelligent Tiering.
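For reference, one way to spot-check how objects under a cohort prefix currently land (a sketch only; the bucket and prefix are taken from the listing above, and note that for objects already in Intelligent-Tiering the API reports the storage class but not which access tier within it they currently sit in):

```sh
# List .bam objects under one cohort prefix together with their storage class
# (e.g. STANDARD vs INTELLIGENT_TIERING). Prefix chosen for illustration only.
aws s3api list-objects-v2 \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --prefix byob-icav2/cohort-apgi-prod/ \
  --query 'Contents[?ends_with(Key, `.bam`)].[Key, StorageClass]' \
  --output text
```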

@victorskl
Member

Please run this past Flo to check. Is this managed by the bucket lifecycle? @reisingerf

@reisingerf
Member

The bucket (byob prefix) is configured to push everything into IT (see here).

AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)

@victorskl
Member

Hang on.

I recall from early discussions that we wished not to put any objects in an operational-ready store (like the pipeline-cache bucket) into archive tier classes, but rather to move them into a dedicated archive bucket.

Do we change this view now, given the complexity of the current situation?
Let us catch up again, please.

@mmalenic
Member

> AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)

Adding to this, I think we can specify Archive Access or Deep Archive Access, but then it's no longer instant-ish retrieval, and it needs to be restored. But yeah, it looks like the other tiers are handled automatically by AWS.
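For context, opting into those tiers is a bucket-level Intelligent-Tiering configuration rather than something set per object; a sketch (the ID is illustrative, and 90/180 days are the minimum thresholds AWS allows):

```sh
# Enable the optional Archive Access / Deep Archive Access tiers for the bucket.
# Objects moved into these tiers must be restored before they can be read.
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --id archive-tiers-example \
  --intelligent-tiering-configuration '{
    "Id": "archive-tiers-example",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90,  "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'
```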

@alexiswl
Member Author

What about 'S3 Glacier Instant Retrieval', which we could force .bam files into after, say, one week?
This would have the same storage pricing as the Archive Instant Access tier ($5 per TB per month), but we wouldn't need to sit through 30 days of Standard-priced storage ($25 per TB per month) plus another 60 days in the Infrequent Access tier ($13 per TB per month) before reaching Archive Instant Access, which in any case is reset to Frequent Access as soon as the data is touched.

The data retrieval pricing of S3 Glacier Instant Retrieval is $0.03 per GB, so a 100 GB bam would cost $3 to retrieve.

The same bam would cost $5.10 for its first 90 days of storage on Intelligent Tiering (0.1 TB × $25 for the first month plus 0.1 TB × $13 for each of the next two months, i.e. $2.50 + $2.60).

@reisingerf
Member

All good points, but optimisations in my view...

Ultimately, I'd like to get to a point where we have different storage back-ends, with different retention / tiering options, and can choose between them based on use case (project, research, clinical, ... ) and potentially cost attribution.
I think the OrcaBus system can handle that, but it will take some time to get set up and automated.

Having said that: yes, for well-known use cases / projects, we could start by changing the lifecycle configuration and managing it per cohort/project prefix rather than for the whole BYOB share.
Note: we need to change and split the current setup (instead of just "overwriting" with more specific configurations).
See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-conflicts.html
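A sketch of what such a split could look like, assuming one rule per cohort prefix transitioning to Glacier Instant Retrieval after 7 days alongside the existing catch-all Intelligent-Tiering behaviour re-expressed per prefix. All prefixes, rule IDs and day counts here are illustrative. Two caveats: put-bucket-lifecycle-configuration replaces the bucket's entire lifecycle configuration, so every rule has to be submitted together in one document, and lifecycle filters cannot match a `.bam` suffix, only key prefixes, tags and object sizes.

```sh
# Illustrative lifecycle configuration: non-cohort data keeps the current
# Intelligent-Tiering behaviour, while one cohort prefix moves to
# Glacier Instant Retrieval (GLACIER_IR) after 7 days.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "production-to-intelligent-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "byob-icav2/production/"},
      "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}]
    },
    {
      "ID": "cohort-apgi-prod-to-glacier-ir",
      "Status": "Enabled",
      "Filter": {"Prefix": "byob-icav2/cohort-apgi-prod/"},
      "Transitions": [{"Days": 7, "StorageClass": "GLACIER_IR"}]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --lifecycle-configuration file://lifecycle.json
```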

@victorskl victorskl added question Further information is requested investigation Look into how best to approach the problem and removed feature New feature labels Dec 10, 2024
@victorskl
Member

Next

  • Need business process strategy to refine cohort data retention policy
