
All large files in cohort folders should be placed on Archive Instant tiering #719

Open
alexiswl opened this issue Nov 25, 2024 · 9 comments
Labels
investigation (Look into how best to approach the problem), question (Further information is requested)

Comments

@alexiswl
Member

We have a set of cohort data split into projects. Listing the bucket shows the following:

aws s3 ls s3://pipeline-prod-cache-503977275616-ap-southeast-2/byob-icav2/
                           PRE cohort-apgi-prod/
                           PRE cohort-brca-atlas-prod/
                           PRE cohort-column-pi-prod/
                           PRE cohort-hmf-pdac-prod/
                           PRE cohort-pdac-prod/
                           PRE cohort-super-prod/
                           PRE ctdna-tso-v2-6-validation-prod/
                           PRE external-agrf-prod/
                           PRE production/
                           PRE reference-data/
                           PRE validation-data/
                           PRE wgs-accreditation-prod/

Any bam files in the cohort-* directories should be sent to Intelligent Tiering 'Archive Instant'.

@alexiswl alexiswl added the feature New feature label Nov 25, 2024
@alexiswl
Member Author

@skanwal

@mmalenic
Member

Do these come in as Standard by default, or are they placed into Intelligent Tiering with the default frequent access, and then need to be moved into Archive Instant? Seems like a lot of bams are already in Intelligent Tiering.
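For reference, one way to spot-check how objects under a cohort prefix currently land (a sketch only; the bucket and prefix are taken from the listing above, and note that for objects already in Intelligent-Tiering the API reports the storage class but not which access tier within it they currently sit in):

```sh
# List .bam objects under one cohort prefix together with their storage class
# (e.g. STANDARD vs INTELLIGENT_TIERING). Prefix chosen for illustration only.
aws s3api list-objects-v2 \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --prefix byob-icav2/cohort-apgi-prod/ \
  --query 'Contents[?ends_with(Key, `.bam`)].[Key, StorageClass]' \
  --output text
```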

@victorskl
Member

Please run this past Flo to check. Is this managed by the bucket lifecycle? @reisingerf

@reisingerf
Member

The bucket (byob prefix) is configured to push everything into IT (see here).

AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)

@victorskl
Member

Hang on.

I recall from early discussions that we wished not to put any objects in an operational-ready store (like the pipeline-cache bucket) into archive tier classes, but rather to move them into a dedicated archive bucket.

Do we change this view now, given the complexity of the current situation?
Let us catch up again, please.

@mmalenic
Member

> AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)

Adding to this, I think we can specify Archive Access or Deep Archive Access, but then it's no longer instant-ish retrieval, and it needs to be restored. But yeah, it looks like the other tiers are handled automatically by AWS.
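For context, opting into those tiers is a bucket-level Intelligent-Tiering configuration rather than something set per object; a sketch (the ID is illustrative, and 90/180 days are the minimum thresholds AWS allows):

```sh
# Enable the optional Archive Access / Deep Archive Access tiers for the bucket.
# Objects moved into these tiers must be restored before they can be read.
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --id archive-tiers-example \
  --intelligent-tiering-configuration '{
    "Id": "archive-tiers-example",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90,  "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'
```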

@alexiswl
Member Author

What about 'S3 Glacier Instant Retrieval', which we could force .bam files into after, say, one week?
This would have the same storage pricing as the Archive Instant Access tier ($5 per TB per month), but we wouldn't need to sit through 30 days of Standard-priced storage ($25 per TB per month) plus another 60 days in the Infrequent Access tier ($13 per TB per month) before reaching Archive Instant Access, which in any case is reset to Frequent Access as soon as the data is touched.

The data retrieval pricing of S3 Glacier Instant Retrieval is $0.03 per GB, so a 100 GB bam would cost $3 to retrieve.

The same bam would cost $5.10 for its first 90 days of storage on Intelligent Tiering (0.1 TB × $25 for the first month plus 0.1 TB × $13 for each of the next two months, i.e. $2.50 + $2.60).

@reisingerf
Member

All good points, but optimisations in my view...

Ultimately, I'd like to get to a point where we have different storage back-ends, with different retention / tiering options, and can choose between them based on use case (project, research, clinical, ... ) and potentially cost attribution.
I think the OrcaBus system can handle that, but it will take some time to get set up and automated.

Having said that: yes, for well-known use cases / projects, we could start by changing the lifecycle configuration and managing it per cohort/project prefix rather than for the whole BYOB share.
Note: we need to change and split the current setup (instead of just "overwriting" with more specific configurations).
See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-conflicts.html
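A sketch of what such a split could look like, assuming one rule per cohort prefix transitioning to Glacier Instant Retrieval after 7 days alongside the existing catch-all Intelligent-Tiering behaviour re-expressed per prefix. All prefixes, rule IDs and day counts here are illustrative. Two caveats: put-bucket-lifecycle-configuration replaces the bucket's entire lifecycle configuration, so every rule has to be submitted together in one document, and lifecycle filters cannot match a `.bam` suffix, only key prefixes, tags and object sizes.

```sh
# Illustrative lifecycle configuration: non-cohort data keeps the current
# Intelligent-Tiering behaviour, while one cohort prefix moves to
# Glacier Instant Retrieval (GLACIER_IR) after 7 days.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "production-to-intelligent-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "byob-icav2/production/"},
      "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}]
    },
    {
      "ID": "cohort-apgi-prod-to-glacier-ir",
      "Status": "Enabled",
      "Filter": {"Prefix": "byob-icav2/cohort-apgi-prod/"},
      "Transitions": [{"Days": 7, "StorageClass": "GLACIER_IR"}]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --lifecycle-configuration file://lifecycle.json
```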

@victorskl victorskl added question Further information is requested investigation Look into how best to approach the problem and removed feature New feature labels Dec 10, 2024
@victorskl
Member

Next

  • Need business process strategy to refine cohort data retention policy
