
For medium to large data dumps - where to host? #7

Open
owhite opened this issue Mar 13, 2018 · 9 comments
Labels
question Further information is requested

Comments

@owhite
Contributor

owhite commented Mar 13, 2018

We have several requests for data dumps. GitHub may have size limits on the data, and there could be other reasons it's impractical to store the data set here, but it would still be useful to have the data set versioned and documented in this system. What approach should we use?

@owhite added the question label Mar 13, 2018
@carlkesselman

carlkesselman commented Mar 13, 2018 via email

@cmungall

cmungall commented Mar 13, 2018

I know it's part of the plan handed down to data stewards to duplicate on the Amazon and Google clouds, but within GO and other projects we have been exploring some other options predating DC.

We like osf.io. OSF provides free storage and guarantees they will keep your data up for something like 25 years. It's super easy to distribute your data to OSF thanks to @tcb's group's CLI tool: https://github.com/dib-lab/osf-cli (it should be easy to combine with standards in the BDBag family, but we haven't got round to it yet). We're currently exploring use of OSF as a sustainable distribution solution as part of the Open Biomedical Ontologies Foundry. We're using it for MONDO at the moment and are very happy with it.
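For reference, a minimal workflow with that CLI tool looks roughly like the sketch below. The project ID `abc12`, file names, and paths are all placeholders, and credentials are assumed to be supplied via environment variables (check the osfclient docs for the exact auth options):

```shell
# Install the dib-lab CLI client for OSF
pip install osfclient

# Upload a dated release file to an OSF project
# (abc12 is a placeholder project ID, not a real project)
osf -p abc12 upload releases/mondo-2018-03-13.obo mondo/mondo-2018-03-13.obo

# List what the project currently serves
osf -p abc12 list

# Consumers can mirror the entire project locally
osf -p abc12 clone mondo-mirror/
```

Because uploads are plain files addressed by path, a dated-filename convention like the one above gives you a simple versioning scheme on top of OSF's storage.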

In GO, we're exploring a dual solution, with S3 as the primary distribution (behind a CloudFront layer) and OSF for archiving.
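A sketch of what that dual setup might look like with the AWS CLI. The bucket name, distribution ID, and prefix layout are hypothetical, not GO's actual configuration:

```shell
# Sync a dated release into the primary S3 distribution bucket,
# with a cache-control header so CloudFront can cache aggressively
aws s3 sync ./release/2018-03-13/ s3://example-data-bucket/releases/2018-03-13/ \
    --cache-control "public, max-age=86400"

# Repoint a "current" prefix at the latest release,
# then invalidate the CloudFront cache for that prefix
aws s3 sync ./release/2018-03-13/ s3://example-data-bucket/current/ --delete
aws cloudfront create-invalidation --distribution-id EXAMPLEID --paths "/current/*"
```

Dated prefixes plus a mutable `current/` pointer is one common way to get immutable, versioned releases out of S3 while keeping a stable URL for consumers.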

In other projects we're also looking at git annex plus OSF or archive.org. I agree with @carlkesselman that GitHub is not a good storage solution for large files, but git can be a fantastic tool for managing and versioning complex distributions of files. (A lot of people conflate git annex with git-lfs; git-lfs was IMHO a bit of a horrible experience, but git annex seems a lot better.)
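To make the git annex option concrete, here is a minimal sketch of versioning a large dump in git while parking the bytes in an S3 special remote. The repo, bucket, and file names are placeholders, and AWS credentials are assumed to be in the environment:

```shell
git init big-data-repo && cd big-data-repo
git annex init "release manager"

# Configure an S3 special remote as bulk storage
# (my-annex-bucket is a placeholder; see the git-annex S3 docs for encryption options)
git annex initremote s3store type=S3 bucket=my-annex-bucket encryption=none

# Track a large file: git records only a symlink plus metadata,
# while git-annex manages the actual content
git annex add big_dump.tsv.gz
git commit -m "Add March data dump"

# Push the content to S3; consumers later run `git annex get big_dump.tsv.gz`
git annex copy big_dump.tsv.gz --to s3store
```

This is what keeps the GitHub-hosted repo small: the history versions pointers and checksums, not the multi-gigabyte payloads.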

@ianfoster
Contributor

ianfoster commented Mar 13, 2018 via email

@jmcmurry
Contributor

All: when you comment on an issue via email, please first remove all quoted text and your email signature, especially if you do not want to be spammed.

@krobasky

FYI — I've used a FUSE-mounted S3 bucket from an AWS EC2 instance, and it will disappear during heavy compute loads, so take into consideration that S3-served data might need to be mirrored prior to computing over it. There may be faster data-delivery tiers for S3 that mitigate this problem; I haven't seen any, but I could do a deeper dive if anybody here thought it might be helpful.
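The mirror-first pattern described above is just a sync to instance-local storage before the job starts (bucket name and paths here are placeholders):

```shell
# Copy the dataset down to local/instance storage once, up front,
# instead of reading through a FUSE mount (s3fs, goofys, etc.) under load
aws s3 sync s3://example-dataset-bucket/images/ /mnt/scratch/images/

# ...then run the heavy compute against the local copy:
# my_analysis --input /mnt/scratch/images/
```

The trade-off is paying the transfer time and local disk cost once, in exchange for not having reads stall or the mount drop mid-computation.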

@jmcherry-zz

You all decide; MODs can do whatever.

@Alastair-Thomson-NHLBI

So, how about we create a writeable bucket on AWS for this? Or maybe two: one for temporary storage and one for persistent storage?

@clarisca

clarisca commented Sep 6, 2018

@AlastairThomson: I think this is a great idea, in particular for additional data sets that we may want to store in the cloud for data integration activities. Which entity would be responsible for managing this storage, deciding which data sets are registered, etc.?

@ashokohio

@AlastairThomson, @clarisca: as it turns out, we are facing both temporary and persistent storage questions on STAGE for the COPDGene image data analysis. Deep learning on a large group of images will likely need large temporary storage; at the same time, there may be many Monte Carlo-type runs whose final results may have to go into persistent storage for later analysis. Performance is also a concern, so we are considering EFS on AWS, but that costs much more.

10 participants