
For medium to large data dumps - where to host? #7

Open
owhite opened this issue Mar 13, 2018 · 9 comments
Labels
question Further information is requested

Comments

@owhite
Contributor

owhite commented Mar 13, 2018

We have several requests for data dumps. GitHub may have size limits on the data, and there could be other reasons it's impractical to store the data set here, but it would still be useful to have the data set versioned and documented in this system. What approach should we use?

@owhite added the question label Mar 13, 2018
@carlkesselman

carlkesselman commented Mar 13, 2018 via email

@cmungall

cmungall commented Mar 13, 2018

I know it's part of the plan handed down to data stewards to duplicate on the Amazon and Google clouds, but within GO and other projects we have been exploring some other options predating DC.

We like osf.io. OSF provides free storage and guarantees they will keep your data up for something like 25 years. It's super easy to distribute your data to OSF thanks to @tcb's group's CLI tool: https://github.com/dib-lab/osf-cli (it should be easy to combine with standards in the BDBag family, but we haven't got round to it yet). We're currently exploring use of OSF as a sustainable distribution solution as part of the Open Biomedical Ontologies Foundry. We're using it for MONDO at the moment and are very happy with it.
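For reference, a minimal workflow with that CLI tool looks roughly like the sketch below. The project ID `abc12`, file names, and paths are all placeholders, and credentials are assumed to be supplied via environment variables (check the osfclient docs for the exact auth options):

```shell
# Install the dib-lab CLI client for OSF
pip install osfclient

# Upload a dated release file to an OSF project
# (abc12 is a placeholder project ID, not a real project)
osf -p abc12 upload releases/mondo-2018-03-13.obo mondo/mondo-2018-03-13.obo

# List what the project currently serves
osf -p abc12 list

# Consumers can mirror the entire project locally
osf -p abc12 clone mondo-mirror/
```

Because uploads are plain files addressed by path, a dated-filename convention like the one above gives you a simple versioning scheme on top of OSF's storage.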

In GO, we're exploring a dual solution, with S3 as the primary distribution (behind a CloudFront layer) and OSF for archiving.
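A sketch of what that dual setup might look like with the AWS CLI. The bucket name, distribution ID, and prefix layout are hypothetical, not GO's actual configuration:

```shell
# Sync a dated release into the primary S3 distribution bucket,
# with a cache-control header so CloudFront can cache aggressively
aws s3 sync ./release/2018-03-13/ s3://example-data-bucket/releases/2018-03-13/ \
    --cache-control "public, max-age=86400"

# Repoint a "current" prefix at the latest release,
# then invalidate the CloudFront cache for that prefix
aws s3 sync ./release/2018-03-13/ s3://example-data-bucket/current/ --delete
aws cloudfront create-invalidation --distribution-id EXAMPLEID --paths "/current/*"
```

Dated prefixes plus a mutable `current/` pointer is one common way to get immutable, versioned releases out of S3 while keeping a stable URL for consumers.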

In other projects we're also looking at git annex plus OSF or archive.org. I agree with @carlkesselman that GitHub is not a good storage solution for large files, but git can be a fantastic tool for managing and versioning complex distributions of files. (A lot of people conflate git annex with git-lfs; git-lfs was IMHO a bit of a horrible experience, but git annex seems a lot better.)
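To make the git annex option concrete, here is a minimal sketch of versioning a large dump in git while parking the bytes in an S3 special remote. The repo, bucket, and file names are placeholders, and AWS credentials are assumed to be in the environment:

```shell
git init big-data-repo && cd big-data-repo
git annex init "release manager"

# Configure an S3 special remote as bulk storage
# (my-annex-bucket is a placeholder; see the git-annex S3 docs for encryption options)
git annex initremote s3store type=S3 bucket=my-annex-bucket encryption=none

# Track a large file: git records only a symlink plus metadata,
# while git-annex manages the actual content
git annex add big_dump.tsv.gz
git commit -m "Add March data dump"

# Push the content to S3; consumers later run `git annex get big_dump.tsv.gz`
git annex copy big_dump.tsv.gz --to s3store
```

This is what keeps the GitHub-hosted repo small: the history versions pointers and checksums, not the multi-gigabyte payloads.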

@ianfoster
Contributor

ianfoster commented Mar 13, 2018 via email

@jmcmurry
Contributor

All: when you comment on an issue via email, please first remove all quoted text and your email signature, especially if you do not want to be spammed.

@krobasky

FYI — I've used a FUSE-mounted S3 bucket from an AWS EC2 instance, and it will disappear during heavy compute loads, so take into consideration that S3-served data might need to be mirrored prior to computing over it. There may be faster data-delivery tiers for S3 that mitigate this problem; I haven't seen any, but I could do a deeper dive if anybody here thought it might be helpful.
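The mirror-first pattern described above is just a sync to instance-local storage before the job starts (bucket name and paths here are placeholders):

```shell
# Copy the dataset down to local/instance storage once, up front,
# instead of reading through a FUSE mount (s3fs, goofys, etc.) under load
aws s3 sync s3://example-dataset-bucket/images/ /mnt/scratch/images/

# ...then run the heavy compute against the local copy:
# my_analysis --input /mnt/scratch/images/
```

The trade-off is paying the transfer time and local disk cost once, in exchange for not having reads stall or the mount drop mid-computation.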

@jmcherry-zz

You all decide; MODs can do whatever.

@Alastair-Thomson-NHLBI

So, how about we create a writeable bucket on AWS for this? Or maybe two: one for temporary storage and one for persistent storage?

@clarisca

clarisca commented Sep 6, 2018

@AlastairThomson: I think this is a great idea, in particular for additional data sets that we may want to store in the cloud for data integration activities. Which entity would be responsible for managing this storage, deciding which data sets are registered, etc.?

@ashokohio

@AlastairThomson, @clarisca: as it turns out, we are facing both temporary and persistent storage questions on STAGE for the COPDGene image data analysis. Deep learning on a large group of images will likely need large temporary storage; at the same time, there may be many Monte Carlo-type runs whose final results may have to go into persistent storage for later analysis. Performance is also a concern, so we are considering EFS on AWS, but that costs much more.

10 participants