How do we version the dataset? #126
This looks promising: https://dvc.org/doc/user-guide/data-management/managing-external-data (or the simpler alternatives suggested on that page).
Related conversations with Erin Chu (Jun 2022)

Shantanu: How should we acknowledge RODA in future papers, including the one attached? In the past, we've cited a paper linked to the resource. E.g., we deposited images in IDR for this paper and said: "We deposited raw and illumination-corrected images to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0080 (Williams et al., 2017)." Williams et al., 2017 is the paper that announced IDR. We are happy to add you to the acknowledgments, of course, but I thought it would be a lot more substantial to cite an actual DOI of some sort associated with RODA. Maybe you could consider adding a CITATION.md file to https://github.com/awslabs/open-data-registry and you will be all set (and cited!). But maybe it's only we academics who care about that sort of stuff :)

Erin Chu: I LOVE the idea of adding citation information to GH; I do get asked this somewhat commonly. It might become a
For your reference, we prefer that people use the Registry URL, for example:
This could change in the future as we're considering adding DOIs to datasets for better citability (what are your thoughts on this?), but in the meantime please use the above language, always citing the Registry URL.

Shantanu: Thanks for clarifying how to cite RODA. We will go with "Data are available at registry.opendata.aws/cell-painting" in most cases until we've figured out DOIs. Regarding DOIs – it definitely seems the way to go. I can't speak for all of RODA, but for roda/cell-painting it would be much more useful if we could have separate DOIs for datasets within roda/cell-painting. In this context, it's worth considering the "data flow" I had in mind. For each new Cell Painting dataset that we plan to make public, we will:
Ideally, there would be a single DOI that somehow links 2, 3, and 4, but I think that will end up being too complicated. We can instead skip DOIs for the BBBC entry (#2), have IDR (#3) generate their DOIs using their own process, and then just create a new process for creating RODA (#4) DOIs. IDR can then include the RODA DOI as metadata (like they already do for publications – see the panel on the right of the page at https://doi.org/10.17867/10000153).
Discussion outcome:
Turns out Synapse might be a good option for our needs here: https://www.perplexity.ai/page/comparing-synapse-and-zenodo-Yo3npXDzSqSFEOf9Ocln3g

Update (July 26, 2024): I nixed this idea because it doesn't have any advantage over Zenodo, given that we plan to use manifest files (see next comment).
@afermg and I discussed that using manifest files to version components of the JUMP dataset is the simplest route. For example, for the "assembled" data (batch-corrected, single large Parquet file per modality), we will create a CSV file that points to the version of the data that we currently recommend using; this file will be versioned using Zenodo. A script within the repository will produce the CSV file, and a GitHub Action will automate uploading new versions to Zenodo, which will create human-readable version numbers.

This does make things a bit fragmented and non-uniform, because we may end up creating manifests that are not standard across datasets. However, this is exactly how we do it in publications – we version the specific data components we care about.

Note that because s3://cellpainting-gallery has object-level data versioning enabled, we trivially have access to versioning at that (per-object) level of granularity. h/t to @jessica-ewald, who talked me out of going down the rabbit hole of minting DOIs for each object.

We could achieve something similar using Quilt packages, but we didn't want to introduce new dependencies, given that the solution seems relatively straightforward. Still, we should keep Quilt in mind in case we find ourselves adding more "features" to this system of creating manifests.
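As a rough sketch, the manifest-producing script could be as small as this. The column names (`component`, `s3_uri`, `s3_version_id`, `recommended`) and the example values are hypothetical illustrations, not the actual JUMP manifest schema:

```python
import csv
import io

# Hypothetical manifest schema: each row pins one data component to an
# exact S3 object version. These column names are illustrative only.
FIELDS = ["component", "s3_uri", "s3_version_id", "recommended"]


def build_manifest(rows):
    """Render manifest rows as CSV text (the artifact versioned on Zenodo)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Example with a made-up path and version id.
    manifest = build_manifest([
        {
            "component": "assembled_profiles",
            "s3_uri": "s3://cellpainting-gallery/example/profiles.parquet",
            "s3_version_id": "3HL4kqtJlcpXroDTDmJ",
            "recommended": "true",
        },
    ])
    print(manifest)
```

The CSV output is what would be committed and pushed to Zenodo by the GitHub Action; because the file is tiny, every recommended-version change is cheap to archive.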
I'll add notes here about how we've created a citable DOI for https://github.com/jump-cellpainting/datasets as a whole.
I just wish there were some method to update a record created via this process. E.g., https://zenodo.org/records/12983164 was created when I cut the release https://github.com/jump-cellpainting/datasets/releases/tag/v0.6.0. I later updated the release notes on GitHub, but the original release notes that were copied over to https://zenodo.org/records/12983164 cannot be edited, IIUC.
Just a quick note: I've been trying to code a way to upload new versions of our profile_index.csv to Zenodo, but since they changed the API I'm unable to create new versions of existing datasets. I'm not the only one, by the looks of geneontology/pipeline#345. If we are not going to be producing new versions very often, I'd suggest just uploading them manually.
Oh, so you can create new datasets, but not update an existing dataset, using their API? But you can do so manually? So bizarre.
Actually, it's taken some work, but I think I found a way to do it. Some parts of the REST API work fine with curl (bash) and others work fine from Python. Combining both, we can get a functional way to automatically re-upload and re-version things :)
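For reference, a minimal Python-only sketch of the re-versioning flow against Zenodo's documented deposit endpoints. The deposition id is a placeholder, error handling is omitted, and the token is read from the environment; this is an illustration of the endpoint shapes, not the exact script used here:

```python
import json
import os
import urllib.request

# Base URL of the Zenodo REST API.
ZENODO_API = "https://zenodo.org/api"


def new_version_url(deposition_id: int) -> str:
    """Endpoint that creates a draft new version of an existing deposition."""
    return f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/newversion"


def publish_url(deposition_id: int) -> str:
    """Endpoint that publishes a draft deposition."""
    return f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/publish"


def post(url: str, token: str) -> dict:
    """POST to a Zenodo action endpoint and decode the JSON response."""
    req = urllib.request.Request(f"{url}?access_token={token}", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only talk to Zenodo when a token is provided; 123456 is a
    # placeholder deposition id, not a real record.
    token = os.environ.get("ZENODO_TOKEN")
    if token:
        draft = post(new_version_url(123456), token)
        # The response links to the editable draft of the new version,
        # where updated files (e.g. a new profile_index.csv) can be attached
        # before publishing.
        print(draft["links"]["latest_draft"])
```

In practice, the file upload step in between (attaching the new manifest to the draft) is where the curl/Python split described above comes in.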
We do not plan to version the data for now because we haven't thought it through fully (see "Things to keep in mind")
Strawman plan
Level 3 and above data will be versioned using DVC in https://github.com/jump-cellpainting/datasets
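Under this strawman, the repository would hold only small DVC pointer files while the data itself lives in remote storage. A `.dvc` stub looks roughly like the following; the hash, size, and path here are made up for illustration:

```yaml
# Hypothetical .dvc pointer file for a Level-3 profiles file.
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 104857600
  path: profiles_level3.parquet
```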
Counter-argument
Things to keep in mind
All the data will be released with CC0 1.0 Universal (CC0 1.0). However, please cite the appropriate resources/publications, listed below, when citing individual datasets. For example,