How do we version the dataset? #126

Open
shntnu opened this issue Oct 16, 2022 · 11 comments

@shntnu
Contributor

shntnu commented Oct 16, 2022

We do not plan to version the data for now because we haven't thought it through fully (see "Things to keep in mind")

Strawman plan

Level 3 and above data will be versioned using DVC in https://github.com/jump-cellpainting/datasets

Counter-argument

  • But even just Level 3 and above will be > 100 GB!
    • Each Level 3 file is 18 MB and we will have ~2700 such files, which comes to ~50 GB
    • Level 4a and 4b will probably be another 50 GB
    • Maybe we should version only Level 4b, and do so as a single collated parquet file
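
The size estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the per-file size and file count are the approximate figures quoted in the list, the Level 4 total is the rough guess given there, and decimal units (1 GB = 1000 MB) are an assumption.

```python
# Back-of-the-envelope storage estimate using the approximate figures quoted above.
LEVEL3_FILE_MB = 18      # approximate size of one Level 3 file
N_LEVEL3_FILES = 2700    # approximate number of Level 3 files
LEVEL4_TOTAL_GB = 50     # rough guess for Level 4a + 4b combined


def level3_total_gb(file_mb: float = LEVEL3_FILE_MB, n_files: int = N_LEVEL3_FILES) -> float:
    """Total Level 3 size in GB, assuming decimal units (1 GB = 1000 MB)."""
    return file_mb * n_files / 1000


total_gb = level3_total_gb() + LEVEL4_TOTAL_GB
print(f"Level 3: {level3_total_gb():.1f} GB, Level 3 + Level 4: {total_gb:.1f} GB")
```

With these inputs the total lands at roughly 100 GB, which is why versioning only a single collated Level 4b file is attractive.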

Things to keep in mind

All the data will be released under CC0 1.0 Universal (CC0 1.0). However, please cite the appropriate resources/publications, listed below, when using individual datasets. For example,

We used the dataset cpg0000 (Chandrasekaran et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

  • But if we create a Zenodo entry, we will want people to cite that resource instead (below, Natoli et al., 2021), while still citing the corresponding paper (below, Way et al., 2022) and RODA (below, "cpg0004, available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/)")

We used the dataset cpg0004 (Way et al., 2022; Natoli et al., 2021), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

  • A simple tack would be to cite only the Zenodo DOI, but that would mean we would no longer cite the Cell Painting Gallery nor the paper, and that's undesirable (we want to credit both)
@shntnu
Contributor Author

shntnu commented Feb 7, 2023

Related conversation with Erin Chu, June 2022

Shantanu:

How should we acknowledge RODA in future papers, including the one attached? In the past, we've cited a paper linked to the resource. For example, we deposited images in IDR for this paper and said: "We deposited raw and illumination-corrected images to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0080 (Williams et al., 2017)." Williams et al., 2017 is the paper that announced IDR. We are happy to add you to the acknowledgments of course, but I thought it would be a lot more substantial to cite an actual DOI of some sort associated with RODA. Maybe you could consider adding a CITATION.md file to https://github.com/awslabs/open-data-registry and you will be all set (and cited!). But maybe it's only we academics who care about that sort of stuff :)

Erin Chu:

I LOVE the idea of adding citation information to GH; I do get asked this somewhat commonly. It might become a For your reference, we prefer that people use the Registry URL, for example:

"Data are available at registry.opendata.aws/cell-painting."
"The Broad Cell Painting Collection was accessed on January 3rd, 2022 from registry.opendata.aws/cell-painting."

This could change in the future as we're considering adding DOIs to datasets for better citability (what are your thoughts on this?), but in the meantime please use the above language, always citing the Registry URL.

Shantanu:

Thanks for clarifying how to cite RODA. We will go with "Data are available at registry.opendata.aws/cell-painting" in most cases until we've figured out DOIs.

Regarding DOIs: it definitely seems the way to go. I can't speak for all of RODA, but for roda/cell-painting, it would be much more useful if we could have separate DOIs for datasets within roda/cell-painting.

In this context, it's worth considering the "data flow" I had in mind.

For each new Cell Painting dataset that we plan to make public, we will
(Update Feb 2023: our current process is here https://github.com/broadinstitute/cellpainting-gallery/blob/main/.github/ISSUE_TEMPLATE/data-immediately-public.md)

  1. add a row to https://broad.io/profiling_dataset, a spreadsheet we've been maintaining (updated sporadically right now, but will be more regular once we have streamlined the whole process)
  2. upload all components of the dataset to s3://cellpainting-gallery (images + processed data)
  3. create a page in BBBC, which will have all the narrative around it (e.g. https://bbbc.broadinstitute.org/BBBC021) [Update Feb 2022: we decided not to include this in our process]
  4. submit the dataset to IDR (e.g. http://idr.openmicroscopy.org/webclient/?show=screen-2001 is the IDR entry corresponding to BBBC021; the metadata is available on GitHub at https://github.com/IDR/idr-metadata/tree/master/idr0035-caie-drugresponse and is then ingested into IDR. They now have a different workflow: they create a separate repo for each dataset, e.g. search for "idr0080" on the page https://github.com/IDR/idr-metadata and it will point you to the repo for that dataset. Each IDR dataset now has its own DOI, e.g. https://doi.org/10.17867/10000153)

Ideally, there would be a single DOI that somehow links #2, #3, and #4, but I think that will end up being too complicated. We can instead skip DOIs for the BBBC entry (#3), have IDR (#4) generate their DOIs using their own process, and then just create a new process for creating RODA (#2) DOIs. IDR can then include the RODA DOI as metadata (like they already do for publications; see the panel to the right of the screen on https://doi.org/10.17867/10000153, screenshot below)

@bethac07
Contributor

bethac07 commented Feb 9, 2023

Discussion outcome -

We used cpg0016 {Chandrasekaran 2023|ZenodoDOI} hosted at AWS Registry of Open Data.

@bethac07
Contributor

bethac07 commented Feb 9, 2023

One additional nice thing with Zenodo: you can add a LARGE list of "alternate identifiers", with an even longer list of options for how each alternate identifier relates to the Zenodo object. So linking the Zenodo record to the paper, the IDR entry, the RODA page, and whatever else should be straightforward.

I added, for example, the bioRxiv DOI to the Zenodo archiving of the Nat Prot paper protocol repo.

https://zenodo.org/record/7267354#.Y-UhZezMI0Q

@shntnu
Contributor Author

shntnu commented Jul 10, 2024

Turns out Synapse might be a good option for our needs here https://www.perplexity.ai/page/comparing-synapse-and-zenodo-Yo3npXDzSqSFEOf9Ocln3g

Update July 26, 2024: I nixed this idea because it doesn't have any advantage over Zenodo, given that we plan to use manifest files (see next comment)

@shntnu
Contributor Author

shntnu commented Jul 26, 2024

@afermg and I discussed that using manifest files to version components of the JUMP dataset is the simplest route.

For example, for the "assembled" data (batch corrected, single large parquet file per modality), we will create a CSV file that points to the version of the data that we currently recommend using; this file will be versioned using Zenodo. A script within the repository will produce the CSV file, and a GitHub Action will automate the process of uploading new versions to Zenodo, which will create human-readable version numbers.
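
As an illustration of the manifest idea, here is a minimal sketch of a script that writes such a CSV. The column names (`subset`, `url`, `version_id`), the example URLs, and the function name `build_manifest` are all hypothetical; the thread does not specify the actual schema.

```python
import csv
import io

# Hypothetical column names; the real manifest schema is not given in this thread.
FIELDS = ["subset", "url", "version_id"]


def build_manifest(rows: list[dict]) -> str:
    """Return CSV text with one row per recommended file, sorted for stable diffs."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["subset"]):
        writer.writerow(row)
    return buf.getvalue()


# Placeholder entries for illustration only.
rows = [
    {"subset": "orf", "url": "https://example.org/orf.parquet", "version_id": "v2"},
    {"subset": "compound", "url": "https://example.org/compound.parquet", "version_id": "v1"},
]
manifest = build_manifest(rows)
print(manifest)
```

Sorting the rows keeps the file byte-identical when nothing has changed, so a new Zenodo version is only needed when a recommended file actually changes.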

This does make things a bit fragmented and non-uniform because we may end up creating manifests that are not standard across datasets. However, this is exactly how we do it in publications – we version specific data components we care about.

Note that because s3://cellpainting-gallery has object-level data versioning enabled, we trivially have access to versioning at that level (per object) of granularity.
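
To make the per-object versioning concrete: S3 lets a GET request pin a specific object version through the `versionId` query parameter. The sketch below only builds such a URL. The object key and version ID are made-up placeholders, and whether anonymous reads of non-current versions are permitted on this bucket is an assumption that would need checking.

```python
from urllib.parse import quote, urlencode

BUCKET = "cellpainting-gallery"


def versioned_url(key: str, version_id: str, bucket: str = BUCKET) -> str:
    """Build a virtual-hosted-style S3 URL that pins one version of an object."""
    base = f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
    return f"{base}?{urlencode({'versionId': version_id})}"


# Hypothetical key and version ID, for illustration only.
url = versioned_url("cpg0000-example/profiles.parquet", "abc123")
print(url)
```

A manifest row can store a URL like this (or the key plus version ID) so that the manifest keeps pointing at the same bytes even if the object is later overwritten.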

h/t to @jessica-ewald who talked me out of going down the rabbit-hole of minting DOIs for each object.


We can achieve something similar using Quilt packages, but we didn't want to introduce new dependencies given that the solution seems relatively straightforward. Still, we should keep Quilt in mind in case we find ourselves adding more "features" to this system of creating manifests.

@shntnu
Contributor Author

shntnu commented Jul 26, 2024

I'll add notes here about how we've created a citable DOI for https://github.com/jump-cellpainting/datasets as a whole.

I just wish there was some method to update a record created via this process. E.g. https://zenodo.org/records/12983164 was created when I cut the release https://github.com/jump-cellpainting/datasets/releases/tag/v0.6.0. I later updated the release notes, but the original release notes that were copied over to https://zenodo.org/records/12983164 cannot be edited IIUC.

@afermg
Collaborator

afermg commented Jul 26, 2024

Just a quick note: I've been trying to code a way to update versions for our profile_index.csv to Zenodo but since they changed the API I'm unable to create new versions of existing datasets. I'm not the only one, by the looks of geneontology/pipeline#345. If we are not going to be producing new versions very often, I'd suggest just uploading them manually.

@shntnu
Contributor Author

shntnu commented Jul 26, 2024

I've been trying to code a way to update versions for our profile_index.csv to Zenodo but since they changed the API I'm unable to create new versions of existing datasets.

Oh so you can create new datasets, but not update an existing dataset, using their API? But you can do so manually?

So bizarre

@afermg
Collaborator

afermg commented Jul 26, 2024

Actually, it's taken some work but I think I found a way to do it. Some parts of the REST API work fine with curl (bash) and others work fine with Python. Combining both, we can get a functional way to automatically re-upload and re-version things :).
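
For readers unfamiliar with the flow being discussed, here is a rough sketch of the "new version" sequence as described in Zenodo's documented deposit API (create a draft via `actions/newversion`, upload to the draft's bucket, publish). It is not the implementation referred to above. The function name `publish_new_version` and the injected `send` transport are hypothetical, and, as this thread reports, parts of the API may not behave as documented, so treat it as a starting point only.

```python
from typing import Callable, Optional

ZENODO_API = "https://zenodo.org/api"

# Hypothetical transport hook: send(method, url, body) -> parsed JSON response.
# A real one would attach the access token and perform the HTTP request.
Send = Callable[[str, str, Optional[bytes]], dict]


def publish_new_version(send: Send, deposition_id: int, filename: str, content: bytes) -> dict:
    """Create a new version of an existing deposition, upload one file, publish."""
    # 1. Ask Zenodo for a draft that is a new version of the existing record.
    created = send("POST", f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/newversion", None)
    # 2. Fetch the draft to learn its bucket and publish links.
    draft = send("GET", created["links"]["latest_draft"], None)
    # 3. Upload the file into the draft's bucket.
    send("PUT", f"{draft['links']['bucket']}/{filename}", content)
    # 4. Publish the draft, which mints the new version.
    return send("POST", draft["links"]["publish"], None)


# Dry run against a fake transport that records calls instead of touching the network.
calls = []


def fake_send(method: str, url: str, body: Optional[bytes]) -> dict:
    calls.append((method, url))
    if url.endswith("/actions/newversion"):
        return {"links": {"latest_draft": f"{ZENODO_API}/deposit/depositions/2"}}
    if method == "GET":
        return {"links": {"bucket": f"{ZENODO_API}/files/bucket-id",
                          "publish": f"{ZENODO_API}/deposit/depositions/2/actions/publish"}}
    return {"ok": True}


result = publish_new_version(fake_send, 1, "profile_index.csv", b"subset,url\n")
```

A draft created this way may also need its metadata (for example the publication date) updated, and files inherited from the previous version removed, before publishing; those steps are omitted here.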

@shntnu
Contributor Author

shntnu commented Aug 2, 2024

For our notes: @afermg has now implemented this versioning strategy: #121

I will keep this issue open for a bit in case we want to discuss this topic further.

@shntnu shntnu transferred this issue from another repository Oct 3, 2024