-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: HCA DCP Bundle Types and Definitions #93
Conversation
|:-|:-|:-|:-| | ||
| Primary bundle | A bundle that contains primary (raw) data files and all related metadata files. | computational biologists; Data Processing Pipelines; Data Browser; Query Service | 1 | | ||
| Secondary bundle | A bundle that contains secondary (alignment, expression) data files produced by the Data Processing Pipelines, input primary data files, and all related metadata files. | computational biologists; Data Browser; Query Service; Matrix Service | 2, 3, 4(?) | | ||
| Tertiary bundle* | A bundle that contains tertiary (expression) data files produced by the Matrix Service, input primary and secondary data files, and all related metadata files. | computational biologists; biologists; Data Browser; Query Service | 3, 4 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "tertiary (expression)" is unnecessarily confusing. Tertiary analysis can be any meta-analysis, it should not be constrained to be "expression", which itself is a confusing shorthand for "expression matrix".
I propose renaming:
- "Secondary bundle" -> "Analysis bundle"
- "Tertiary bundle" -> "Expression Matrix bundle"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with renaming secondary and tertiary (in fact, that's the point of this RFC!). If we are renaming those to be nouns, we should also rename the primary bundle, perhaps:
- "Primary bundle" -> "Raw (or Primary) data bundle"
- "Secondary bundle" -> "Analysis bundle"
- "Tertiary bundle" -> "Expression Matrix bundle"
To clarify, "Secondary/Analysis bundles" also contain expression data (gene x cell count matrices) in addition to alignment results, so I wonder if it's confusing to have these results not in the "Expression Matrix bundle".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Raw" doesn't seem very meaningful. Someone looking at gene expression (most likely the majority of users, might think of unnormalized counts as "raw".
"Matrix" is a type of data structured, so it seems odd to include it in the name. Kind of like calling sequence experiments "fastaq bundles". How about "gene expression bundle"
Analysis doesn't have much meaning either. I was kind of thinking gene expression is the result of analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gene expression matrices are a result of both pipelines and the matrix service (pipelines make them, matrix service acts on them for users).
rfcs/text/0000-bundle-definition.md
Outdated
| Tertiary bundle* | A bundle that contains tertiary (expression) data files produced by the Matrix Service, input primary and secondary data files, and all related metadata files. | computational biologists; biologists; Data Browser; Query Service | 3, 4 | | ||
| Project bundle* | A bundle that contains project-level metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes, or files. | computational biologists; biologists; | ? | | ||
| Resource/Reference bundle* | A bundle that contains reference data files used by the Data Processing Pipelines(?) and all related metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes(, or projects?). | computational biologists; Data Processing Pipelines; | 2, 3, 5 | | ||
| DAPS bundle* | ? | ? | ? | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be a link here to #88, indicating that the term and concept of a DAPS bundle is subject to discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I have added a reference.
rfcs/text/0000-bundle-definition.md
Outdated
|
||
*Acceptance criteria are the conditions that a RFC must satisfy to be accepted by users or other stakeholders.* | ||
|
||
1. Data consumers can make use of data from the DCP without knowing what a bundle is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this RFC is still in draft status, but I wanted to note here that this AC doesn't seem to fit in with the rest of the document, which focuses on defining bundle types rather than hiding bundle details from data consumers. Maybe it can be re-worded or removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @samanehsan, I agree it's not addressed by the RFC in its current form. I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me the AC above is a valid constraint on the solution, that it does not further bake in the bundle concept to the DCP interface. Users care about releases, the files in them, and the processes used to create the files.
We should be careful not to force hundereds/thousands of users to also learn our "bundle types" when we already have metadata describing the processes that creates the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NoopDog That may be true, but the content of this RFC as it stands does not change the status quo in that respect nor pass judgment on how data is presented to the user.
More generally, any advocate for removing bundles from the presentation layer should come up with a concrete, actionable proposal (perhaps in another RFC) for what to replace them with.
It may be more useful to tag bundles by the process that created them rather than the type of data they contain. The type of data can be inferred from the process and the process is more descriptive. I have a few more notes about this here: |
@NoopDog how would this tagging look like in practice, and what would a subscription query look like to specifically match such bundles? |
| Project bundle* | A bundle that contains project-level metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes, or files. | computational biologists; biologists; | ? | | ||
| Resource/Reference bundle* | A bundle that contains reference data files used by the Data Processing Pipelines(?) and all related metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes(, or projects?). | computational biologists; Data Processing Pipelines; | 2, 3, 5 | | ||
| DAPS bundle* | Definition discussion ongoing in [PR #88](https://github.com/HumanCellAtlas/dcp-community/pull/88). | ? | ? | | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why the above bundles were defined in this manner other than this is de facto, but I am not sure that is the intent of this RFC, or at least does not feel like a formal understanding of bundles but more of a documentation of what we are using (without comment or systematic definitions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyTickle I believe that is the intent. I don't think we need to "formally understand bundles". We need to define bundle types and annotate them with enough information for application developers to easily reason about them, so they can develop applications and present data to users with reasonable certainty.
rfcs/text/0000-bundle-definition.md
Outdated
Describe here... | ||
|
||
- Will contain project-level metadata only. | ||
- Unclear how this bundle will be used by data consumers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about internal users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, great point. Added "and DCP services" to the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the way this RFC is heading and the kinds of questions being asked here but I do not think this RFC answers those questions or is even complete.
Happy to talk if you feel I can be helpful.
|
||
> Asterisk (*) indicates bundles types that do not currently exist in the DCP but have been discussed as potential future bundle types to support data consumers. | ||
|
||
### New bundle types and their definitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new bundle types and definitions are not complete but I would love to review them when they are. Feel free to reach out.
Please note: I have merged the contents of #86 into this PR; please refer to the discussion in that PR for additional detail and comments addressed from reviewers. |
I create issue #114 to get media types turned into a real RFC. That is good enough to not impact this PR. |
As a refinement and extension of this RFC I have created an alternate RFC linked here : RFC: HCA DCP Application Layer Bundle Types and Definitions . The alternate proposal differs from this RFC mainly in that:
Your review/feedback is appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks much better, and the definitions feel less specific to use and a little more generalized and reusable (potentially less over-fit).
Pausing to merge with #119 and align with upcoming reproducibility and data citation requirements. |
Obsolete |
Please note: the contents of #86 were merged into this PR; please refer to the discussion in that PR for additional detail and comments addressed from reviewers.