Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: HCA DCP Bundle Types and Definitions #93

Closed
wants to merge 29 commits into from

Conversation

malloryfreeberg
Copy link
Member

@malloryfreeberg malloryfreeberg commented Jul 29, 2019

Please note: the contents of #86 were merged into this PR; please refer to the discussion in that PR for additional detail and comments addressed from reviewers.

|:-|:-|:-|:-|
| Primary bundle | A bundle that contains primary (raw) data files and all related metadata files. | computational biologists; Data Processing Pipelines; Data Browser; Query Service | 1 |
| Secondary bundle | A bundle that contains secondary (alignment, expression) data files produced by the Data Processing Pipelines, input primary data files, and all related metadata files. | computational biologists; Data Browser; Query Service; Matrix Service | 2, 3, 4(?) |
| Tertiary bundle* | A bundle that contains tertiary (expression) data files produced by the Matrix Service, input primary and secondary data files, and all related metadata files. | computational biologists; biologists; Data Browser; Query Service | 3, 4 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "tertiary (expression)" is unnecessarily confusing. Tertiary analysis can be any meta-analysis, it should not be constrained to be "expression", which itself is a confusing shorthand for "expression matrix".

I propose renaming:

  • "Secondary bundle" -> "Analysis bundle"
  • "Tertiary bundle" -> "Expression Matrix bundle"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with renaming secondary and tertiary (in fact, that's the point of this RFC!). If we are renaming those to be nouns, we should also rename the primary bundle, perhaps:

  • "Primary bundle" -> "Raw (or Primary) data bundle"
  • "Secondary bundle" -> "Analysis bundle"
  • "Tertiary bundle" -> "Expression Matrix bundle"

To clarify, "Secondary/Analysis bundles" also contain expression data (gene x cell count matrices) in addition to alignment results, so I wonder if it's confusing to have these results not in the "Expression Matrix bundle".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Raw" doesn't seem very meaningful. Someone looking at gene expression (most likely the majority of users, might think of unnormalized counts as "raw".

"Matrix" is a type of data structured, so it seems odd to include it in the name. Kind of like calling sequence experiments "fastaq bundles". How about "gene expression bundle"

Analysis doesn't have much meaning either. I was kind of thinking gene expression is the result of analysis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gene expression matrices are a result of both pipelines and the matrix service (pipelines make them, matrix service acts on them for users).

| Tertiary bundle* | A bundle that contains tertiary (expression) data files produced by the Matrix Service, input primary and secondary data files, and all related metadata files. | computational biologists; biologists; Data Browser; Query Service | 3, 4 |
| Project bundle* | A bundle that contains project-level metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes, or files. | computational biologists; biologists; | ? |
| Resource/Reference bundle* | A bundle that contains reference data files used by the Data Processing Pipelines(?) and all related metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes(, or projects?). | computational biologists; Data Processing Pipelines; | 2, 3, 5 |
| DAPS bundle* | ? | ? | ? |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a link here to #88, indicating that the term and concept of a DAPS bundle is subject to discussion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I have added a reference.

@kislyuk kislyuk mentioned this pull request Jul 30, 2019

*Acceptance criteria are the conditions that a RFC must satisfy to be accepted by users or other stakeholders.*

1. Data consumers can make use of data from the DCP without knowing what a bundle is.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this RFC is still in draft status, but I wanted to note here that this AC doesn't seem to fit in with the rest of the document, which focuses on defining bundle types rather than hiding bundle details from data consumers. Maybe it can be re-worded or removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samanehsan, I agree it's not addressed by the RFC in its current form. I'll remove it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me the AC above is a valid constraint on the solution, that it does not further bake in the bundle concept to the DCP interface. Users care about releases, the files in them, and the processes used to create the files.

We should be careful not to force hundereds/thousands of users to also learn our "bundle types" when we already have metadata describing the processes that creates the files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NoopDog That may be true, but the content of this RFC as it stands does not change the status quo in that respect nor pass judgment on how data is presented to the user.

More generally, any advocate for removing bundles from the presentation layer should come up with a concrete, actionable proposal (perhaps in another RFC) for what to replace them with.

@NoopDog
Copy link

NoopDog commented Sep 11, 2019

It may be more useful to tag bundles by the process that created them rather than the type of data they contain. The type of data can be inferred from the process and the process is more descriptive.

I have a few more notes about this here:

#86 (comment)

@kislyuk kislyuk marked this pull request as ready for review September 11, 2019 16:50
@kislyuk
Copy link
Member

kislyuk commented Sep 11, 2019

@NoopDog how would this tagging look like in practice, and what would a subscription query look like to specifically match such bundles?

| Project bundle* | A bundle that contains project-level metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes, or files. | computational biologists; biologists; | ? |
| Resource/Reference bundle* | A bundle that contains reference data files used by the Data Processing Pipelines(?) and all related metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes(, or projects?). | computational biologists; Data Processing Pipelines; | 2, 3, 5 |
| DAPS bundle* | Definition discussion ongoing in [PR #88](https://github.com/HumanCellAtlas/dcp-community/pull/88). | ? | ? |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why the above bundles were defined in this manner other than this is de facto, but I am not sure that is the intent of this RFC, or at least does not feel like a formal understanding of bundles but more of a documentation of what we are using (without comment or systematic definitions).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyTickle I believe that is the intent. I don't think we need to "formally understand bundles". We need to define bundle types and annotate them with enough information for application developers to easily reason about them, so they can develop applications and present data to users with reasonable certainty.

Describe here...

- Will contain project-level metadata only.
- Unclear how this bundle will be used by data consumers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about internal users?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great point. Added "and DCP services" to the above.

Copy link
Contributor

@TimothyTickle TimothyTickle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way this RFC is heading and the kinds of questions being asked here but I do not think this RFC answers those questions or is even complete.

Happy to talk if you feel I can be helpful.


> Asterisk (*) indicates bundles types that do not currently exist in the DCP but have been discussed as potential future bundle types to support data consumers.

### New bundle types and their definitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new bundle types and definitions are not complete but I would love to review them when they are. Feel free to reach out.

@kislyuk kislyuk requested a review from lauraclarke September 16, 2019 16:48
@kislyuk
Copy link
Member

kislyuk commented Sep 16, 2019

Please note: I have merged the contents of #86 into this PR; please refer to the discussion in that PR for additional detail and comments addressed from reviewers.

rfcs/text/0000-bundle-definition.md Show resolved Hide resolved
rfcs/text/0000-bundle-definition.md Outdated Show resolved Hide resolved
@diekhans
Copy link
Contributor

diekhans commented Sep 16, 2019

I create issue #114 to get media types turned into a real RFC. That is good enough to not impact this PR.

@kislyuk
Copy link
Member

kislyuk commented Sep 16, 2019

@diekhans I already imported it here: #113

@NoopDog
Copy link

NoopDog commented Sep 25, 2019

As a refinement and extension of this RFC I have created an alternate RFC linked here : RFC: HCA DCP Application Layer Bundle Types and Definitions .

The alternate proposal differs from this RFC mainly in that:

  1. Type information is added to a new type.json metadata file rather than added to the DSS bundle.json file.

  2. Type information is expressed in JSON rather than in RFC 7231 media type syntax.

  3. The schema and bundle types are represented as JSON schema in the metadata schema repo and documented on the data portal rather than maintained in a DSS registry.

  4. The proposed types are refined by making the process and protocol that created the bundle explicit in the type.json.

Your review/feedback is appreciated.

Copy link
Contributor

@TimothyTickle TimothyTickle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much better, and the definitions feel less specific to use and a little more generalized and reusable (potentially less over-fit).

@NoopDog
Copy link

NoopDog commented Nov 5, 2019

Pausing to merge with #119 and align with upcoming reproducibility and data citation requirements.

@NoopDog
Copy link

NoopDog commented Apr 5, 2021

Obsolete

@NoopDog NoopDog closed this Apr 5, 2021
@DailyDreaming DailyDreaming deleted the mfreeberg-rfc-bundle-definitions branch April 15, 2021 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants