RFC: HCA DCP Bundle Types and Definitions #93

Closed
Commits
29 commits
14c0b4b
Added bundle definition RFC.
malloryfreeberg Jul 12, 2019
f58d817
Fixed formatting.
malloryfreeberg Jul 12, 2019
f9daf4a
Fixed formatting again.
malloryfreeberg Jul 12, 2019
6ce0cc0
Added new bundle type description stubs.
malloryfreeberg Jul 12, 2019
c4f3735
Updated table header definitions.
malloryfreeberg Jul 12, 2019
ed916a1
Updated user story.
malloryfreeberg Jul 12, 2019
e4c5e15
Fixed formatting.
malloryfreeberg Jul 12, 2019
e44d743
Added note about specification being OOS
malloryfreeberg Jul 15, 2019
044a3bd
Updated motivation.
malloryfreeberg Jul 15, 2019
4f8e7bc
Updated use cases.
malloryfreeberg Jul 15, 2019
5028e14
Added question about bundle definition.
malloryfreeberg Jul 15, 2019
055b9ff
Removed bundle specifications.
malloryfreeberg Jul 15, 2019
67df958
Added notation of new bundle types.
malloryfreeberg Jul 15, 2019
f898c18
Added clarification of new bundle types
malloryfreeberg Jul 15, 2019
f05242e
Updated use cases in table.
malloryfreeberg Jul 15, 2019
8244ca9
Updated use cases in table.
malloryfreeberg Jul 15, 2019
fa0a66e
Updated NB for new bundle types.
malloryfreeberg Jul 15, 2019
ff6921c
Updated Resource bundle definition.
malloryfreeberg Jul 15, 2019
efbcc3a
Update 0000-bundle-definition.md
malloryfreeberg Jul 16, 2019
b6c8828
Added DAPS bundle type.
malloryfreeberg Jul 29, 2019
bb551e7
Updated questions.
malloryfreeberg Jul 29, 2019
439cb91
Added reference to DAPS bundle RFC.
malloryfreeberg Jul 31, 2019
0db9fd2
Remove data consumer acceptance criterion
kislyuk Sep 12, 2019
d968507
Merge #86 into #93
kislyuk Sep 16, 2019
7ab8aa6
Fix typo, thanks @chmreid
kislyuk Sep 16, 2019
8a448cb
Data Processing Service may consume secondary analysis bundles
kislyuk Sep 16, 2019
f4d0722
Project bundles may be used by DCP services
kislyuk Sep 16, 2019
a7b7953
Address review comments
kislyuk Sep 27, 2019
b14ccb5
Address review comments
kislyuk Sep 27, 2019
156 changes: 156 additions & 0 deletions rfcs/text/0000-bundle-definition.md
@@ -0,0 +1,156 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# HCA DCP Bundle Types and Definitions

## Summary

The HCA DCP currently has no documented definition of a "bundle": what a bundle is, what it contains, or whom it supports. This lack of clarity confuses data consumers about how the data and metadata in the DCP are structured and organized. Here we propose to formalize a bundle definition, bundle type definitions, target users and consumers, and specific use cases in order to support the needs of data consumers.
kislyuk marked this conversation as resolved.

Specifically, the following charters will be supported by this RFC:
- the Data Store (DSS) [charter](https://github.com/HumanCellAtlas/dcp-community/blob/master/charters/DataStore/charter.md#core-capabilities) ("The Data Store owns the bundle use cases and bundle type definitions, while the precise specifications will be negotiated between the Metadata and other teams.");
- the Ingestion Service [charter](https://github.com/HumanCellAtlas/dcp-community/blob/master/charters/IngestionService/charter.md#dependencies) ("The Ingestion Service has a dependency on the Data Store team to provide new bundle requirements, and the Metadata Schema team to provide the definition of bundle specifications, in order to support the generation of bundles from submitted data and metadata."); and
- the Metadata Schema [charter](https://github.com/HumanCellAtlas/dcp-community/blob/master/charters/MetadataSchema/charter.md#objectives) ("Working with the Data Store to receive new bundle structure requirements and defining a new specification to be implemented by the Ingestion Service team.").


## Author(s)

*Recommended format for Authors:*

[Mallory Freeberg](mailto:[email protected])

[Brian Hannafious](mailto:[email protected])

[Hannes Schmidt](mailto:[email protected])

[Andrey Kislyuk](mailto:[email protected])

> Replace Mallory with Zina P and/or Chris V.

## Shepherd
***Leave this blank.** This role is assigned by DCP PM to guide the **Author(s)** through the RFC process.*

*Recommended format for Shepherds:*

`[Name](mailto:[email protected])`

## Motivation

*Describe the user or technical need in detail [with alignment to the DCP roadmap priorities where possible]. Link prior community discussions to demonstrate support for this RFC.*

The HCA DCP currently has no documented definition of a "bundle": what a bundle is, what it contains, or whom it supports. This lack of clarity confuses data consumers (and DCP developers) about how the data and metadata in the DCP are structured and organized. At the June 2019 DCP F2F meeting in Cambridge, a breakout session was held to discuss "What is a bundle?", and it produced a lot of useful discussion. This RFC will formally document what a bundle is, anchoring the definition on how bundles support HCA DCP data consumers. This RFC will also describe new bundle types that could be supported by the HCA DCP.
kislyuk marked this conversation as resolved.

### User Stories

*Share the [User Stories](https://www.mountaingoatsoftware.com/agile/user-stories) motivating this RFC.*

1. As a data consumer who develops data processing tools (*e.g.* a sequence read aligner), I would like to browse, query, and get primary data from the DSS so that I can use it as testing data during development.
1. As a data consumer who develops analysis tools (*e.g.* a clustering algorithm), I would like to browse, query, and get alignment data from the DSS so that I can use it as testing data during development.
1. As a data consumer who is a computational biologist, I would like to get expression matrices from the DSS so that I can compare against data I already have.
1. As a data consumer who is a biologist, I would like to get expression matrices from the DSS so that I can see the expression profile of &lt;my gene of interest&gt; in &lt;my cell type(s) or organ(s) of interest&gt;.
1. As a data consumer who is a computational biologist, I would like to know what reference data was used to generate processed data in the DSS and I would like to be able to get it for myself so that I can process/analyze my own data with the same references as HCA DCP data.


## Scientific "guardrails" [optional]

*Describe recommended or mandated review from HCA Science governance to ensure that the RFC addresses the needs of the scientific community.*

## Detailed Design

*Explain the design in sufficient detail such that the implementation and (if appropriate) the interaction with existing DCP software are both reasonably clear.*

### General bundle definition

In the HCA DCP, a bundle is defined as:
1. a transaction of files that belong together for scientific reasons - **OR** -
Contributor

As we adopt more and more scientific use cases, this definition (1) will become less and less useful. With more use cases, files will need to be dynamically aggregated to fulfill specific use cases (as illustrated in the multi-library RFA). If this is a hard grouping (a transaction of files), then would you group files only by run, only by release, or only by biological network? None of these would be the right answer, but all would be needed groupings for scientific use cases.

Member

@TimothyTickle thanks for pointing this out. I updated and expanded the bundle type definition while merging the contents of #86 into this PR - please take another look.

I disagree that the "transactional unit" definition will become less useful. Operationally, a bundle is the output of an application run. It has to be grouped transactionally for data integrity purposes, and decorated with metadata for discoverability purposes. Whether the grouping is done by release, biological network, etc. is up to the application developers.

Contributor

Agreed with @kislyuk; transactions are key to data integrity, and it is necessary for DCP components to think deeply about data integrity.

1. an association between a set of files (*e.g.* data, metadata)

> These two definitions came out of the F2F bundle session and were generally agreed to make sense by different groups. Should we go with just one of these? Do we merge them together? Do we accept both? Should there be a vote?
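
As an illustration only, the two candidate definitions above could be modelled together as follows. This is a minimal sketch, not the DSS data model: `BundleFile`, `Bundle`, `create_bundle`, and all field names are hypothetical and chosen for this example. The frozen dataclass stands in for "an association between a set of files", and the all-or-nothing check in `create_bundle` stands in for the "transaction" reading.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class BundleFile:
    """One file in a bundle (hypothetical fields, not the DSS schema)."""
    name: str
    sha256: str
    is_metadata: bool


@dataclass(frozen=True)
class Bundle:
    """An immutable association between a set of data and metadata files."""
    uuid: str
    version: str
    files: Tuple[BundleFile, ...]

    def data_files(self) -> Tuple[BundleFile, ...]:
        return tuple(f for f in self.files if not f.is_metadata)

    def metadata_files(self) -> Tuple[BundleFile, ...]:
        return tuple(f for f in self.files if f.is_metadata)


def create_bundle(uuid: str, version: str, files: Iterable[BundleFile]) -> Bundle:
    """Build the bundle all-or-nothing: reject the whole set if it is invalid."""
    file_tuple = tuple(files)
    if not file_tuple:
        raise ValueError("a bundle must contain at least one file")
    names = [f.name for f in file_tuple]
    if len(names) != len(set(names)):
        raise ValueError("file names within a bundle must be unique")
    return Bundle(uuid=uuid, version=version, files=file_tuple)
```

Under this sketch the two definitions are not mutually exclusive: the association describes what a bundle is once it exists, and the transaction describes how it comes into existence.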

### Bundle types and their definitions

Bundle types, definitions, consumers, and specifications are outlined below. Use cases from above are indicated where supported by a specific bundle type.

#### Table header definitions
- **Type**: The name of the bundle type
- **Definition**: What is and is not in this bundle type
- **Target consumer**: Who will be consuming this bundle type
- **Use cases**: Which User Stories are supported by this bundle type

| Type | Definition | Target consumer | Use cases |
|:-|:-|:-|:-|
| Primary bundle | A bundle that contains primary (raw) data files and all related metadata files. | computational biologists; Data Processing Pipelines; Data Browser; Query Service | 1 |
| Secondary bundle | A bundle that contains secondary (alignment, expression) data files produced by the Data Processing Pipelines, input primary data files, and all related metadata files. | computational biologists; Data Browser; Query Service; Matrix Service | 2, 3, 4(?) |
kislyuk marked this conversation as resolved.
| Tertiary bundle* | A bundle that contains tertiary (expression) data files produced by the Matrix Service, input primary and secondary data files, and all related metadata files. | computational biologists; biologists; Data Browser; Query Service | 3, 4 |
Member

I think "tertiary (expression)" is unnecessarily confusing. Tertiary analysis can be any meta-analysis, it should not be constrained to be "expression", which itself is a confusing shorthand for "expression matrix".

I propose renaming:

  • "Secondary bundle" -> "Analysis bundle"
  • "Tertiary bundle" -> "Expression Matrix bundle"

Member Author

I'm ok with renaming secondary and tertiary (in fact, that's the point of this RFC!). If we are renaming those to be nouns, we should also rename the primary bundle, perhaps:

  • "Primary bundle" -> "Raw (or Primary) data bundle"
  • "Secondary bundle" -> "Analysis bundle"
  • "Tertiary bundle" -> "Expression Matrix bundle"

To clarify, "Secondary/Analysis bundles" also contain expression data (gene x cell count matrices) in addition to alignment results, so I wonder if it's confusing to have these results not in the "Expression Matrix bundle".

Contributor

"Raw" doesn't seem very meaningful. Someone looking at gene expression (most likely the majority of users) might think of unnormalized counts as "raw".

"Matrix" is a type of data structure, so it seems odd to include it in the name. Kind of like calling sequence experiments "fastaq bundles". How about "gene expression bundle"?

Analysis doesn't have much meaning either. I was kind of thinking gene expression is the result of analysis.

Contributor

Gene expression matrices are a result of both pipelines and the matrix service (pipelines make them, matrix service acts on them for users).

| Project bundle* | A bundle that contains project-level metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes, or files. | computational biologists; biologists; | ? |
kislyuk marked this conversation as resolved.
| Resource/Reference bundle* | A bundle that contains reference data files used by the Data Processing Pipelines(?) and all related metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes(, or projects?). | computational biologists; Data Processing Pipelines; | 2, 3, 5 |
kislyuk marked this conversation as resolved.
| DAPS bundle* | Definition discussion ongoing in [PR #88](https://github.com/HumanCellAtlas/dcp-community/pull/88). | ? | ? |

Contributor

I am not sure why the above bundles were defined in this manner, other than that this is the de facto state. I am not sure that is the intent of this RFC; at least, it does not feel like a formal understanding of bundles so much as documentation of what we are using (without comment or systematic definitions).

Member

@TimothyTickle I believe that is the intent. I don't think we need to "formally understand bundles". We need to define bundle types and annotate them with enough information for application developers to easily reason about them, so they can develop applications and present data to users with reasonable certainty.

> Asterisk (*) indicates bundle types that do not currently exist in the DCP but have been discussed as potential future bundle types to support data consumers.
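
For application developers, the table above can be read as a lookup from User Story to bundle type. The following is a hedged sketch under stated assumptions: the names `BundleType`, `USE_CASES`, `EXISTING_TYPES`, and `bundle_types_for` are hypothetical and not part of any DCP API. Entries marked "?" or "(?)" in the table are left out rather than guessed, and the DAPS bundle is omitted because its definition is still under discussion.

```python
from enum import Enum
from typing import Dict, FrozenSet, List


class BundleType(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    TERTIARY = "tertiary"      # proposed (asterisk in the table)
    PROJECT = "project"        # proposed (asterisk in the table)
    REFERENCE = "reference"    # proposed "Resource/Reference" bundle


# Bundle types listed in the table without an asterisk.
EXISTING_TYPES: FrozenSet[BundleType] = frozenset(
    {BundleType.PRIMARY, BundleType.SECONDARY}
)

# User Story numbers from the "Use cases" column.
USE_CASES: Dict[BundleType, FrozenSet[int]] = {
    BundleType.PRIMARY: frozenset({1}),
    BundleType.SECONDARY: frozenset({2, 3}),   # "4(?)" in the table is omitted
    BundleType.TERTIARY: frozenset({3, 4}),
    BundleType.PROJECT: frozenset(),           # "?" in the table
    BundleType.REFERENCE: frozenset({2, 3, 5}),
}


def bundle_types_for(user_story: int, existing_only: bool = False) -> List[BundleType]:
    """Return the bundle types that support a User Story, in table order."""
    return [
        t
        for t in BundleType
        if user_story in USE_CASES[t]
        and (not existing_only or t in EXISTING_TYPES)
    ]
```

For example, `bundle_types_for(3, existing_only=True)` returns only the Secondary bundle, which reflects that User Story 3 is served today by a single existing type.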

### New bundle types and their definitions
Contributor

These new bundle types and definitions are not complete but I would love to review them when they are. Feel free to reach out.


> **NB**: Describing and defining new bundle types probably belongs in its own design RFC, but they are included here as a natural extension of defining the current two bundle types. Consider moving the "New bundle types and their definitions" section to a new design RFC.

The only bundle types that currently exist in the DSS are Primary and Secondary bundles. The remaining bundle types are proposed based on data consumer needs and are described in more detail below.

#### New Tertiary bundle

Describe here...

- Produced by Matrix Service. *Will anyone else be producing expression data files using the output of Data Processing Pipelines as input?*
- In the first instance will contain expression data files representing one project. In the future, will contain expression data files representing partial projects or multiple projects.
-

#### New Project bundle

Describe here...

- Will contain project-level metadata only.
- Unclear how this bundle will be used by data consumers.
Contributor

How about internal users?

Member

Thanks, great point. Added "and DCP services" to the above.


#### New Resource/Reference bundle

Describe here...

- Should we call it "Resource" or "Reference" bundle?
- Who will generate/maintain these? Data Processing Pipelines?

### Acceptance Criteria [optional]

*Acceptance criteria are the conditions that an RFC must satisfy to be accepted by users or other stakeholders.*
kislyuk marked this conversation as resolved.

1. Data consumers can make use of data from the DCP without knowing what a bundle is.


I see this RFC is still in draft status, but I wanted to note here that this AC doesn't seem to fit in with the rest of the document, which focuses on defining bundle types rather than hiding bundle details from data consumers. Maybe it can be re-worded or removed?

Member

Thank you @samanehsan, I agree it's not addressed by the RFC in its current form. I'll remove it.


Seems to me the AC above is a valid constraint on the solution, that it does not further bake in the bundle concept to the DCP interface. Users care about releases, the files in them, and the processes used to create the files.

We should be careful not to force hundreds/thousands of users to also learn our "bundle types" when we already have metadata describing the processes that create the files.

Member

@NoopDog That may be true, but the content of this RFC as it stands does not change the status quo in that respect nor pass judgment on how data is presented to the user.

More generally, any advocate for removing bundles from the presentation layer should come up with a concrete, actionable proposal (perhaps in another RFC) for what to replace them with.

1. DCP developers have a shared understanding of what a bundle is, including what is in a bundle and how bundles relate to one another.

### Unresolved Questions

*What aspects of the design do you expect to clarify further through the RFC review process?*
1. What is the process for adding or adopting new bundle types? Changing definition of current bundle types?
1. What other bundle specifications should be included here, especially in light of the bundle type design RFC ([PR link](https://github.com/HumanCellAtlas/dcp-community/pull/86))?

*What aspects of the design do you expect to clarify later during iterative development of this RFC?*
1. Does this informational RFC depend on mechanics of copy-forward?
1. What bundle types should be defined in this RFC? What types should be defined in future RFCs?


### Drawbacks and Limitations [optional]

*Why should this RFC **not** be implemented?*

### Prior Art [optional]

*Share references to prior art to deepen community understanding of the RFC, such as learnings, adaptations from earlier designs, or community standards.*

- [Submission to bundles SOP (Apr 2018)](https://docs.google.com/document/d/1x8mYLU8ubpZtTrzkwJrft1heqReX-pJLjJfb2X0fd2w/edit#heading=h.yciwsskdtgsq)
- [Bundle metadata schema update ticket](https://github.com/HumanCellAtlas/metadata-schema/issues/986)

### Alternatives [optional]

*Highlight other possible approaches to delivering the value proposed in this RFC.
What other designs were explored? What were their advantages? What was the rationale for rejecting alternatives?*

Status quo: No one *really* knows what a bundle is or there are many different ideas about what a bundle is. Chaos ensues.