RFC: HCA DCP Bundle Types and Definitions #93

Closed
wants to merge 29 commits into from
14c0b4b
Added bundle definition RFC.
malloryfreeberg Jul 12, 2019
f58d817
Fixed formatting.
malloryfreeberg Jul 12, 2019
f9daf4a
Fixed formatting again.
malloryfreeberg Jul 12, 2019
6ce0cc0
Added new bundle type description stubs.
malloryfreeberg Jul 12, 2019
c4f3735
Updated table header definitions.
malloryfreeberg Jul 12, 2019
ed916a1
Updated user story.
malloryfreeberg Jul 12, 2019
e4c5e15
Fixed formatting.
malloryfreeberg Jul 12, 2019
e44d743
Added note about specification being OOS
malloryfreeberg Jul 15, 2019
044a3bd
Updated motivation.
malloryfreeberg Jul 15, 2019
4f8e7bc
Updated use cases.
malloryfreeberg Jul 15, 2019
5028e14
Added question about bundle definition.
malloryfreeberg Jul 15, 2019
055b9ff
Removed bundle specifications.
malloryfreeberg Jul 15, 2019
67df958
Added notation of new bundle types.
malloryfreeberg Jul 15, 2019
f898c18
Added clarification of new bundle types
malloryfreeberg Jul 15, 2019
f05242e
Updated use cases in table.
malloryfreeberg Jul 15, 2019
8244ca9
Updated use cases in table.
malloryfreeberg Jul 15, 2019
fa0a66e
Updated NB for new bundle types.
malloryfreeberg Jul 15, 2019
ff6921c
Updated Resource bundle definition.
malloryfreeberg Jul 15, 2019
efbcc3a
Update 0000-bundle-definition.md
malloryfreeberg Jul 16, 2019
b6c8828
Added DAPS bundle type.
malloryfreeberg Jul 29, 2019
bb551e7
Updated questions.
malloryfreeberg Jul 29, 2019
439cb91
Added reference to DAPS bundle RFC.
malloryfreeberg Jul 31, 2019
0db9fd2
Remove data consumer acceptance criterion
kislyuk Sep 12, 2019
d968507
Merge #86 into #93
kislyuk Sep 16, 2019
7ab8aa6
Fix typo, thanks @chmreid
kislyuk Sep 16, 2019
8a448cb
Data Processing Service may consume secondary analysis bundles
kislyuk Sep 16, 2019
f4d0722
Project bundles may be used by DCP services
kislyuk Sep 16, 2019
a7b7953
Address review comments
kislyuk Sep 27, 2019
b14ccb5
Address review comments
kislyuk Sep 27, 2019
248 changes: 248 additions & 0 deletions rfcs/text/0000-bundle-definition.md
@@ -0,0 +1,248 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# HCA DCP Bundle Types and Definitions

## Summary

The HCA DCP currently has no documented definition of a "bundle": what bundles are, what they contain, or whom they
support. This lack of clarity confuses data consumers about how the data and metadata in the DCP are structured and
organized. This RFC formalizes *de facto* data bundle types, definitions, target users/consumers, and specific use
cases to improve clarity and confidence among DCP developers when working with the DCP data model.

Specifically, the following charters will be supported by this RFC:

- The Data Store (DSS)
[charter](https://github.com/HumanCellAtlas/dcp-community/blob/master/charters/DataStore/charter.md#core-capabilities)
("The Data Store owns the bundle use cases and bundle type definitions, while the precise specifications will be
negotiated between the Metadata and other teams.");

- The Ingestion Service
[charter](https://github.com/HumanCellAtlas/dcp-community/blob/master/charters/IngestionService/charter.md#dependencies)
("The Ingestion Service has a dependency on the Data Store team to provide new bundle requirements, and the Metadata
Schema team to provide the definition of bundle specifications, in order to support the generation of bundles from
submitted data and metadata."); and

- The Metadata Schema
[charter](https://github.com/HumanCellAtlas/dcp-community/blob/master/charters/MetadataSchema/charter.md#objectives)
("Working with the Data Store to receive new bundle structure requirements and defining a new specification to be
implemented by the Ingestion Service team.").

## Author(s)

* [Mallory Freeberg](mailto:[email protected]) (TODO: Replace Mallory with Zina P and/or Chris V.)

* [Brian Hannafious](mailto:[email protected])

* [Hannes Schmidt](mailto:[email protected])

* [Andrey Kislyuk](mailto:[email protected])

## Shepherd

* [Andrey Kislyuk](mailto:[email protected])

## Motivation

The HCA DCP currently has no documented definition of a "bundle": what bundles are, what they contain, or whom they
support. This lack of clarity confuses data consumers (and DCP developers) about how the data and metadata in the DCP
are structured and organized. At the June 2019 DCP F2F meeting in Cambridge, a breakout session on "What is a bundle?"
generated useful discussion. This RFC formally documents what a bundle is, anchoring the definition on how bundles
support HCA DCP data consumers. It also describes new bundle types that could be supported by the HCA DCP.

### User Stories

1. As a data consumer who develops data processing tools (*e.g.* a sequence read aligner), I would like to browse,
query, and get primary data from the DSS so that I can use it as testing data during development.

1. As a data consumer who develops analysis tools (*e.g.* a clustering algorithm), I would like to browse, query, and
get alignment data from the DSS so that I can use it as testing data during development.

1. As a data consumer who is a computational biologist, I would like to get expression matrices from the DSS so that I
can compare against data I already have.

1. As a data consumer who is a biologist, I would like to get expression matrices from the DSS so that I can see the
expression profile of &lt;my gene of interest&gt; in &lt;my cell type(s) or organ(s) of interest&gt;.

1. As a data consumer who is a computational biologist, I would like to know what reference data was used to generate
processed data in the DSS and I would like to be able to get it for myself so that I can process/analyze my own data
with the same references as HCA DCP data.

1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and
analysis tools.

1. As a tool developer, I would like to identify and get alignment results and count matrices in order to produce
expression matrices.

1. As an external user of the DSS who is developing analysis tools, I would like to browse, query, and get raw and
processed data based on some rough criteria so that I can understand what the data are and develop my tool.

1. As an Azul or Query Service developer I want to be able to distinguish between different types of bundles in order to
know which bundle to index, link, etc. or to provide bundle types to users as a search criterion.

1. As a computational biologist, I would like to use the same references that the HCA DCP uses in their analysis.

## Detailed Design

### What is a bundle?

A bundle is a unit of data organization, roughly equivalent to a folder or directory. In the HCA DCP, a bundle is
defined as either a transactional grouping of files that belong together for scientific reasons, or a useful association
between a set of files for utility purposes.

* Data bundles contain the results of a single data transaction. They can be viewed as the digital equivalent of an
"instrument run" - a quantity of data gathered from a single process instance. Data bundles are annotated with
metadata using a procedure designed by the metadata team.

* Utility bundles exist to organize (group) data bundles and to represent data of non-DCP origin (such as reference
data) that is used in the course of analysis.

### How are bundle types defined?

Operationally, bundle types can be defined like file types or media types: a bundle type serves as a data type
identifier that applications pattern match against to decide whether data is suitable as input.
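
As a rough illustration of this kind of matching, the sketch below checks a bundle type string against a list of
glob patterns. The pattern list and the `accepts` helper are hypothetical and not part of any DCP API; the bundle type
strings follow the example values given in the registry section of this RFC.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Hypothetical: the glob patterns an application declares for bundle types it can consume.
ACCEPTED_PATTERNS = ("hca/primary-data*", "hca/analysis-output*")


def accepts(bundle_type: str, patterns: Sequence[str] = ACCEPTED_PATTERNS) -> bool:
    """Return True if the bundle type string matches any accepted pattern."""
    return any(fnmatchcase(bundle_type, pattern) for pattern in patterns)
```

An application that only consumes data bundles could call `accepts(manifest_bundle_type)` before processing and skip
utility bundles such as `hca/project`.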

### Bundle types and their definitions

Bundle types, definitions, consumers, and specifications are outlined below. Use cases from above are indicated where
supported by a specific bundle type.

#### Table header definitions

- **Type**: The name of the bundle type

- **Definition**: What is and is not in this bundle type

- **Target consumer**: Who will be consuming this bundle type

- **Use cases**: Which User Stories are supported by this bundle type

| Type | Definition | Target consumer | Use cases |
|:-|:-|:-|:-|
| Primary bundle | A bundle that contains primary (raw) data files and all related metadata files. | computational biologists; Data Processing Pipelines; Data Browser; Query Service | 1 |
| Secondary bundle | A bundle that contains secondary (alignment, expression) data files produced by the Data Processing Pipelines, input primary data files, and all related metadata files. | computational biologists; Data Processing Pipelines; Data Browser; Query Service; Matrix Service | 2, 3, 4(?) |
| Tertiary bundle* | A bundle that contains tertiary (expression) data files produced by the Matrix Service, input primary and secondary data files, and all related metadata files. | computational biologists; biologists; Data Browser; Query Service | 3, 4 |
Member

I think "tertiary (expression)" is unnecessarily confusing. Tertiary analysis can be any meta-analysis, it should not be constrained to be "expression", which itself is a confusing shorthand for "expression matrix".

I propose renaming:

  • "Secondary bundle" -> "Analysis bundle"
  • "Tertiary bundle" -> "Expression Matrix bundle"

Member Author

I'm ok with renaming secondary and tertiary (in fact, that's the point of this RFC!). If we are renaming those to be nouns, we should also rename the primary bundle, perhaps:

  • "Primary bundle" -> "Raw (or Primary) data bundle"
  • "Secondary bundle" -> "Analysis bundle"
  • "Tertiary bundle" -> "Expression Matrix bundle"

To clarify, "Secondary/Analysis bundles" also contain expression data (gene x cell count matrices) in addition to alignment results, so I wonder if it's confusing to have these results not in the "Expression Matrix bundle".

Contributor

"Raw" doesn't seem very meaningful. Someone looking at gene expression (most likely the majority of users) might think of unnormalized counts as "raw".

"Matrix" is a type of data structure, so it seems odd to include it in the name. Kind of like calling sequence experiments "fastq bundles". How about "gene expression bundle"?

Analysis doesn't have much meaning either. I was kind of thinking gene expression is the result of analysis.

Contributor

Gene expression matrices are a result of both pipelines and the matrix service (pipelines make them, matrix service acts on them for users).

| Project bundle* | A bundle that contains project-level metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes, or files. | computational biologists; biologists; | ? |
| Resource/Reference bundle* | A bundle that contains reference data files used by the Data Processing Pipelines(?) and all related metadata files. This bundle type does not contain data files or metadata files related to biomaterials, protocols, processes(, or projects?). | computational biologists; Data Processing Pipelines; | 2, 3, 5 |
| DAPS bundle* | Definition discussion ongoing in [PR #88](https://github.com/HumanCellAtlas/dcp-community/pull/88). | ? | ? |

Contributor

I am not sure why the above bundles were defined in this manner, other than that this is the de facto state. I am not sure that is the intent of this RFC; at least, it does not feel like a formal understanding of bundles but more like documentation of what we are using (without comment or systematic definitions).

Member

@TimothyTickle I believe that is the intent. I don't think we need to "formally understand bundles". We need to define bundle types and annotate them with enough information for application developers to easily reason about them, so they can develop applications and present data to users with reasonable certainty.

> Asterisk (*) indicates bundle types that do not currently exist in the DCP but have been discussed as potential future
> bundle types to support data consumers.

### New bundle types and their definitions
Contributor

These new bundle types and definitions are not complete but I would love to review them when they are. Feel free to reach out.


The only bundle types that currently exist in the DSS are Primary and Secondary bundles. The remaining bundle types are
proposed based on data consumer needs and are described in more detail below.

#### New Tertiary bundle

- Produced by Matrix Service. *Will anyone else be producing expression data files using the output of Data Processing
Pipelines as input?*

- In the first instance, will contain expression data files representing one project. In the future, will contain
expression data files representing partial projects or multiple projects.

#### New Project bundle

- Will contain project-level metadata only.

- Can be used to group data for presentation to data consumers and DCP services.

#### New Resource/Reference bundle

- Should we call it "Resource" or "Reference" bundle?

- Who will generate/maintain these? Data Processing Pipelines?

### When is a new bundle type needed?

* There are 3 classes of data bundles currently provided or concretely envisioned in DCP: primary data bundles,
secondary analysis bundles, and tertiary (meta-analysis or aggregate analysis) bundles. A new class of bundle could be
needed if a new analysis modality or data aggregation method were to be introduced.

* As denoted in the bundle type notation sketch below, [RFC 7231](https://tools.ietf.org/html/rfc7231) parameters can be
used to designate the process that generated a bundle. A new class of process (for example, a new secondary analysis
pipeline) could result in a new `bundle-type` value with the same bundle type and subtype as previous secondary
analysis data bundles, but with a new parameter value for `hca-pipeline`.

* Similarly, a new modality or format of primary experimental output data would result in a new primary data bundle
`bundle-type` value.
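
The relationship between type, subtype, and parameters can be sketched with a small parser. This is a minimal
illustration, not a DSS implementation: `BundleType` and `parse_bundle_type` are hypothetical names, and
`example-pipeline` is a made-up parameter value. One assumption is worth calling out: RFC 7231 section 3.1.1.1 writes
parameters as `name=value`, while the example values in this RFC's registry write `name:value`, so the sketch accepts
either separator.

```python
import re
from typing import Dict, NamedTuple


class BundleType(NamedTuple):
    type: str
    subtype: str
    params: Dict[str, str]


def parse_bundle_type(value: str) -> BundleType:
    """Split 'type/subtype; name:value' into its parts (hypothetical helper)."""
    main, *raw_params = [part.strip() for part in value.split(";")]
    type_, sep, subtype = main.partition("/")
    if not sep or not type_ or not subtype:
        raise ValueError(f"expected 'type/subtype', got {main!r}")
    params: Dict[str, str] = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing ';'
        # Assumption: accept both ':' (as in this RFC's examples) and '=' (as in RFC 7231).
        match = re.fullmatch(r"([^:=\s]+)\s*[:=]\s*(.+)", raw)
        if match is None:
            raise ValueError(f"malformed parameter {raw!r}")
        params[match.group(1).lower()] = match.group(2)
    return BundleType(type_.lower(), subtype.lower(), params)


# Two secondary analysis bundles from different pipelines share type and subtype
# and differ only in the hca-pipeline parameter.
existing = parse_bundle_type("hca/analysis-output; hca-pipeline:snap-atac")
new = parse_bundle_type("hca/analysis-output; hca-pipeline:example-pipeline")
```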

### Conveying bundle types

A new field, `bundle-type`, will be introduced to the DSS bundle manifest (the response to `GET bundle` in DSS). The
field value is a string formatted in a similar manner
to [DCP Media Types](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw) and consistent
with the syntax defined in [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1). DSS must require the bundle
type to be specified when creating a new bundle, but should not enforce the vocabulary. DSS must allow subscriptions
and search results to be predicated on bundle type using the same string predicates available with other metadata and
manifest string fields.
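
The requirements in this section might be sketched as follows: the field must be present, its vocabulary is not
checked, and selection uses an ordinary string predicate (a prefix match here). The manifest shape (a mapping with a
`bundle-type` key and a `uuid` key) and both function names are assumptions made for illustration; they are not the
DSS API.

```python
from typing import Iterable, List, Mapping


def require_bundle_type(manifest: Mapping[str, object]) -> str:
    """Reject manifests without a bundle type; do not check the vocabulary."""
    value = manifest.get("bundle-type")
    if not isinstance(value, str) or not value.strip():
        raise ValueError("bundle-type must be a non-empty string")
    return value


def select_by_prefix(manifests: Iterable[Mapping[str, object]], prefix: str) -> List[Mapping[str, object]]:
    """Keep manifests whose bundle type starts with the given prefix."""
    return [m for m in manifests if require_bundle_type(m).startswith(prefix)]


# Hypothetical manifests; only the bundle-type field is defined by this RFC.
manifests = [
    {"uuid": "a", "bundle-type": "hca/primary-data; hca-data-type:SmartSeq2"},
    {"uuid": "b", "bundle-type": "hca/project"},
    {"uuid": "c", "bundle-type": "lab/custom"},  # unregistered vocabulary is still accepted
]
primary = select_by_prefix(manifests, "hca/primary-data")
```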

### Registry of Bundle Types

Introduce a registry of bundle types to be maintained by the Data Store team with specification input from the Metadata
team. The initial contents of the registry are:

| Bundle Type | Bundle Class | Example `bundle-type` value |
|--------------------------|---------------------|-------------------------------------------------------------|
| Project Metadata Bundle | Utility bundle | `hca/project` |
| Primary Sequence Bundle | Primary data bundle | `hca/primary-data; hca-data-type:SmartSeq2` |
| Primary Imaging Bundle | Primary data bundle | `hca/primary-data; hca-data-type:sptx` |
| Secondary Analysis Bundle| Analysis data bundle| `hca/analysis-output; hca-pipeline:snap-atac` |
| Reference Resource Bundle| Utility bundle | `hca/analysis-reference; hca-resource-type:reference-genome`|
| Expression Matrix Bundle | Analysis data bundle| `hca/analysis-output; hca-resource-type:expression-matrix` |

The registry would be maintained either as part of this git repository (HumanCellAtlas/dcp-community) or as part of a
repository chosen by the Data Store team, subject to a pull request-based contribution model.
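
One possible in-code shape for such a registry is a lookup from `type/subtype` to bundle class. The entries below are
copied from the table above; the dictionary layout and the `bundle_class` helper are assumptions for illustration
only, and parameters after the first `;` are ignored because every example in the table has a class determined by its
`type/subtype` alone.

```python
from typing import Dict, Optional

# Entries taken from the registry table; keyed by "type/subtype".
BUNDLE_CLASS_BY_TYPE: Dict[str, str] = {
    "hca/project": "Utility bundle",
    "hca/primary-data": "Primary data bundle",
    "hca/analysis-output": "Analysis data bundle",
    "hca/analysis-reference": "Utility bundle",
}


def bundle_class(bundle_type: str) -> Optional[str]:
    """Return the registered class for a bundle type, or None if unregistered."""
    base = bundle_type.split(";", 1)[0].strip().lower()
    return BUNDLE_CLASS_BY_TYPE.get(base)
```

Returning `None` rather than raising for unknown values mirrors the statement that DSS should not enforce the
vocabulary: an unregistered bundle type is not an error, it is simply not classified.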

### Acceptance Criteria [optional]

*Acceptance criteria are the conditions that an RFC must satisfy to be accepted by users or other stakeholders.*

1. DCP developers have a shared understanding of what a bundle is, including what is in a bundle and how bundles relate
to one another.

### Unresolved Questions

*What aspects of the design do you expect to clarify further through the RFC review process?*

1. What is the process for adding or adopting new bundle types? For changing the definitions of current bundle types?

1. What other bundle specifications should be included here, especially in light of the bundle type design RFC
([PR link](https://github.com/HumanCellAtlas/dcp-community/pull/86))?

*What aspects of the design do you expect to clarify later during iterative development of this RFC?*

1. Does this informational RFC depend on mechanics of copy-forward?

1. What bundle types should be defined in this RFC? What types should be defined in future RFCs?

1. What defines a "correct" bundle type?

1. Is the level of detail for the operational requirements of a bundle registry sufficient?

### Drawbacks and Limitations [optional]

*Why should this RFC **not** be implemented?*

### Prior Art [optional]

*Share references to prior art to deepen community understanding of the RFC, such as learnings, adaptations from earlier designs, or community standards.*

- [Submission to bundles SOP (Apr 2018)](https://docs.google.com/document/d/1x8mYLU8ubpZtTrzkwJrft1heqReX-pJLjJfb2X0fd2w)

- [Bundle metadata schema update ticket](https://github.com/HumanCellAtlas/metadata-schema/issues/986)

### Alternatives [optional]

*Highlight other possible approaches to delivering the value proposed in this RFC.
What other designs were explored? What were their advantages? What was the rationale for rejecting alternatives?*

- Status Quo: no well-defined bundle types. Existing DCP components continue to infer bundle types via ad hoc methods
and included metadata. No one *really* knows what a bundle is, or there are many different ideas about what a bundle
is.