WIP
kislyuk committed Jul 11, 2019
1 parent 9e6815c commit 82b1302
Showing 1 changed file with 90 additions and 0 deletions.
90 changes: 90 additions & 0 deletions rfcs/text/0000-bundle-types.md
@@ -0,0 +1,90 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# RFC: Bundle Types

## Summary

Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and
confidence among DCP developers when working with the DCP data model.

## Author(s)

* [Mallory Freeberg](mailto:[email protected])
* [Brian Hannafious](mailto:[email protected])
* [Hannes Schmidt](mailto:[email protected])
* [Andrey Kislyuk](mailto:[email protected])

## Shepherd
***Leave this blank.** This role is assigned by DCP PM to guide the **Author(s)** through the RFC process.*

*Recommended format for Shepherds:*

`[Name](mailto:[email protected])`

## Motivation

The HCA DCP currently has no documented definition of a "bundle": what bundles are, what they contain, or whom they
support. This lack of clarity confuses DCP developers and data consumers about how the data and metadata in the DCP are
structured and organized.

### User Stories

1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and
analysis tools.

1. As a tool developer, I would like to identify and get alignment results and count matrices in order to produce
expression matrices.

1. As a computational biologist, I would like to get expression matrices for the Immune Cell Atlas project so that I can
do my research.

1. As an external user of the DSS who is developing analysis tools, I would like to browse, query, and get raw and
processed data based on some rough criteria so that I can understand what the data are and develop my tool.

1. As an Azul developer, I would like to distinguish between different types of bundles so that I know which bundles to
index, link, etc.

1. As a computational biologist, I would like to know where the reference data is.

## Detailed Design

Introduce a new field, `bundle-type`, in the DSS bundle manifest (the response to `GET bundle` in DSS). The field value
is a string formatted in a similar manner
to [DCP Media Types](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw) and consistent
with the syntax defined in [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1). DSS must require the bundle
type to be specified when creating a new bundle, but should not enforce the vocabulary. DSS must allow subscriptions
and search results to be predicated on bundle type using the same string predicates available with other metadata and
manifest string fields.
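
As a rough illustration of the format described above, the following sketch splits a `bundle-type` value into its base type and parameters. It is not part of the proposal: the function and class names are hypothetical, and it assumes parameters are written as `name:value`, following the example values in the registry below. Note that RFC 7231 itself writes media type parameters as `name=value`, so the separator is an assumption that may need to change.

```python
from typing import Dict, NamedTuple


class BundleType(NamedTuple):
    """Hypothetical parsed form of a bundle-type string."""
    base: str               # e.g. "hca/primary-data"
    params: Dict[str, str]  # e.g. {"hca-data-type": "SmartSeq2"}


def parse_bundle_type(value: str) -> BundleType:
    """Split a bundle-type string into its base type and parameters."""
    base, *raw_params = [part.strip() for part in value.split(";")]
    # The base must look like "type/subtype" with both halves non-empty.
    if base.count("/") != 1 or not all(base.split("/")):
        raise ValueError(f"expected 'type/subtype', got {base!r}")
    params: Dict[str, str] = {}
    for raw in raw_params:
        # Assumption: "name:value" pairs, as in the registry examples.
        name, sep, val = raw.partition(":")
        if not sep or not name.strip() or not val.strip():
            raise ValueError(f"expected 'name:value', got {raw!r}")
        params[name.strip()] = val.strip()
    return BundleType(base, params)


parsed = parse_bundle_type("hca/primary-data; hca-data-type:SmartSeq2")
# A consumer could then filter on the base type alone, or on a parameter.
is_primary = parsed.base == "hca/primary-data"
```

Because DSS should not enforce the vocabulary, the sketch checks only the overall shape of the string and does not compare the base type or parameter names against the registry.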

### Registry of Bundle Types

Introduce a registry of bundle types to be maintained by the Metadata team. The initial contents of the registry are:

| Bundle Type | Example `bundle-type` value |
|--------------------------|-----------------------------------------------------------|
| Project Metadata Bundle | `hca/project` |
| Primary Sequence Bundle | `hca/primary-data; hca-data-type:SmartSeq2` |

@malloryfreeberg

malloryfreeberg Jul 12, 2019

Member

Is the idea that there will be an hca-data-type for all single cell technologies (e.g. "SmartSeq2")? This information is already in the HCA metadata standard. What are the consequences of duplicating it here? Which use case is this supporting? I was thinking this value would be something more like hca-data-type:sequence or hca-data-type:image

@kislyuk

kislyuk Jul 30, 2019

Author Member

The idea is that the assay type should be clearly evident from the top level bundle metadata. Currently it is quite hard for the uninitiated to figure out where to look for information defining the assay that is in the bundle; I propose that it be surfaced here. I like the sequence and image designations too; we could update this to read: hca/primary-data; hca-data-type:sequence; hca-assay-type:SmartSeq2

@malloryfreeberg

malloryfreeberg Jul 31, 2019

Member

I guess I'm just confused why someone - at the top level - needs to know the assay type.

I get why data processing pipelines need to know assay: they query it to find bundles they can process. But there are many fields they query to find bundles they can process. Do we want to put all that info at the top level?

I get why Browser needs to know assay: they display it as a facet. But they display many fields as facets. Should all go in the top level?

AFAIK, SmartSeq2 and 10x bundles are not treated differently by the DCP except in the case of deciding which workflows to run, which again depends on more fields than just assay type.

| Primary Imaging Bundle | `hca/primary-data; hca-data-type:sptx` |
| Secondary Analysis Bundle| `hca/analysis-output; hca-pipeline:snap-atac` |

@malloryfreeberg

malloryfreeberg Jul 12, 2019

Member

Given that the green box is the "Data Processing Pipelines" (as is the charter name) and that "analysis" is often interpreted as meaning clustering or trajectory or some other more complex process, I would recommend this value be something like hca/processed-data.

I have the same concern here about the hca-pipeline:snap-atac value as the SmartSeq2 value above: This information is in the metadata, so what is the value of having it also at the bundle level? Which use case is this supporting?

@kislyuk

kislyuk Jul 30, 2019

Author Member

Hmm... I would not agree that "analysis" is more prone to misinterpretation than "processed-data" here. I'm fine changing this, but perhaps we could gather some more opinions.

Again, the reason to surface it here is that it is essential typing data that is otherwise not easy to find or predicate searches on.

@malloryfreeberg

malloryfreeberg Jul 31, 2019

Member

+1 for getting other opinions!

| Reference Resource Bundle| `hca/analysis-support; hca-resource-type:reference-genome`|
| Expression Matrix Bundle | `hca/analysis-output; hca-resource-type:expression-matrix`|

@malloryfreeberg

malloryfreeberg Jul 12, 2019

Member

So some of these bundle types don't exist yet (Project Metadata Bundle, Reference Resource Bundle, Expression Matrix Bundle). Do we need to have a separate proposal for adding these bundle types, and does this RFC depend on that proposal? We also don't have a clear definition of what a bundle is. Does this RFC depend on having that definition?

There were some definitions proposed at the F2F session, including

  • a transaction of files that belong together for scientific reasons
  • an association between a set of files (e.g. data, metadata)

@kislyuk

kislyuk Jul 30, 2019

Author Member

@malloryfreeberg and I discussed this in person, and Mallory agreed to put the bundle definition work into https://github.com/HumanCellAtlas/dcp-community/pull/93/files.

### Unresolved Questions

- Bundle degrees: Useful? Who assigns them? Who uses them?
- Is there actually added value to not having to update all bundles when e.g. the project is updated?
- Versioned/unversioned references?

### Prior Art

- [Submission to bundles SOP](https://docs.google.com/document/d/1x8mYLU8ubpZtTrzkwJrft1heqReX-pJLjJfb2X0fd2w) (Apr 2018)
- [Bundle metadata schema update ticket](https://github.com/HumanCellAtlas/metadata-schema/issues/986)

### Alternatives

- Status Quo: no well-defined bundle types. Existing DCP components continue to infer bundle types via ad hoc methods
and included metadata.
