Processing Datasets that Span Multiple Data Collection Runs #88

Merged: 117 commits, Sep 30, 2019 (changes shown from 108 commits)

Commits
45b08f3
add multi data collection dataset processing
barkasn Jul 17, 2019
0e0daa9
Amend multi-data-collection-dataset-processing to match template
barkasn Jul 17, 2019
02a0aa2
Document cleanup and adding TODOs
barkasn Jul 18, 2019
78ac972
Added more justification for submission not defining the analysis. Va…
diekhans Jul 19, 2019
1ad087c
restored accidently deleted text (emacs aimed at foot)
diekhans Jul 19, 2019
9cabc99
delete stray character
diekhans Jul 19, 2019
4d5cc79
Minor fixes and cleanup
barkasn Jul 19, 2019
4fd862e
edit passing on summary, motivation, and user stories
diekhans Jul 28, 2019
162df3d
edits to design section
diekhans Jul 29, 2019
7783758
Minor edits while reading
barkasn Jul 30, 2019
34bfa55
cleanups
barkasn Jul 30, 2019
55c5036
working updates
barkasn Jul 30, 2019
c8957a6
refactoring savepoint1
barkasn Jul 30, 2019
4f7ed26
more info on different types
barkasn Jul 30, 2019
d647b7f
added a few comments
diekhans Jul 30, 2019
0f38734
updates to updates section
barkasn Jul 30, 2019
5b5db7e
added question for MD
barkasn Jul 30, 2019
67f83b1
minor changes
barkasn Jul 31, 2019
0dca2a3
add info on handling by analysis of updates
barkasn Jul 31, 2019
3e0be79
example improvements
barkasn Jul 31, 2019
0ead84a
minor changes
barkasn Jul 31, 2019
b2ab1c3
some edits
diekhans Aug 1, 2019
167ca5e
Merge signalling implementation into detailed design
barkasn Aug 1, 2019
aff4845
cleanups
barkasn Aug 1, 2019
b06b46d
provide into to metadata section
barkasn Aug 1, 2019
2672bd9
spelling and grammar; add examples in line, remove alternative soluti…
barkasn Aug 1, 2019
66d23ec
Improve structure to adhere to template
barkasn Aug 1, 2019
f3a3a47
Link to ongoing RFC about bundles
barkasn Aug 1, 2019
88759ac
add image
barkasn Aug 1, 2019
d059228
Add figure 2
barkasn Aug 1, 2019
b3bc779
update fig 2
barkasn Aug 1, 2019
42d5a4c
update figure
barkasn Aug 1, 2019
f4d889b
add figure 1
barkasn Aug 1, 2019
f41c617
insert figure 1 link into text
barkasn Aug 1, 2019
bdb1ff4
update figure 2 to have output bundles
barkasn Aug 2, 2019
647f08f
text update
diekhans Aug 5, 2019
31c27e3
clarified based on input
diekhans Aug 6, 2019
bdf12a9
clarified statement
diekhans Aug 6, 2019
cf69090
changed one word
diekhans Aug 6, 2019
d023534
added more unresolved questions
diekhans Aug 7, 2019
055069d
updates from discussion with data operations group
diekhans Aug 16, 2019
987e76b
RFC: DCP Incident Response Plan (#80)
stahiri Jul 25, 2019
5f86e18
Rename 0000-Incident-Response-Plan.md to 0009-Incident-Response-Plan.md
stahiri Jul 25, 2019
228650a
Editorial updates (#89)
brianraymor Jul 25, 2019
5880874
Hint that the title name should be replaced (#95)
mweiden Jul 31, 2019
6b45769
fixed incorrect naming of images for RFC 0008 (#96)
diekhans Aug 6, 2019
44ccb09
add multi data collection dataset processing
barkasn Jul 17, 2019
2195b7e
Amend multi-data-collection-dataset-processing to match template
barkasn Jul 17, 2019
d0c417e
Document cleanup and adding TODOs
barkasn Jul 18, 2019
bad7c0c
Added more justification for submission not defining the analysis. Va…
diekhans Jul 19, 2019
6a29f71
restored accidently deleted text (emacs aimed at foot)
diekhans Jul 19, 2019
d14b0a1
delete stray character
diekhans Jul 19, 2019
0cb6b6d
Minor fixes and cleanup
barkasn Jul 19, 2019
b6f5cdd
edit passing on summary, motivation, and user stories
diekhans Jul 28, 2019
32f9a41
edits to design section
diekhans Jul 29, 2019
9cb2a61
Minor edits while reading
barkasn Jul 30, 2019
700e6fc
cleanups
barkasn Jul 30, 2019
5a2ca0b
working updates
barkasn Jul 30, 2019
4965207
refactoring savepoint1
barkasn Jul 30, 2019
45e0297
more info on different types
barkasn Jul 30, 2019
3244584
added a few comments
diekhans Jul 30, 2019
46ccf67
updates to updates section
barkasn Jul 30, 2019
12819f5
added question for MD
barkasn Jul 30, 2019
7c1ab49
minor changes
barkasn Jul 31, 2019
1ca98fc
add info on handling by analysis of updates
barkasn Jul 31, 2019
68f1415
example improvements
barkasn Jul 31, 2019
b3f4fdd
minor changes
barkasn Jul 31, 2019
f1a965b
some edits
diekhans Aug 1, 2019
9b26722
Merge signalling implementation into detailed design
barkasn Aug 1, 2019
59bece8
cleanups
barkasn Aug 1, 2019
a024df8
provide into to metadata section
barkasn Aug 1, 2019
62f0a4e
spelling and grammar; add examples in line, remove alternative soluti…
barkasn Aug 1, 2019
0e26cfc
Improve structure to adhere to template
barkasn Aug 1, 2019
0e4d947
Link to ongoing RFC about bundles
barkasn Aug 1, 2019
37232e7
add image
barkasn Aug 1, 2019
ff34f96
Add figure 2
barkasn Aug 1, 2019
b1f572b
update fig 2
barkasn Aug 1, 2019
6668dbe
update figure
barkasn Aug 1, 2019
0aff94e
add figure 1
barkasn Aug 1, 2019
28bebb1
insert figure 1 link into text
barkasn Aug 1, 2019
a4735a0
update figure 2 to have output bundles
barkasn Aug 2, 2019
937af40
text update
diekhans Aug 5, 2019
f6e8d2b
clarified based on input
diekhans Aug 6, 2019
a91c018
clarified statement
diekhans Aug 6, 2019
c0db4c6
changed one word
diekhans Aug 6, 2019
6ef0511
added more unresolved questions
diekhans Aug 7, 2019
8ba35ba
merged final changes before re-review
diekhans Aug 16, 2019
9932b80
Apply suggestions from code review
diekhans Aug 21, 2019
946f3ef
modifications to address comments and fix typos
diekhans Sep 2, 2019
7603d92
added reference on bundling
diekhans Sep 2, 2019
6bbd2e3
text improvements
barkasn Sep 4, 2019
68e579c
Update rfcs/text/0000-multi-data-collection-dataset-processing.md
barkasn Sep 13, 2019
f359b69
Update rfcs/text/0000-multi-data-collection-dataset-processing.md
barkasn Sep 13, 2019
ad4d80b
add unresolved issue on data schema
diekhans Sep 13, 2019
2e2714f
Clarified and address some of the unresolved issues. update figures …
diekhans Sep 15, 2019
b21a514
fixed typo
diekhans Sep 17, 2019
33a4eee
fixed typo
diekhans Sep 17, 2019
9ff4d71
moved information about rejecting collections to Alternatives
diekhans Sep 17, 2019
52d6f98
Fixed typos
barkasn Sep 17, 2019
322b6a2
clarified text about project submission
diekhans Sep 17, 2019
f069f58
removed over enthustic use of adjectives
diekhans Sep 17, 2019
ac5554b
changed working and fixed typos
diekhans Sep 17, 2019
7bce2f7
clarified wording
diekhans Sep 17, 2019
9da164d
fixed link formatting
diekhans Sep 17, 2019
e36e1fb
modifided text to better describe consistency
diekhans Sep 18, 2019
d1266b1
update on implementation feasibility
barkasn Sep 19, 2019
9f5914d
updated to more clearly define scope and rename constant for JSON to …
diekhans Sep 23, 2019
466410b
clarification thanks to on Saman's input
diekhans Sep 23, 2019
1fc19d4
clarified text as requested by samanehsan
diekhans Sep 24, 2019
e6ce46d
updated figure 1 to match text changes
diekhans Sep 25, 2019
03e4a98
pre-merge commit
barkasn Sep 30, 2019
ad1db04
revert the rename for merge with master
barkasn Sep 30, 2019
51116d0
rename with correct number
barkasn Sep 30, 2019
d451765
update image location
barkasn Sep 30, 2019
5301acd
image update
barkasn Sep 30, 2019
436b502
change link
barkasn Sep 30, 2019
37602ba
fix link
barkasn Sep 30, 2019
142 changes: 142 additions & 0 deletions rfcs/text/0000-multi-data-collection-dataset-processing.md
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# RFC: Data Groups: Processing Datasets that Span Multiple Bundles

## Summary
This RFC proposes a solution to allow processing of datasets that span multiple bundles. It addresses the general problem that the current DCP data model contains no representation of data set grouping and version-completeness beyond that of individual data bundles. The current DCP model is restricted in that all data processing must be a one-to-one linear sequence of steps deriving one bundle from another.

However, there are use cases where a given processing step can require input from multiple input bundles, resulting in a directed acyclic graph (DAG) rather than a linear graph. The relevant DAG sub-graph must be defined and all of its associated data must be available to correctly initiate processing. A notion and an implemented representation of data completeness are required to support DAG processing.

This RFC defines a general mechanism for grouping bundles, here called *data groups*. This general method for grouping bundles provides a mechanism for addressing various tasks that cross bundle boundaries, with an impact on analysis, pipelines, algorithmic complexity of data processing, data consistency, and data set quality control. The bundle grouping described in detail here is a submission to a project, either from external producers or from analysis. However, the concept generalizes so that other groupings can be defined to accommodate future needs.

## Author(s)
- [Nick Barkas](mailto:[email protected])
- [Mark Diekhans](mailto:[email protected])

## Shepherd
- [Nick Barkas](mailto:[email protected])

## Motivation
The current DCP model of processing has the scope of a single data bundle, which contains a single assay or analysis. There is no mechanism to know when a group of bundles that need to be processed together is complete and available for further processing. This limits data processing to a linear flow that is entirely based on ingestion packaging, resulting in multiple problems, as outlined below.

#### Multi-input analysis
Analysis pipelines may require input from multiple different assays. To perform such analyses, they must be triggered only when all input data is available. The current mechanism of bundle notifications is not effective when data spans multiple bundles because there is no clear set of rules to identify the bundles that need to be co-processed. This limits the downstream process to the scope of a single bundle and requires Ingest to define the scope of all subsequent processing without foreknowledge of future needs.

There is an immediate need for analysis pipelines to process 10X V2 scRNA-seq datasets that span multiple bundles. The need for co-processing multiple bundles is not, however, limited to this scenario: it extends to any other assay modality that involves repeated sequencing of the same library. Furthermore, the need to co-process data may extend to other data modalities that the DCP will accept in the future.

In the future, a need to run analysis processes where input sets span multiple projects may arise. A mechanism to specify processing data collections that are not restricted to a single project will be required to support this functionality.

#### Algorithmic complexity and performance
Producing combined metadata or data from multiple assay bundles using the bundle event system results in O(N^2) algorithmic complexity, where N is the number of data bundles in some grouping. As each bundle event arrives, it must be combined with accumulated metadata from previous bundles. Additionally, there is a race condition on reading and writing the accumulated results that must be handled as the order of event receipts is not guaranteed.
Review comment (Contributor): Could you elaborate a bit more on what the race condition is here?

Reply (Contributor): Building an accumulated data file in S3 from contents in multiple bundles. For instance, creating a project metadata TSV:

- bundle event A handler reads project TSV
- bundle event B handler reads project TSV
- bundle event A handler adds A metadata to project metadata
- bundle event A handler writes project TSV
- bundle event B handler adds B metadata to project metadata
- bundle event B handler writes project TSV, overwriting A's updates

This is also O(N^2)


One approach to addressing this is to merely queue all data bundle events. This, however, requires knowledge of completion so that processing can be triggered, which is not addressed by the use of a queue. The algorithmic complexity of combining incoming notifications impacts the data browser when creating project metadata TSVs and normalized JSON files, as well as the matrix service when generating pre-built project matrices.
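For intuition about the quadratic cost stated above: under the per-bundle event model, the handler for the k-th arriving bundle event re-reads and rewrites the results accumulated from the previous k-1 bundles, so the total work over a grouping of N bundles is

$$\sum_{k=1}^{N} k = \frac{N(N+1)}{2} = O(N^2).$$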

#### Data consistency and completeness
Consistency and completeness across a data set are important attributes of an atlas. An incomplete data set may result in misinterpretation of data or missed discoveries. While a data set can be updated in the future, there is a need to know when the current version of a data set is fully available. This is referred to as *version-completeness*. Currently, the DCP has no way to indicate version-completeness beyond the bundle level. Neither the data browser nor any consumer API has the information to indicate that a project submission to Ingest is complete. This information is only available within Ingest, with no way to communicate the status to other components. Ensuring consistency between bundles in a group requires modifications to be completed before they are made available. For example, updates to metadata that is duplicated between bundles must all finish before the update is published.

#### Data set quality control
An approach to quality control (QC) is to run a partial or full analysis on a subset of the data before analyzing the complete data set. This requires defining an analyzable subset of the data to pass on to the processing pipelines. A QC data subset must meet the criteria that allow the QC analysis to run.

## User Stories

As a data contributor, I want to maintain flexibility in the laboratory approaches I can use to perform top-up sequencing, multiplex libraries, and arrange replicates on flowcells in a way that minimizes overall cost, while being assured that the processing will be correct.

As a data-wrangler, I want to be able to succinctly enter the metadata I receive from the data contributor and have a clear understanding of what I need to do to ensure correct processing.

As a schema developer, I want to ensure we are capturing, and requiring, the metadata from a contributor that is needed to ensure the pipelines can run appropriately.

As a member of the pipelines team, I want to ensure that the pipelines process the correct data sets and I can easily access the metadata required to trigger the pipelines. I do not want to be restricted in designing analysis based on how data was submitted.

As a data consumer, I want to be able to find all data files that need to be processed together so that I can run my data processing pipeline. I do not want to be restricted in designing analysis based on how data was submitted.

As a data consumer, I want to be confident that the HCA DCP has correctly processed all data files that belong together so that I know the matrices I receive have been generated correctly.

As a data consumer, I need to know that a data set is done so that I don't download and use it until it is version-complete.

## Detailed Design
A new concept of *data group* is added to the DCP data model to address these issues. A data group is defined as a set of specific versions of metadata and data that is complete and consistent according to a pre-specified set of criteria. A *data group* is not created until all its contents are complete and uploaded to the DSS. As with any bundle, events are generated when a *data group* is created, updated, or deleted.

A data group has a scope that specifies what level of the data hierarchy is contained in the group. Scope is the breadth of the type of data represented in the group, not the step in the life-cycle that the group references. The per-bundle notification model has a scope of a single bundle. The above use cases will require a *project* scope that indicates that a submission to a project is complete. There is an ongoing discussion about units of data other than *project*, currently referred to as *data sets*. These could form another data group scope if required.

While the life-cycle step can be inferred by interrogating the bundles referenced by a *data group*, this will most likely be very expensive and non-obvious. Instead, we add the attribute *step* that defines where in the DCP pipeline the *data group* was produced. The exact values will be defined by the producer components. One possible approach is to use the bundle type of the primary data types, for example *hca/analysis-output; hca-pipeline:snap-atac*.

Data groups are implemented as a new bundle type that contains a JSON file
listing the FQIDs (UUIDs with versions) of all bundles in the data group. The
existing DSS subscription mechanism is used for notifications.

A new schema *data_group* will be created that contains the fields:
- *scope* - Symbolic name of the scope, for example, *PROJECT_SCOPE*.
- *step* - Symbolic name of the life-cycle step, for example *hca/primary-data; hca-data-type:sptx*.
- *bundle_fqids* - List of FQIDs of bundles that are in the group.
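
To make this concrete, the following is a minimal sketch of what the contents of such a JSON file could look like. Only the three fields defined above are shown; the scope and step values are taken from the examples above, the FQIDs are placeholders, and any additional envelope fields (such as a schema reference) are left to the schema work discussed below.

```json
{
  "scope": "PROJECT_SCOPE",
  "step": "hca/primary-data; hca-data-type:sptx",
  "bundle_fqids": [
    "11111111-2222-3333-4444-555555555555.2019-09-30T120000.000000Z",
    "66666666-7777-8888-9999-000000000000.2019-09-30T120000.000000Z"
  ]
}
```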



See also:
- [The Ingest Submission Data Model](https://docs.google.com/document/d/1qlgIROPK4qESy9-Lwvva1ZWJ3S5EdZ0HWJiGGvt2JRM/edit#heading=h.8r5egyz5ly82) for a description of submissions.
- [Bundling in ingest](https://docs.google.com/document/d/1cGfolHMGEe1IKkXSBS75SAdSri9Np_oW6Z4S9-Ax1QY/edit#heading=h.xv0ag3c16x15) for details on how bundles are created.

Initially, only the *PROJECT_SCOPE* scope will be implemented. *PROJECT_SCOPE* will indicate that a single submission of data to a project is version-complete and will be generated by Ingest only after all data bundles are committed to the DSS. The generation of the *PROJECT_SCOPE* data group indicates that data in the submission to the project should be processed by downstream components. Ingest is responsible for creating *PROJECT_SCOPE* *data groups* after the relevant bundles have been created. Figure 1 shows a diagram of a project with multiple *data groups*.

![Figure 1](../images/0000-multi-data-collection-data-processing-images/project-submission-in-project.png)


## Application to analysis pipelines

While this RFC proposes a general mechanism, it was developed in response to the needs of analysis to co-process data that are in multiple assay bundles. This section uses co-processing of multiple sequencing runs from a given library as an example of the use of data groups to identify data that should be processed together. A key concept behind this proposal is that upstream submission should not define how downstream analysis groups inputs for processing. The analysis pipelines need to be able to group input in ways that may not be defined at submission time. New analysis pipelines should be able to group inputs in arbitrary ways without requiring changes to the way data is packaged.

### Identifying data to co-process
Pipelines that require grouping data from multiple sequencing assays (co-processing) subscribe to *PROJECT_SCOPE* events instead of per-bundle events. Unlike the one-assay-bundle-per-pipeline-run model, this makes the pipeline framework responsible for collecting assay bundles into pipeline runs. Pipelines that do not require co-processing continue to operate on a per-bundle basis; however, they still need to subscribe to the project data group so they can notify Ingest when processing of the input *data group* is complete for use by downstream consumers, such as Azul.
Review comment (@samanehsan, Sep 24, 2019):
I think this would be more accurate if it said the pipeline service needs to subscribe to the project data group...

The difference is that if every pipeline subscribed to "project data group notifications", whenever a project is submitted the same notification would get sent to each pipeline. This creates a lot of duplicated work in terms of inspecting the contents to determine if that pipeline should be run so it would be more efficient to have a single subscription to data group notifications. However, if there's a way to query the project data group bundles without requesting all of the metadata files from the data store then one subscription/pipeline would be fine.

Reply (Contributor): thanks, @samanehsan, the text has been clarified.


For full background and a detailed proposal on how these libraries are represented in the metadata, see the RFC [Representing sequencing library preparations in the HCA DCP metadata standard](https://github.com/HumanCellAtlas/dcp-community/blob/68f0c42fc9bd55297be8c241663721271c90b694/rfcs/text/0010-rfc-library-preparation.md).

The *data group* submission event does not define bundles to be co-processed; it indicates that data to be co-processed is contained within the *data group* and that the group fulfills version-completeness requirements. Co-processing will be initiated by pipelines based on a well-defined set of rules that is data modality-specific. The partition of a submission *data group* into co-processing units may require multiple parameters. For instance, 10X V2 3’ scRNA-seq data is grouped by library; however, a submission may contain non-10X data modalities that need to be processed separately.

An analysis pipeline event handler component will be created that is responsible for partitioning a *data group* into co-processing units, dispatching the pipelines, and tracking completion. This is the responsibility of Analysis. When analysis results are submitted to Ingest, this creates a secondary analysis *data group*.
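
As an illustration only, the sketch below shows one way such an event handler could partition a *PROJECT_SCOPE* *data group* into co-processing units: 10X V2 assay bundles are grouped by library, while other modalities fall back to one unit per bundle. Every name here, including the metadata fields and the metadata lookup function, is hypothetical and not defined by this RFC; the real grouping rules would be specified per data modality.

```python
from collections import defaultdict

def partition_project_data_group(data_group, get_bundle_metadata):
    """Split a PROJECT_SCOPE data group into co-processing units.

    `data_group` is the parsed contents of the data_group JSON file and
    `get_bundle_metadata` is a caller-supplied function returning a small
    metadata summary for a bundle FQID (both hypothetical here).
    """
    units = defaultdict(list)
    for fqid in data_group["bundle_fqids"]:
        meta = get_bundle_metadata(fqid)
        if meta.get("library_construction_method") == "10X v2 sequencing":
            # All sequencing runs of the same library are co-processed.
            units[("10x_v2", meta["library_preparation_id"])].append(fqid)
        else:
            # Other modalities keep the existing one-bundle-per-run behaviour.
            units[("per_bundle", fqid)].append(fqid)
    return dict(units)

# Each resulting unit would then be dispatched as a single pipeline run and
# tracked until its analysis results are submitted back to Ingest.
```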

The creation and update of *PROJECT_SCOPE* *data groups*, both primary and secondary, is the responsibility of the Ingest system.

![Figure 2](../images/0000-multi-data-collection-data-processing-images/flow-of-information.png)
Figure 2: Flow of information in the proposed data model

#### Updates to *data groups*
Updates to *PROJECT_SCOPE* submissions must result in updates to the associated *data groups* to trigger reprocessing. When part or all of a project submission is updated, the associated data group must be identified and updated with the new bundle versions (if bundles have been replaced) and/or with additional new bundles (if bundles have been added). This is the responsibility of Ingest. Analysis must handle the update to the *data group* by initiating only the pipelines that have been affected by the update and must ensure that the output bundles of this analysis are updates to existing bundles.

*Data group* updates may result from creating new bundles to add additional types of data to a project, or updating bundles to change metadata or append data to data files (e.g. FASTQ top-up).
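
A minimal sketch, reusing the hypothetical partitioning function above, of how Analysis could narrow an updated *data group* down to the affected co-processing units by comparing the units derived from the previous and the updated *data group* versions (again, nothing here is prescribed by this RFC):

```python
def affected_units(old_units, new_units):
    """Return the co-processing units whose input bundles changed.

    `old_units` and `new_units` map a unit key (e.g. a library) to the list
    of bundle FQIDs obtained by partitioning the previous and the updated
    data group versions respectively.
    """
    changed = {}
    for key, fqids in new_units.items():
        if sorted(old_units.get(key, [])) != sorted(fqids):
            # New unit, or an existing unit with added or re-versioned bundles.
            changed[key] = fqids
    return changed
```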

#### Implementation feasibility

Both Analysis and Ingest confirm that implementation of the data_groups bundle proposed here is feasible in principle. Analysis indicates that significant changes need to be made to the existing infrastructure to handle the new type of notification. Secondary data_groups of scope *PROJECT_SCOPE* will be generated when Analysis submits its results to Ingest, and Ingest will identify the bundles included in the secondary group.

The following order of implementation is proposed, accounting for the dependencies between components to ensure uninterrupted system functionality:

1) Ingest adds functionality for supporting data_groups and starts using them for signalling submission completion
2) Analysis adds a handler for data_groups but continues to use bundle events during a transition period
3) Analysis switches to using data_group events
4) Ingest adds support for secondary data_groups

Review comment: I think we can get up to here currently but going beyond requires resolution of the analysis updates RFC.

5) Analysis starts notifying Ingest of the existence of data_groups
6) Downstream components (Matrix Service, Portal) start listening for secondary data_groups

### Unresolved Questions

Review comment: I think it is very important to think about how this will stretch beyond PROJECT_SUBMISSION. It doesn't need to be high detail but I would need a warm feeling that we are not heading down a development path that can't cater to later requirements.

Reply: As below, on reconsideration I don't think this is too important to answer now as I suspect PROJECT_SUBMISSION is needed in pretty much all circumstances.

Review comment: Could you consider whether much of this can be implemented by subscriptions to the query service? This would still need to be supplemented by ingest signals of submission but it may allow more flexibility for future requirements, where those can be catered for by subscription requirements rather than new data model entities in ingest, with all the work and coordination that involves.

Reply (Contributor): I suspect it could be implemented by query service events, although it is imagined that the QS will use project data groups to populate the database.

Without understanding more about how QS events will integrate into the system, using the DSS event mechanism seems like a conservative step.

@MDunitz @kislyuk, any thoughts?

Reply: I think this is still an important question but even with this, some submission-finished event (here corresponding to PROJECT_SUBMISSION) would still be necessary. Hence, I don't think it needs to be answered here.


#### Aspects of the design to be resolved during RFC review
* Is *step* the best terminology for the type of a *data group*?

#### Aspects of the design to be resolved during implementation:
* Where does the *data group* JSON schema belong? It is not part of the experimental data model, so it doesn't fit in the current hierarchy.

* How is *step* defined? By reusing bundle type definitions or another mechanism?

#### Aspects of the design to be resolved in the future
* The *data group* concept relates directly to the UX discussion on *experiment data sets*. This RFC and that work could be unified with a direct mapping between *experiment* and *data groups*.

## Dependencies

Implementation of this RFC is partly dependent on [RFC: HCA DCP Bundle Types and Definitions](https://github.com/HumanCellAtlas/dcp-community/pull/93), however, introspection of bundles could be used as a workaround.

### Alternatives

- DSS collections were considered as a possible implementation; however, they don't support subscriptions and are not indexed, so they are not a sufficient mechanism for implementing *data groups*.