RFC: HCA DCP Domain Oriented Bundle Types and Definitions #119

NoopDog · 2019-09-25T16:53:39Z

October 9: Last call for community review.

Link to document view: here

This RFC is offered as a refinement and generalization of the RFC: HCA DCP Bundle Types and Definitions.

This RFC describes:

1. An application/metadata level mechanism to define bundle types and a simple bundle type hierarchy.
1. An application/metadata level method to specify the type of a bundle.
1. Several bundle types designed to match current usage and near-term expected usage.
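The "simple bundle type hierarchy" can be pictured with a small sketch. The type names below are the ones floated in the review comments on this PR; the parent map and the `is_a` helper are illustrative assumptions, not part of the RFC.

```python
# Hypothetical parent map. The names come from the discussion in this PR;
# the data structure itself is an assumption made for illustration.
PARENT = {
    "Assay": None,
    "Analysis": None,
    "Reference": None,
    "Sequencing Assay": "Assay",
    "Alignment-Expression": "Analysis",
    "Expression Matrix": "Analysis",
    "Clustering": "Analysis",
}


def is_a(bundle_type, ancestor):
    """Return True if bundle_type equals ancestor or descends from it."""
    if bundle_type not in PARENT:
        raise KeyError(f"unknown bundle type: {bundle_type}")
    current = bundle_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT[current]
    return False
```

A consumer interested in any analysis output could then test `is_a(bundle_type, "Analysis")` without enumerating every subclass.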

Key Differences from RFC: HCA DCP Bundle Types and Definitions

This proposal differs from RFC: HCA DCP Bundle Types and Definitions mainly in that:

1. Type information is added to a type.json metadata file rather than to the DSS bundle.json file.
1. Type information is expressed in JSON rather than in RFC 7231 media type syntax.
1. The schema and bundle types are represented as JSON Schema in the metadata schema repo and documented on the Data Portal rather than maintained in a DSS registry.
1. Additional background and motivation is presented.
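To make the second difference concrete, here is a rough sketch contrasting the two encodings. Both the media type string and the JSON field names (`bundle_type`, `bundle_subtype`) are made up for illustration; neither RFC is quoted here.

```python
import json


def parse_media_type(value):
    """Split 'type/subtype; key=value' into (type, subtype, params)."""
    head, *param_parts = [part.strip() for part in value.split(";")]
    main_type, _, subtype = head.partition("/")
    params = {}
    for part in param_parts:
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, val = part.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return main_type.lower(), subtype.lower(), params


# Hypothetical examples of the same type information in each encoding.
MEDIA_TYPE = 'application/json; bundle-type="analysis"'
TYPE_JSON = '{"bundle_type": "analysis", "bundle_subtype": "alignment-expression"}'

media = parse_media_type(MEDIA_TYPE)
doc = json.loads(TYPE_JSON)
```

The JSON form needs no custom parser and can carry nested structure, which is one plausible reading of why the proposal prefers it.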

briandoconnor

Seems like a very good thing to nail down Dave.

Are the provenance links in the JSON DSS UUIDs and versions? Maybe make that more clear?

There should be a plan to document this on a DCP section of the portal. It should be one of our key SOP docs for the project I think.

This needs to be kept up to date over time. If we add new types they need to be documented.

NoopDog · 2019-09-25T18:45:05Z

> Are the provenance links in the JSON DSS UUIDs and versions? Maybe make that more clear?

Yes, provenance.document_id is the DSS UUID of the type.json file. It's not clear to me exactly what provenance.submission_date means in this context, or whether it maps reliably to the DSS file version in general.

I have updated the doc to reflect the above.
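As a minimal sketch of what a consumer might check here: the field names `provenance.document_id` and `provenance.submission_date` are from the comment above, while the sample values and the helper are assumptions. The date is deliberately left unparsed, since its mapping to a DSS version is the open question.

```python
import uuid


def is_uuid(value):
    """Return True if value parses as a UUID.

    Note: uuid.UUID also accepts forms without hyphens, with braces,
    or with a 'urn:uuid:' prefix.
    """
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


# Hypothetical type.json fragment; the values are made up.
type_json = {
    "provenance": {
        "document_id": "6b1c4a5e-0d3f-4c8a-9e2b-7f1a2b3c4d5e",
        "submission_date": "2019-09-25T16:53:39Z",
    }
}

document_id_ok = is_uuid(type_json["provenance"]["document_id"])
```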

diekhans

Only partway through. Generally, I really like the approach, but I have outlined some concerns.

Also, I worry this may be trying to do too much in one step and, at the same time, not enough to produce an easier-to-use system. Large changes are impossible to make without breaking the DCP. We have to take steps, even if temporary, to get to the right place.

diekhans · 2019-09-26T02:00:38Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+
+### Key Differences from RFC: HCA DCP Bundle Types and Definitions
+
+This proposal differs from [RFC: HCA DCP Bundle Types and Definitions](https://github.com/HumanCellAtlas/dcp-community/pull/93/files) mainly in that:


I see this as a disadvantage. The number of JSON objects required to actually understand the bundle is already large. From a client perspective, not including type information in the manifest doesn't seem intuitive. While the DSS strives not to be tied to just one project, this is an implementation detail we should not inflict on users. Even if the generic DSS doesn't have a schema associated with bundle.json, the DCP can add one, with the schema link being a DSS configuration option.

diekhans · 2019-09-26T02:01:07Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+
+1. Type information is added to a type.json  metadata file rather than added to the DSS bundle.json file.
+
+1. Type information is expressed in JSON rather than of [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1) media type syntax.


Now that you mention it, this makes a lot of sense.

diekhans · 2019-09-28T15:06:20Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+
+`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`
+
+# Application Layer Bundle Types and Definitions


I find the title of this RFC confusing, as I don't know what the application layer is and have not seen it as part of any other DCP documentation. Is this the data browser? The API?

I think it would be better to not bring in terminology that isn't currently part of the mental DCP lexicon.

Although, we do need a written DCP lexicon!!

After reading the Wikipedia article "Application layer", the title makes even less sense.

I mean "Application Layer" in the multitier architecture sense (https://en.wikipedia.org/wiki/Multitier_architecture), not in the OSI Layer 7 networking protocol sense.

I have added a note to the top of the doc that should clear the issue about the "Application Layer" term.

Note: I have used the term "Application Layer" extensively in this document and the term seems to be causing some confusion. Let me clarify that I mean "Application Layer" in the multitier architecture sense, as opposed to the "Data Layer". Here the "Applications" are the boxes or clients of DSS. Whoever writes into the database (DSS) needs to validate that the records (bundles) they create are appropriately shaped, as the database (DSS) does not do this.
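The division of responsibility described in the note can be sketched roughly as follows. `InMemoryStore` is a stand-in for a data layer that accepts anything, and the `type.json` / `bundle_type` check is a hypothetical shape rule; neither reflects the real DSS API.

```python
class InMemoryStore:
    """Stand-in data layer: stores whatever it is given, no validation."""

    def __init__(self):
        self.bundles = {}

    def put(self, bundle_id, files):
        self.bundles[bundle_id] = dict(files)


def shape_problems(files):
    """Application-side check; the rule itself is an assumption."""
    problems = []
    descriptor = files.get("type.json")
    if descriptor is None:
        problems.append("missing type.json")
    elif "bundle_type" not in descriptor:
        problems.append("type.json has no bundle_type")
    return problems


def write_bundle(store, bundle_id, files):
    """Validate in the application, then hand the bundle to the store."""
    problems = shape_problems(files)
    if problems:
        raise ValueError("; ".join(problems))
    store.put(bundle_id, files)
```

The store never rejects a bundle, so any shape guarantee has to come from the code path that calls `write_bundle`.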

I will resolve all comments about the "Application Layer" term. Please reopen if you feel this is not clear.

It is still very confusing as the terminology is not used in the DCP and I still don't know what software comprises the application layer. Azul?

I still feel that "application layer" is just introducing confusion from another programming model that is not used in the DCP. "DSS client" or "DSS application" is clearer in reference to the DSS. In terms of the overall DCP,

Maybe this is something to bring up in tech arch???

p.s. I do see the distinction trying to be made here and I think it is important, it is only the term that seems problematic.

diekhans · 2019-10-01T14:20:39Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+
+### Example JSON Files
+
+#### Assay Bundle


Merging links.json, which is part of the experimental metadata, with bundle types defeats the metadata group's goal of separating the metadata layer from the bundle layer.

The goal here is to tag the bundles with the process and protocol that created them in a uniform manner without disrupting the current system. links.json uses different file names for the process and protocol links in analysis bundles than in assay bundles. Here the process and protocol are normalized across bundle types.

Also, using the protocol and process IDs gives the ultimate in type refinement for what is in this bundle.

I do not see this as breaking a separation of concerns. The IDs of the process and protocol used to create the bundle should not be visible only to the "experimental metadata". Our "processing model" should also be able to benefit from the terms created in the experimental model, to the extent that it is useful.
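A small sketch of the lookup problem being described: the file names are spelled as in the RFC text quoted later in this thread (including `analysis_process0.json`); the table and helper are illustrative assumptions.

```python
# File names as written in the RFC text; the mapping itself is an assumption.
CREATING_FILES = {
    "assay": ("process_0.json", "sequencing_protocol_0.json"),
    "analysis": ("analysis_process0.json", "analysis_protocol_0.json"),
}


def creating_files(bundle_type, file_names):
    """Return the (process, protocol) file names for a known bundle type."""
    process, protocol = CREATING_FILES[bundle_type]
    missing = [name for name in (process, protocol) if name not in file_names]
    if missing:
        raise ValueError(f"bundle is missing {missing}")
    return process, protocol


# An analysis bundle also carries the copied-forward assay files.
analysis_bundle = {
    "analysis_process0.json",
    "analysis_protocol_0.json",
    "process_0.json",
    "sequencing_protocol_0.json",
    "links.json",
}
```

Because the copied-forward files are present too, both lookups succeed on `analysis_bundle`, so the caller must already know the bundle type to pick the right pair. A normalized descriptor would remove that dependency.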

If the processes and protocols being referenced are metadata instances, this is very confusing.

Yes, provenance.document_id is the DSS UUID of the type.json file. It's not clear to me exactly what provenance.submission_date means in this context, or whether it maps reliably to the DSS file version in general.

I have updated the doc to reflect the above.

The provenance section is added by ingest. The submission_date may correspond to the ingest; however, it is in a slightly different format, so it should not be counted on. This is problematic, and there is a ticket about this weirdness.

Also, this does not couple the experimental metadata with the bundle metadata, it couples the bundle metadata with a budding processing model in order to define the bundle type in terms of the experimental metadata. This is a use, not a pollution of the experimental metadata.

We can easily add new data to links.json too to address these issues.

TimothyTickle · 2019-10-02T21:51:52Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+
+Analysis bundles use an `analysis_process0.json` and `analysis_protocol_0.json`. Additionally because of copy forward, the `process_0.json` and `sequencing_protocol_0.json` also exist in the analysis bundles.
+
+Having the same information in different places depending on bundle type makes determing the process and protocol that created the bundle more difficlut than it needs to be. 


Suggested change

-Having the same information in different places depending on bundle type makes determing the process and protocol that created the bundle more difficlut than it needs to be.

+Having the same information in different places depending on bundle type makes determining the process and protocol that created the bundle more difficult than it needs to be.

TimothyTickle · 2019-10-02T21:56:29Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+> _Note: It may take some discussion to decide on an effective set of subclasses. Sequencing assay may need to change to “2nd Gen Sequencing Assay” or “NGS Sequencing Assay” for example._
+
+#### Analysis Bundle
+Umbrella type used to represent an analysis with the following possible subclasses:


If the analysis outputs are going to be based on analysis types, this will be a group with many subclasses, as the diversity of analyses performed will only increase. No change requested, just an FYI comment.

TimothyTickle · 2019-10-02T22:01:30Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+1. “Alignment-Expression” - representing our current secondary analysis pipelines.
+
+
+1. “Expression Matrix” - a possible subtype representing the expression matrix generation.


This may simply be "matrix", given that both count and expression matrices can currently exist. I believe in this case the shapes of the data would be the same, but the measurements the matrices hold would be different. If this typing is how you would differentiate matrices that hold counts from matrices that hold expression, it is important to note that they are not equivalent. If the subclassing is how those matrices are differentiated, I would recommend something like "count matrix" and "expression matrix", but once again these subtypes will balloon in number.
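To illustrate "same shape, different measurements": the sketch below derives an expression-style matrix from a count matrix using counts-per-million scaling. The choice of CPM and the rows-as-cells layout are assumptions picked for the example, not something the RFC specifies.

```python
def counts_per_million(counts):
    """Scale each row (assumed to be one cell) so its values sum to 1e6."""
    scaled = []
    for row in counts:
        total = sum(row)
        scaled.append(
            [0.0 if total == 0 else value * 1_000_000 / total for value in row]
        )
    return scaled


# Rows are cells, columns are genes (an assumed layout).
counts = [[10, 0, 30], [5, 5, 0]]
expression = counts_per_million(counts)

same_shape = (
    len(counts) == len(expression)
    and all(len(a) == len(b) for a, b in zip(counts, expression))
)
```

The two matrices are indistinguishable by shape alone, which is why a type label (or some other marker) would be needed to tell them apart.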

TimothyTickle · 2019-10-02T22:02:28Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+
+1. “Clustering” - a possible subtype representing cell clustering analyses output for example.
+
+> _Note: The exact names of the subclasses above can likely be improved upon. Please suggest._


Honestly, it may be good to just get the ones that exist standardized and then move forward as we extend the analysis. There are so many one can imagine we may need but we may not get to for a while.

Yes, these were meant to be suggestive. We would only define what we need to get going.

TimothyTickle · 2019-10-02T22:06:35Z

rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md

+> _Note: The exact names of the subclasses above can likely be improved upon. Please suggest._
+
+#### Reference Bundle
+Umbrella type representing a reference file used in analysis. Subclasses TBD.


I think we will have a better feel for this as we get more data modalities in the system. For now, this may be as simple as human and mouse. It may be good not to use the common name but use some specific ontology for the species names (I am sure our EBI friends have a favorite ontology to use). @lauraclarke

TimothyTickle

I have left some comments, most of which were cautionary and none of which should block this RFC. I feel the definition of bundles should really serve our technical/implementation team and so believe they should determine the fate of the RFC.

theathorn · 2019-10-14T21:32:02Z

rfcs/text/0000-rfc-domain-oriented-bundle-types-and-definitions.md

+
+1. The data_groups concept from [Processing Datasets that Span Multiple Data Collection Runs RFC](https://github.com/HumanCellAtlas/dcp-community/pull/88) is integrated and a corresponding bundle type is specified.
+
+1. The bundle types schema is maintained new ```bundle_schema``` Github repository and documented on the Data Portal rather than maintained in a DSS registry.


grammar "maintained new"?

theathorn · 2019-10-14T21:32:26Z

rfcs/text/0000-rfc-domain-oriented-bundle-types-and-definitions.md

+
+1. The bundle types schema is maintained new ```bundle_schema``` Github repository and documented on the Data Portal rather than maintained in a DSS registry.
+
+1. The bundle.json created by the HCA CLI download is renamed to manifest.json so that a new bundle.json can be used as a "bundle_descriptor" specifying the type information as a nomral file in the bundle.  


nomral -> normal

NoopDog · 2019-11-05T18:04:53Z

Pausing to align with upcoming reproducibility and data citation requirements.

NoopDog · 2019-11-05T18:05:05Z

Pausing to align with upcoming reproducibility and data citation requirements.

NoopDog · 2019-11-05T18:06:57Z

Pausing to align with upcoming reproducibility and data citation requirements.

NoopDog · 2021-04-05T16:09:45Z

Obsoleted

NoopDog self-assigned this Sep 25, 2019

NoopDog added Architecture rfc-community-review and removed rfc-community-review labels Sep 25, 2019

NoopDog mentioned this pull request Sep 25, 2019

RFC: HCA DCP Bundle Types and Definitions #93

Closed

NoopDog added the rfc-community-review label Sep 25, 2019

briandoconnor approved these changes Sep 25, 2019

diekhans requested changes Sep 26, 2019

diekhans reviewed Sep 28, 2019

theathorn reviewed Sep 30, 2019

theathorn reviewed Sep 30, 2019

diekhans reviewed Oct 1, 2019

TimothyTickle reviewed Oct 2, 2019

TimothyTickle approved these changes Oct 2, 2019

NoopDog added 3 commits October 3, 2019 10:56

Initial draft

7d862d1

update with feedback

51f5514

feedback incorporated

76b4ee1

NoopDog force-pushed the drogers-rfc-application-layer-bundle-types-and-definitions branch from e0331c1 to 76b4ee1 Compare October 3, 2019 20:00

NoopDog changed the title ~~RFC: HCA DCP Application Layer Bundle Types and Definitions~~ RFC: HCA DCP Domain Oriented Bundle Types and Definitions Oct 3, 2019

theathorn reviewed Oct 14, 2019


NoopDog added rfc-paused and removed rfc-community-review labels Nov 5, 2019

NoopDog closed this Apr 5, 2021
