Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: HCA DCP Domain Oriented Bundle Types and Definitions #119

Conversation

NoopDog
Copy link

@NoopDog NoopDog commented Sep 25, 2019

October 9: Last call for community review.

Link to document view: here

This RFC is offered as a refinement and generalization of the RFC: HCA DCP Bundle Types and Definitions.

This RFC describes:

  1. An application/metadata level mechanism to define bundle types and a simple bundle type hierarchy.

  2. An application/metadata level method to specify the type of a bundle.

  3. Several bundle types designed to match current usage and near term expected usage.

Key Differences from RFC: HCA DCP Bundle Types and Definitions

This proposal differs from RFC: HCA DCP Bundle Types and Definitions mainly in that:

  1. Type information is added to a type.json metadata file rather than added to the DSS bundle.json file.

  2. Type information is expressed in JSON rather than of RFC 7231 media type syntax.

  3. The schema and bundle types are represented as JSON schema in the metadata schema repo and documented on the Data Portal rather than maintained in a DSS registry.

  4. Additional background and motivation is presented.

Copy link

@briandoconnor briandoconnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a very good thing to nail down Dave.

Are the provenance links in the JSON DSS UUIDs and versions? Maybe make that more clear?

There should be a plan to document this on a DCP section of the portal. It should be one of our key SOP docs for the project I think.

This needs to be kept up to date over time. If we add new types they need to be documented.

@NoopDog
Copy link
Author

NoopDog commented Sep 25, 2019

Are the provenance links in the JSON DSS UUIDs and versions? Maybe make that more clear?

Yes provenance.document_id is the type.json file DSS UUID. It's not clear to me exactly what the provenance.submission_date means in this context and if this maps reliably to the DSS file version in general.

I have updated the doc to reflect the above.

Copy link
Contributor

@diekhans diekhans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only partway. Generally, I really like the approach, but have outlined some concerns.

Also, I worry this may be trying to do too much in one step and at the same time not enough to have an easier to use system. Large changes are impossible to make without break the DCP. We have to take steps, even if temporary, to get to the right place.


### Key Differences from RFC: HCA DCP Bundle Types and Definitions

This proposal differs from [RFC: HCA DCP Bundle Types and Definitions](https://github.com/HumanCellAtlas/dcp-community/pull/93/files) mainly in that:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as a disadvantage. The number of JSON objects required to actually understand the bundle is already large. From a client perspective, not including type information in the manifest doesn't seem intuitive. While the DSS strives to not be tied to just one project, this is an implementation detail we should not inflict on users. Even if the generic DSS doesn't have a schema associated with bundle.json, the DCP can add one, which the schema link being a DSS configuration option.


1. Type information is added to a type.json metadata file rather than added to the DSS bundle.json file.

1. Type information is expressed in JSON rather than of [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1) media type syntax.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention it, this makes a lot of sense


`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# Application Layer Bundle Types and Definitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the title of this RFC confusing, as I don't know what the application layer and have not seen it as part of any other DCP documentation. Is this the data browser? the API?

I think it would be better to not bring in terminology that isn't currently part of the mental DCP lexicon.

Although, we do need a written DCP lexicon!!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading wikipedia: Application layer, the title makes even less sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean "Application Layer" in the multi tier architecture sense https://en.wikipedia.org/wiki/Multitier_architecture

Not the OSI Layer 7 networking protocol layer sense.

Screen Shot 2019-10-01 at 10 44 28 AM

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a note to the top of the doc that should clear the issue about the "Application Layer" term.

Note: I have used the term "Application Layer" extensively in this document and the term seems to be causing some confusion. Let me clarify that I mean "Application Layer in the Multi Tier Architecture sense as opposed to the "Data Layer". Here the "Appplications" are the boxes or clients of DSS. Whoever writes into the database (DSS) needs to validate the records (bundles) they create are appropriately shaped as the database (DSS) does not to this.

I will resolve all comments about the "Application Layer" term. Please reopen if you feel this is not clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still very confusing as the terminology is not used in the DCP and I still don't know what software comprises the application layer. Azul?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel that "application layer" is just introducing confusion from another programming model that is not used in the DCP. "DSS client" or "DSS application" is clearer in reference to the DSS. In terms of the overall DCP,

Maybe this is something to bring up in tech arch???

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. I do see the distinction trying to be made here and I think it is important, it is only the term that seems problematic.


### Example JSON Files

#### Assay Bundle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging links.json, which is part of the experimental metadata with bundle types defeats the goal of the metadata group at separating the metadata layer from the bundle layer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is to tag the bundles with the process and protocol that created them in a uniform manner without disrupting the current system. Links.json use different file names for analysis vs assay bundles for the process and protocol links. Here the process and protocol are normalized across bundle types.

Also by using the protocol and process id this gives the ultimate in type refinement on what is in this bundle.

I do not see this as breaking a separation of concerns. The ids of the process and protocol used to create the bundle should not be uniquely visible to the "experimental metadata". Our "processing model" as well should be able to benefit from the terms created in the experimental model to the extent that it is useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It the processes and protocols being referenced are metadata instances, this is very confusing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes provenance.document_id is the type.json file DSS UUID. It's not clear to me exactly what the provenance.submission_date means in this context and if this maps reliably to the DSS file version in general.

I have updated the doc to reflect the above.

Provenance section is added by ingest. The submission_date may correspond to the ingest, however, is in a slightly different format so it should not be counted on. This is problematic and there is a ticket about this weirdness.

Copy link
Author

@NoopDog NoopDog Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this does not couple the experimental metadata with the bundle metadata, it couples the bundle metadata with a budding processing model in order to define the bundle type in terms of the experimental metadata. This is a use, not a pollution of the experimental metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can easily add new data to links.json too to address these issues.


Analysis bundles use an `analysis_process0.json` and `analysis_protocol_0.json`. Additionally because of copy forward, the `process_0.json` and `sequencing_protocol_0.json` also exist in the analysis bundles.

Having the same information in different places depending on bundle type makes determing the process and protocol that created the bundle more difficlut than it needs to be.
Copy link
Contributor

@TimothyTickle TimothyTickle Oct 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Having the same information in different places depending on bundle type makes determing the process and protocol that created the bundle more difficlut than it needs to be.
Having the same information in different places depending on bundle type makes determining the process and protocol that created the bundle more difficult than it needs to be.

> _Note: It may take some discussion to decide on an effective set of subclasses. Sequencing assay may need to change to “2nd Gen Sequencing Assay” or “NGS Sequencing Assay” for example._

#### Analysis Bundle
Umbrella type used to represent an analysis with the following possible subclasses:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the analysis outputs are going to be based on analysis types this will be a group with many subclasses as the diversity of analysis performed will only increase. No change requested, just FYI comment.

1. “Alignment-Expression” - representing our current secondary analysis pipelines.


1. “Expression Matrix” - a possible subtype representing the expression matrix generation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may simply be "matrix" given both count and expression matrices can currently exist. I believe in this case the shapes of the data would be the same but the measurements the matrices hold would be different. If this typing is how you would differentiate matrices that hold counts from matrices that hold expression it is important to note that they are not equivalent. In the case the subclassing is how those matrices are differentiated, I would recommend something like "count matrix" and "expression matrix" but once again these subtypes will balloon in number.


1. “Clustering” - a possible subtype representing cell clustering analyses output for example.

> _Note: The exact names of the subclasses above can likely be improved upon. Please suggest._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, it may be good to just get the ones that exist standardized and then move forward as we extend the analysis. There are so many one can imagine we may need but we may not get to for a while.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these were meant to be suggestive. We would only define what we need to get going.

> _Note: The exact names of the subclasses above can likely be improved upon. Please suggest._

#### Reference Bundle
Umbrella type representing a reference file used in analysis. Subclasses TBD.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will have a better feel for this as we get more data modalities in the system. For now, this may be as simple as human and mouse. It may be good not to use the common name but use some specific ontology for the species names (I am sure our EBI friends have a favorite ontology to use). @lauraclarke

Copy link
Contributor

@TimothyTickle TimothyTickle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left some comments, most of which were cautionary and none of which should block this RFC. I feel the definition of bundles should really serve our technical/implementation team and so believe they should determine the fate of the RFC.

@NoopDog NoopDog force-pushed the drogers-rfc-application-layer-bundle-types-and-definitions branch from e0331c1 to 76b4ee1 Compare October 3, 2019 20:00
@NoopDog NoopDog changed the title RFC: HCA DCP Application Layer Bundle Types and Definitions RFC: HCA DCP Domain Oriented Bundle Types and Definitions Oct 3, 2019

1. The data_groups concept from [Processing Datasets that Span Multiple Data Collection Runs RFC](https://github.com/HumanCellAtlas/dcp-community/pull/88) is integrated and a corresponding bundle type is specified.

1. The bundle types schema is maintained new ```bundle_schema``` Github repository and documented on the Data Portal rather than maintained in a DSS registry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar "maintained new"?


1. The bundle types schema is maintained new ```bundle_schema``` Github repository and documented on the Data Portal rather than maintained in a DSS registry.

1. The bundle.json created by the HCA CLI download is renamed to manifest.json so that a new bundle.json can be used as a "bundle_descriptor" specifying the type information as a nomral file in the bundle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nomral -> normal

@NoopDog
Copy link
Author

NoopDog commented Nov 5, 2019

Pausing to align with upcoming reproducibility and data citation requirements.

1 similar comment
@NoopDog
Copy link
Author

NoopDog commented Nov 5, 2019

Pausing to align with upcoming reproducibility and data citation requirements.

@NoopDog
Copy link
Author

NoopDog commented Nov 5, 2019

Pausing to align with upcoming reproducibility and data citation requirements.

@NoopDog
Copy link
Author

NoopDog commented Apr 5, 2021

Obsoleted

@NoopDog NoopDog closed this Apr 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants