-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: HCA DCP Domain Oriented Bundle Types and Definitions #119
RFC: HCA DCP Domain Oriented Bundle Types and Definitions #119
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a very good thing to nail down Dave.
Are the provenance links in the JSON DSS UUIDs and versions? Maybe make that more clear?
There should be a plan to document this on a DCP section of the portal. It should be one of our key SOP docs for the project I think.
This needs to be kept up to date over time. If we add new types they need to be documented.
Yes I have updated the doc to reflect the above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only partway. Generally, I really like the approach, but have outlined some concerns.
Also, I worry this may be trying to do too much in one step and at the same time not enough to have an easier to use system. Large changes are impossible to make without break the DCP. We have to take steps, even if temporary, to get to the right place.
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
|
||
### Key Differences from RFC: HCA DCP Bundle Types and Definitions | ||
|
||
This proposal differs from [RFC: HCA DCP Bundle Types and Definitions](https://github.com/HumanCellAtlas/dcp-community/pull/93/files) mainly in that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this as a disadvantage. The number of JSON objects required to actually understand the bundle is already large. From a client perspective, not including type information in the manifest doesn't seem intuitive. While the DSS strives to not be tied to just one project, this is an implementation detail we should not inflict on users. Even if the generic DSS doesn't have a schema associated with bundle.json, the DCP can add one, which the schema link being a DSS configuration option.
|
||
1. Type information is added to a type.json metadata file rather than added to the DSS bundle.json file. | ||
|
||
1. Type information is expressed in JSON rather than of [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1) media type syntax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you mention it, this makes a lot of sense
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
|
||
`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)` | ||
|
||
# Application Layer Bundle Types and Definitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the title of this RFC confusing, as I don't know what the application layer and have not seen it as part of any other DCP documentation. Is this the data browser? the API?
I think it would be better to not bring in terminology that isn't currently part of the mental DCP lexicon.
Although, we do need a written DCP lexicon!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading wikipedia: Application layer, the title makes even less sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean "Application Layer" in the multi tier architecture sense https://en.wikipedia.org/wiki/Multitier_architecture
Not the OSI Layer 7 networking protocol layer sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a note to the top of the doc that should clear the issue about the "Application Layer" term.
Note: I have used the term "Application Layer" extensively in this document and the term seems to be causing some confusion. Let me clarify that I mean "Application Layer in the Multi Tier Architecture sense as opposed to the "Data Layer". Here the "Appplications" are the boxes or clients of DSS. Whoever writes into the database (DSS) needs to validate the records (bundles) they create are appropriately shaped as the database (DSS) does not to this.
I will resolve all comments about the "Application Layer" term. Please reopen if you feel this is not clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still very confusing as the terminology is not used in the DCP and I still don't know what software comprises the application layer. Azul?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still feel that "application layer" is just introducing confusion from another programming model that is not used in the DCP. "DSS client" or "DSS application" is clearer in reference to the DSS. In terms of the overall DCP,
Maybe this is something to bring up in tech arch???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p.s. I do see the distinction trying to be made here and I think it is important, it is only the term that seems problematic.
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
rfcs/text/0000-rfc-application-layer-bundle-types-and-definitions.md
Outdated
Show resolved
Hide resolved
|
||
### Example JSON Files | ||
|
||
#### Assay Bundle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging links.json, which is part of the experimental metadata with bundle types defeats the goal of the metadata group at separating the metadata layer from the bundle layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is to tag the bundles with the process and protocol that created them in a uniform manner without disrupting the current system. Links.json use different file names for analysis vs assay bundles for the process and protocol links. Here the process and protocol are normalized across bundle types.
Also by using the protocol and process id this gives the ultimate in type refinement on what is in this bundle.
I do not see this as breaking a separation of concerns. The ids of the process and protocol used to create the bundle should not be uniquely visible to the "experimental metadata". Our "processing model" as well should be able to benefit from the terms created in the experimental model to the extent that it is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It the processes and protocols being referenced are metadata instances, this is very confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
provenance.document_id
is the type.json file DSS UUID. It's not clear to me exactly what theprovenance.submission_date
means in this context and if this maps reliably to the DSS file version in general.I have updated the doc to reflect the above.
Provenance section is added by ingest. The submission_date may correspond to the ingest, however, is in a slightly different format so it should not be counted on. This is problematic and there is a ticket about this weirdness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this does not couple the experimental metadata with the bundle metadata, it couples the bundle metadata with a budding processing model in order to define the bundle type in terms of the experimental metadata. This is a use, not a pollution of the experimental metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can easily add new data to links.json too to address these issues.
|
||
Analysis bundles use an `analysis_process0.json` and `analysis_protocol_0.json`. Additionally because of copy forward, the `process_0.json` and `sequencing_protocol_0.json` also exist in the analysis bundles. | ||
|
||
Having the same information in different places depending on bundle type makes determing the process and protocol that created the bundle more difficlut than it needs to be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the same information in different places depending on bundle type makes determing the process and protocol that created the bundle more difficlut than it needs to be. | |
Having the same information in different places depending on bundle type makes determining the process and protocol that created the bundle more difficult than it needs to be. |
> _Note: It may take some discussion to decide on an effective set of subclasses. Sequencing assay may need to change to “2nd Gen Sequencing Assay” or “NGS Sequencing Assay” for example._ | ||
|
||
#### Analysis Bundle | ||
Umbrella type used to represent an analysis with the following possible subclasses: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the analysis outputs are going to be based on analysis types this will be a group with many subclasses as the diversity of analysis performed will only increase. No change requested, just FYI comment.
1. “Alignment-Expression” - representing our current secondary analysis pipelines. | ||
|
||
|
||
1. “Expression Matrix” - a possible subtype representing the expression matrix generation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may simply be "matrix" given both count and expression matrices can currently exist. I believe in this case the shapes of the data would be the same but the measurements the matrices hold would be different. If this typing is how you would differentiate matrices that hold counts from matrices that hold expression it is important to note that they are not equivalent. In the case the subclassing is how those matrices are differentiated, I would recommend something like "count matrix" and "expression matrix" but once again these subtypes will balloon in number.
|
||
1. “Clustering” - a possible subtype representing cell clustering analyses output for example. | ||
|
||
> _Note: The exact names of the subclasses above can likely be improved upon. Please suggest._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, it may be good to just get the ones that exist standardized and then move forward as we extend the analysis. There are so many one can imagine we may need but we may not get to for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these were meant to be suggestive. We would only define what we need to get going.
> _Note: The exact names of the subclasses above can likely be improved upon. Please suggest._ | ||
|
||
#### Reference Bundle | ||
Umbrella type representing a reference file used in analysis. Subclasses TBD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will have a better feel for this as we get more data modalities in the system. For now, this may be as simple as human and mouse. It may be good not to use the common name but use some specific ontology for the species names (I am sure our EBI friends have a favorite ontology to use). @lauraclarke
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left some comments, most of which were cautionary and none of which should block this RFC. I feel the definition of bundles should really serve our technical/implementation team and so believe they should determine the fate of the RFC.
e0331c1
to
76b4ee1
Compare
|
||
1. The data_groups concept from [Processing Datasets that Span Multiple Data Collection Runs RFC](https://github.com/HumanCellAtlas/dcp-community/pull/88) is integrated and a corresponding bundle type is specified. | ||
|
||
1. The bundle types schema is maintained new ```bundle_schema``` Github repository and documented on the Data Portal rather than maintained in a DSS registry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grammar "maintained new"?
|
||
1. The bundle types schema is maintained new ```bundle_schema``` Github repository and documented on the Data Portal rather than maintained in a DSS registry. | ||
|
||
1. The bundle.json created by the HCA CLI download is renamed to manifest.json so that a new bundle.json can be used as a "bundle_descriptor" specifying the type information as a nomral file in the bundle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nomral -> normal
Pausing to align with upcoming reproducibility and data citation requirements. |
1 similar comment
Pausing to align with upcoming reproducibility and data citation requirements. |
Pausing to align with upcoming reproducibility and data citation requirements. |
Obsoleted |
October 9: Last call for community review.
Link to document view: here
This RFC is offered as a refinement and generalization of the RFC: HCA DCP Bundle Types and Definitions.
This RFC describes:
An application/metadata level mechanism to define bundle types and a simple bundle type hierarchy.
An application/metadata level method to specify the type of a bundle.
Several bundle types designed to match current usage and near term expected usage.
Key Differences from RFC: HCA DCP Bundle Types and Definitions
This proposal differs from RFC: HCA DCP Bundle Types and Definitions mainly in that:
Type information is added to a type.json metadata file rather than added to the DSS bundle.json file.
Type information is expressed in JSON rather than of RFC 7231 media type syntax.
The schema and bundle types are represented as JSON schema in the metadata schema repo and documented on the Data Portal rather than maintained in a DSS registry.
Additional background and motivation is presented.