From c502787de3e4371e632cf26e6c9a348c92d68b04 Mon Sep 17 00:00:00 2001
From: maniarathi
Date: Tue, 5 Nov 2019 09:19:04 -0800
Subject: [PATCH 1/7] Adding RFC that contains pointers to old RFCs that were created pre-RFC era.

---
 ...0000-pre-rfc-dcp-architecture-decisions.md | 52 +++++++++++++++++++
 1 file changed, 52 insertions(+)
 create mode 100644 rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md

diff --git a/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md b/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
new file mode 100644
index 00000000..531ee0e3
--- /dev/null
+++ b/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
@@ -0,0 +1,52 @@
+### DCP PR:
+
+***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*
+`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/)`
+
+# Table of Contents Grandfathering Former Design Decisions and Documents
+
+## Summary
+
+This RFC details a table of contents containing pointers to previous, relevant design decisions and documents that were made and created prior to the formalization of the current [RFC process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md).
+
+## Author(s)
+
+[Arathi Mani](mailto:arathi.mani@chanzuckerberg.com)
+
+## Shepherd
+
+[Arathi Mani](mailto:arathi.mani@chanzuckerberg.com)
+
+## Motivation
+
+Prior to debut of the [RFC process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md) a number of design decisions were made for the DCP Architecture that are still valuable to reference. Oftentime these documents are hard to find in the depths of the DCP Google Drive. This Informational RFC pulls important design documents that are still relevant as of the creation of this RFC and references them in a table of contents format here.
+
+### User Stories
+
+* As a DCP developer, I would like to remember and recall certain design decisions that impact the current DCP architecture and be able to find the designs with ease.
+
+## Table of Contents
+
+### Metadata Schema Decoupling
+
+**Last edited: June 18th, 2018**
+
+Summary: This document details sites of tight coupling between DCP code and the metadata schema and provides ideas of decoupling the two.
+
+[Link to document](https://docs.google.com/document/d/1ZPx0Q7flHpab-BBPiYGQsVT84epXd7ISZA7TTQPdVwE/edit)
+
+### DCP Media Types
+
+**Last edited: October 20th, 2017**
+
+Summary: DCP processing involves the transfer and storage of a number of different types of file. To indicate the type of each file to the DCP, we will use Internet media types during data movement and file storage.
+
+[Link to document](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#)
+
+### DCP DNS Design
+
+**Last edited: September 19th, 2018**
+
+Summary: Design of the humancellatlas.org domain.
+
+[Link to document](https://docs.google.com/document/d/1IcLzWvBzpPDnjPfUqReU-HpOMRP8TmSpxhINvMKDJVg/edit#heading=h.ez2eqrkr6i2p)
\ No newline at end of file
From 99fc51fd4b44913398472f41c3ae4c96a67a5c2c Mon Sep 17 00:00:00 2001
From: maniarathi
Date: Tue, 5 Nov 2019 09:21:10 -0800
Subject: [PATCH 2/7] Adding Andrey as an author.

---
 rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md b/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
index 531ee0e3..2c7bbd12 100644
--- a/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
+++ b/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
@@ -12,6 +12,7 @@ This RFC details a table of contents containing pointers to previous, relevant d
 ## Author(s)
 
 [Arathi Mani](mailto:arathi.mani@chanzuckerberg.com)
+[Andrey Kislyuk](mailto:akislyuk@chanzuckerberg.com)
 
 ## Shepherd
 
From 7d2c0f8955d153d9bd71b3bb74b56de75116f39b Mon Sep 17 00:00:00 2001
From: maniarathi
Date: Wed, 6 Nov 2019 13:43:17 -0800
Subject: [PATCH 3/7] Bringing back specific media types RFC.

---
 ...0000-pre-rfc-dcp-architecture-decisions.md | 132 +++++++++++++++---
 1 file changed, 114 insertions(+), 18 deletions(-)

diff --git a/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md b/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
index 2c7bbd12..ff792941 100644
--- a/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
+++ b/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
@@ -3,15 +3,24 @@
 ***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*
 `[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/)`
 
-# Table of Contents Grandfathering Former Design Decisions and Documents
+# DCP Media Types
+
+Please note that this RFC has been imported its original Google Doc and went through a design review process prior to the RFC Process implementation. The original document is [here](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#heading=h.87ix45a71erf). Please be aware that the contents below may potentially be out-of-date as the last-modified date is October 20th, 2017.
 
 ## Summary
 
-This RFC details a table of contents containing pointers to previous, relevant design decisions and documents that were made and created prior to the formalization of the current [RFC process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md).
+DCP processing involves the transfer and storage of a number of different types of file. To indicate the type of each file to the DCP, we will use Internet media types [RFC 2046](https://tools.ietf.org/html/rfc2046) during data movement and file storage.
 
 ## Author(s)
 
+Original RFC Author:
+
+[Sam Pierson](mailto:spierson@chanzuckerberg.com)
+
+Transcription Authors:
+
 [Arathi Mani](mailto:arathi.mani@chanzuckerberg.com)
+
 [Andrey Kislyuk](mailto:akislyuk@chanzuckerberg.com)
 
 ## Shepherd
@@ -20,34 +29,121 @@ This RFC details a table of contents containing pointers to previous, relevant d
 
 ## Motivation
 
-Prior to debut of the [RFC process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md) a number of design decisions were made for the DCP Architecture that are still valuable to reference. Oftentime these documents are hard to find in the depths of the DCP Google Drive. This Informational RFC pulls important design documents that are still relevant as of the creation of this RFC and references them in a table of contents format here.
+The DCP should not invent new media types; we should append `dcp-type=metadata` or `dcp-type=data` to our Content-Types using the media type "parameter" syntax, e.g. `Content-Type: application/json; dcp-type=”metadata/sample”`. Instead the DCP should strive to use "in-band" media type communication mechanisms where possible.
+
+## Detailed Design
+
+Some DCP subsystems need to know the type of files. For example differentiating between data files and metadata files is important for the Ingestion and Secondary Analysis services.
+
+There are several different strategies we could use to identify DCP file types:
+
+1) Store a DCP media type in headers/metadata separate from the normal Content-Type
+2) Define new media types / subtypes
+    - and register them with IANA
+    - and use them without registering them
+3) Use the personal ("_per._") media type subtype tree.
+4) Use the vendor ("_vnd._") media type subtype tree.
+5) Use the unregistered ("_x._") media type subtype tree.
+6) Use media type parameters.
+
+There are different arguments against each of options 1 through 5, and one overriding argument that they all share, which is:
+
+**There really isn’t such a thing as a DCP specific type of file. In reality we are using standard types of files which are interpreted in DCP specific ways. E.g. a `.fastq.gz` data file isn’t a DCP-specific format, it is a gzipped text file.**
+
+This leads us to option 6, using Parameters. Parameters are additional information about a file type. They are not highly controlled and there may be an arbitrary number of them.
+
+### Proposal
+
+We add a media type parameter "`dcp-type`" to the end of the media type. This parameter will indicate the type of the file from the DCP’s perspective. When used with quotes, the parameter can contain slashes ("/") so we can subtype (just like regular media types).
+
+#### Example Media Types
+
+    application/json; dcp-type=metadata
+    application/json; dcp-type="metadata/sample"
+    application/json; dcp-type="metadata/assay"
+
+    application/octet-stream; dcp-type=data
+    application/gzip; dcp-type=data
+
+#### dcp-type Parameter Values
+
+To communicate possible values for the dcp-type parameter to parties who will use it, let us enumerate all the valid values in this table:
+
+| **dcp-type=** | **Description** |
+| ------------- | --------------- |
+| data | A data file |
+| "metadata/assay" | A JSON object describing an assay |
+| "metadata/sample" | A JSON object describing a sample |
+| "metadata/protocol" | A JSON object describing a protocol |
+| "metadata/project" | A JSON object describing a project |
+| "metadata/analysis" | A JSON object describing an analysis |
+
+#### Compressed Data
+
+The use of media type suffix `+zip` is acceptable if the data is zipped, e.g. `application/octet-stream+zip`. However, note that there is no suffix for GZip, and therefore `application/gzip` must be used. HTTP also provides for a `Content-Encoding` header that may indicate that data is compressed, but that is inappropriate in this case as files will remain compressed after transit and Content-Encoding is only used for data in flight and is not supported by storage services such as S3.
+
+### Communicating Media Type to, and between DCP Components
+
+Currently, files enter the DCP in 2 places:
+
+1) Metadata files are deposited by Ingest using the [Upload Service API](https://upload.staging.data.humancellatlas.org/).
+2) Data files are uploaded by submitters using the Upload Service commands in the DCP CLI.
+
+Metadata files may be deposited by Ingest using the Upload Service API `PUT /area//`. This is an HTTP interface and therefore supports the Content-Type header. The Content-Type of the files transmitted is stored with the file in the Upload Area using standard file metadata.
It is the responsibility of the file uploader to provide a Content-Type with a `dcp-type=` parameter.
+
+Data files are uploaded to the Upload Service using the DCP CLI or (in future when available) the Upload Service Python library, e.g. `hca upload file `. These utilities will attempt to automatically determine (“sniff”) the media type of the file using code built on top of the _libmagic_ library. The parameter `dcp-type=data` will be appended to the media type. The media type may be overridden using command line options or API arguments (TODO).
+
+The DCP Upload Service is implemented using S3, which supports Content-Type metadata for files stored there. The media types provided by the above routes are copied to the S3 Content-Type metadata field.
+
+#### Ingestion Service
+
+Ingest does not currently directly access data files in the Upload Service. It uses the Upload Service API `GET /area/` to obtain a listing of the files present in the Upload Area. This listing contains a `content_type` entry for each file.
+
+#### Data Store Service (DSS)
+
+Files are stored in the DSS by copying them from an Upload Service Upload Area. During this copy process the DSS has access to the metadata, and hence Content-Type, of the file in the Upload Area.
+
+**Should we, in the future, support storage of files from a location that does not support Content-Type metadata, it will be the responsibility of the file provider to tag the file with a `dcp-content-type` tag containing the media type of the file.**
+
+It is the responsibility of the DSS to store the media type of each of the files it contains and return it when requested. This may be implemented using standard service metadata, if such metadata supports a Content-Type concept, or “out of band” e.g. using tags, if necessary.
+ +## Prior Art + +### Primer on Media Types + +As specified in [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1) the structure of a media type used in HTTP `Content-Type` or `Accept` headers is: + + media-type = type "/" subtype *( OWS ";" OWS parameter ) + + type = token + subtype = token -### User Stories -* As a DCP developer, I would like to remember and recall certain design decisions that impact the current DCP architecture and be able to find the designs with ease. + OWS = *( SP / HTAB ) + ; optional whitespace -## Table of Contents + The type/subtype MAY be followed by parameters in the form of name=value pairs. -### Metadata Schema Decoupling + parameter = token "=" ( token / quoted-string ) -**Last edited: June 18th, 2018** +Additionally: -Summary: This document details sites of tight coupling between DCP code and the metadata schema and provides ideas of decoupling the two. +- [RFC 6838 section 3](https://tools.ietf.org/html/rfc6838#section-3) defines subtype prefixes known as "trees", such as "`per.`" (personal/vanity), "`vnd.`" (vendor) and "`x.`" (unregistered). -[Link to document](https://docs.google.com/document/d/1ZPx0Q7flHpab-BBPiYGQsVT84epXd7ISZA7TTQPdVwE/edit) +- [RFC 6839](https://tools.ietf.org/html/rfc6839) and others specify subtype suffixes such as "`+json`" and "`+zip`", which indicate the base format of the type. IANA keeps the [Structured Syntax Suffix Registry](https://www.iana.org/assignments/media-type-structured-suffix/media-type-structured-suffix.xhtml). -### DCP Media Types +- [RFC 2045](https://tools.ietf.org/html/rfc2045) points out that comments "(comment)" are allowed in RFC822 structured headers (they use Content-Type as their example), however this doesn’t appear to have carried over to HTTP. 
-**Last edited: October 20th, 2017** +[Wikipedia](https://en.wikipedia.org/wiki/Media_type) sums up the syntax more succinctly: -Summary: DCP processing involves the transfer and storage of a number of different types of file. To indicate the type of each file to the DCP, we will use Internet media types during data movement and file storage. + top-level-type-name / [ tree. ] subtype-name [ +suffix ] [ ; parameters ] -[Link to document](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#) +Top level types, trees, subtypes and suffixes should all be registered with IANA. -### DCP DNS Design +Top level types are: application, _audio_, _example_, _font_, _image_, _message_, _model_, _multipart_, _text_, _video_. -**Last edited: September 19th, 2018** +Trees are "" (the standard tree), per. (personal), vnd. (vendor) and x. (unregistered). -Summary: Design of the humancellatlas.org domain. +Suffixes are _+xml_, _+json_, _+json-seq_, _+ber_, _+der_, _+fastinfoset_, _+wbxml_ and _+zip_. -[Link to document](https://docs.google.com/document/d/1IcLzWvBzpPDnjPfUqReU-HpOMRP8TmSpxhINvMKDJVg/edit#heading=h.ez2eqrkr6i2p) \ No newline at end of file +Parameters are optional and, with a few exceptions, are not controlled. From ed6812f0466a81a62338056d09199d393c168aef Mon Sep 17 00:00:00 2001 From: maniarathi Date: Wed, 6 Nov 2019 13:43:52 -0800 Subject: [PATCH 4/7] Renaming file. 

---
 ...-rfc-dcp-architecture-decisions.md => 0000-dcp-media-types.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename rfcs/imported/{0000-pre-rfc-dcp-architecture-decisions.md => 0000-dcp-media-types.md} (100%)

diff --git a/rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md b/rfcs/imported/0000-dcp-media-types.md
similarity index 100%
rename from rfcs/imported/0000-pre-rfc-dcp-architecture-decisions.md
rename to rfcs/imported/0000-dcp-media-types.md
From 4448b598d33d974b7677c5447aaeadf431709ee9 Mon Sep 17 00:00:00 2001
From: maniarathi
Date: Wed, 6 Nov 2019 13:52:41 -0800
Subject: [PATCH 5/7] Editorial changes.

---
 rfcs/imported/0000-dcp-media-types.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/imported/0000-dcp-media-types.md b/rfcs/imported/0000-dcp-media-types.md
index ff792941..27e1f3a4 100644
--- a/rfcs/imported/0000-dcp-media-types.md
+++ b/rfcs/imported/0000-dcp-media-types.md
@@ -5,7 +5,7 @@
 
 # DCP Media Types
 
-Please note that this RFC has been imported its original Google Doc and went through a design review process prior to the RFC Process implementation. The original document is [here](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#heading=h.87ix45a71erf). Please be aware that the contents below may potentially be out-of-date as the last-modified date is October 20th, 2017.
+DISCLAIMER: Please note that this RFC has been imported its original Google Doc and went through a design review process prior to the implementation of the current [RFC Process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md). The original document is [here](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#heading=h.87ix45a71erf). Please be aware that the contents below may potentially be out-of-date as the last-modified date of the original Google Document is October 20th, 2017.
 
 ## Summary
 
From 1794edd5a88be6d25a332e4d7885151b7c586ec1 Mon Sep 17 00:00:00 2001
From: maniarathi
Date: Thu, 7 Nov 2019 16:29:30 -0800
Subject: [PATCH 6/7] Addressed Sam's and MarkD's comments.

---
 rfcs/imported/0000-dcp-media-types.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfcs/imported/0000-dcp-media-types.md b/rfcs/imported/0000-dcp-media-types.md
index 27e1f3a4..7f6fe4d4 100644
--- a/rfcs/imported/0000-dcp-media-types.md
+++ b/rfcs/imported/0000-dcp-media-types.md
@@ -5,7 +5,7 @@
 
 # DCP Media Types
 
-DISCLAIMER: Please note that this RFC has been imported its original Google Doc and went through a design review process prior to the implementation of the current [RFC Process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md). The original document is [here](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#heading=h.87ix45a71erf). Please be aware that the contents below may potentially be out-of-date as the last-modified date of the original Google Document is October 20th, 2017.
+DISCLAIMER: Please note that this RFC has been imported from its original Google Doc and was approved by a design review process prior to the implementation of the current [RFC Process](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0001-rfc-process.md). The original document is [here](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw/edit#heading=h.87ix45a71erf).
 
 ## Summary
 
@@ -29,7 +29,7 @@ Transcription Authors:
 
 ## Motivation
 
-The DCP should not invent new media types; we should append `dcp-type=metadata` or `dcp-type=data` to our Content-Types using the media type "parameter" syntax, e.g. `Content-Type: application/json; dcp-type=”metadata/sample”`. Instead the DCP should strive to use "in-band" media type communication mechanisms where possible.
+The DCP should not invent new media types.
Instead the DCP should strive to use "in-band" media type communication mechanisms where possible. We can do this by appending `dcp-type=metadata` or `dcp-type=data` to our Content-Types using the media type "parameter" syntax, e.g. `Content-Type: application/json; dcp-type="metadata/sample"`.
 
 ## Detailed Design
 
From 48a82f4e67f037b2f77308cd004f23f5df1a3ec5 Mon Sep 17 00:00:00 2001
From: Mark Diekhans
Date: Tue, 12 Nov 2019 18:15:32 -0500
Subject: [PATCH 7/7] imported DCP DNS design RFC from old process

---
 rfcs/text/0000-dcp-dns-design.md | 85 ++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 rfcs/text/0000-dcp-dns-design.md

diff --git a/rfcs/text/0000-dcp-dns-design.md b/rfcs/text/0000-dcp-dns-design.md
new file mode 100644
index 00000000..10981a70
--- /dev/null
+++ b/rfcs/text/0000-dcp-dns-design.md
@@ -0,0 +1,85 @@
+### DCP PR:
+
+***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*
+
+`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/)`
+
+# DCP DNS Design
+
+Note: this RFC was approved under the pre-November, 2018 RFC process.
+
+## Summary
+
+This RFC defines the DNS structure of the HCA DCP.
+
+## Author(s)
+
+[Sam Pierson](mailto:spierson@chanzuckerberg.com)
+[Tony Burdett](mailto:tburdett@ebi.ac.uk)
+dvaughan
+
+## Shepherd
+
+[Mark Diekhans](mailto:markd@ucsc.edu)
+
+## Motivation
+
+Provide a single point of documentation of the DCP DNS structure.
+
+## Detailed Design
+
+The [humancellatlas.org](http://humancellatlas.org/) domain is controlled by Sanger IT.
+
+[data.humancellatlas.org](http://data.humancellatlas.org/) is a zone registered in AWS Route 53 that the DCP team controls.
+
+Sanger has added a record to their DNS configuration that delegates look-ups of *.data.humancellatlas.org* domains to the Route 53 nameservers.
They have also added *schema.humancellatlas.org* as a subdomain. The AWS Route 53 settings have been added.
+
+
+
+subdomain                 |    | name server
+--------------------------|----|------------
+data.humancellatlas.org   | NS | ns-202.awsdns-25.com
+                          |    | ns-1127.awsdns-12.org
+                          |    | ns-1554.awsdns-02.co.uk
+                          |    | ns-812.awsdns-37.net
+schema.humancellatlas.org | NS | ns-1071.awsdns-05.org
+                          |    | ns-2004.awsdns-58.co.uk
+                          |    | ns-339.awsdns-42.com
+                          |    | ns-671.awsdns-19.net
+
+DCP services for the production deployment live under the domain *.data.humancellatlas.org*. Services in pre-production deployments have the deployment name inserted before *.data.
+
+
+
+### Production Deployment:
+
+Service | hostname
+--------|---------
+Ingest Broker | ingest.data.humancellatlas.org
+Ingest API | api.ingest.data.humancellatlas.org
+Upload Service API | upload.data.humancellatlas.org
+Data Store API | dss.data.humancellatlas.org
+Secondary Analysis API | pipelines.data.humancellatlas.org
+
+### Staging Deployment (periodic releases, more stable):
+
+Service | hostname
+--------|---------
+Ingest Broker | ingest.staging.data.humancellatlas.org
+Ingest API | api.ingest.staging.data.humancellatlas.org
+Upload Service API | upload.staging.data.humancellatlas.org
+Data Store API | dss.staging.data.humancellatlas.org
+Secondary Analysis API | pipelines.staging.data.humancellatlas.org
+
+### Development Deployment (ongoing development, unstable):
+
+Service | hostname
+--------|---------
+Ingest Broker | ingest.dev.data.humancellatlas.org
+Ingest API | api.ingest.dev.data.humancellatlas.org
+Upload Service API | upload.dev.data.humancellatlas.org
+Data Store API | dss.dev.data.humancellatlas.org
+Secondary Analysis API | pipelines.dev.data.humancellatlas.org
+
+
+
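Reviewer note on the hostname scheme in PATCH 7/7: the three deployment tables appear to follow one rule (service prefix, then the deployment name for pre-production, then `data.humancellatlas.org`). A minimal Python sketch of that rule follows. The names `BASE_DOMAIN`, `SERVICE_PREFIXES` and `service_hostname` are hypothetical and not taken from any DCP codebase, and the sketch assumes every service, including Secondary Analysis, lives under the same base domain.

```python
# Illustrative sketch only: names below are hypothetical, not DCP code.
BASE_DOMAIN = "data.humancellatlas.org"

# Service prefixes as listed in the deployment tables of the DNS RFC.
SERVICE_PREFIXES = {
    "Ingest Broker": "ingest",
    "Ingest API": "api.ingest",
    "Upload Service API": "upload",
    "Data Store API": "dss",
    "Secondary Analysis API": "pipelines",
}


def service_hostname(service: str, deployment: str = "prod") -> str:
    """Build a hostname for a service in a given deployment.

    Production hostnames sit directly under the base domain; any other
    deployment name (e.g. "staging", "dev") is inserted before it.
    """
    prefix = SERVICE_PREFIXES[service]
    if deployment == "prod":
        return f"{prefix}.{BASE_DOMAIN}"
    return f"{prefix}.{deployment}.{BASE_DOMAIN}"


if __name__ == "__main__":
    for name in SERVICE_PREFIXES:
        print(name, "->", service_hostname(name, "staging"))
```

Keeping the prefixes in one mapping would make a misspelled domain in a single table row easy to spot, since every hostname is derived from the same base string.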