Merge remote-tracking branch 'origin/master' into 1.4
edeutsch committed Jul 5, 2023
2 parents 4cb21f9 + 89d8d18 commit 7583c73
Showing 1 changed file with 132 additions and 2 deletions.
@@ -1,3 +1,133 @@
# A TRAPI Attribute Specification for Source Retrieval Provenance

## Overview
"Source retrieval provenance" describes the set of Information Resources through which the knowledge expressed in an Edge was passed, through various retrieval and/or transform operations, on its way to its current serialized form. For example, the provenance of a Gene-Chemical Edge in a message sent to a Translator ARA (e.g. ARAGORN) might be traced through the Translator KP that provided it (e.g. MolePro), one or more intermediate aggregator resources (e.g. ChEMBL), and back to the resource that originally created/curated it (e.g. ClinicalTrials.org).

````
ARAGORN --retrieved_from--> MolePro --retrieved_from--> ChEMBL --retrieved_from--> ClinicalTrials.gov
````
Note that source retrieval provenance concerns the **mechanical retrieval and transformation of data between web accessible information systems**. It does not trace the source of knowledge back to specific publications or data sets, and it is not concerned with the reasoning, inference or analysis activities that generate knowledge in the first place. These types of provenance are handled by a different set of properties in the EPC model (e.g. see the ‘Supporting Publications Specification’ [here](https://github.com/NCATSTranslator/ReasonerAPI/blob/master/ImplementationGuidance/Specifications/supporting_publications_specification.md)).

## The Model
While the TRAPI schema uses the generic Attribute class for representing nearly all metadata about Edges in knowledge graphs, metadata about **source retrieval provenance** is an exception, given the need to efficiently find and parse this information for purposes of edge merging and debugging. A complete specification will be provided here soon. This early draft provides a brief overview of the model itself, guidance and conventions for implementing the model, and a few data examples to follow.

The diagram below shows the classes and properties defined in the [TRAPI schema](https://github.com/NCATSTranslator/ReasonerAPI/blob/master/TranslatorReasonerAPI.yaml#L1107) to support representation of source retrieval provenance metadata.

![image](https://github.com/NCATSTranslator/ReasonerAPI/assets/5184212/840b8061-2fe4-4e15-968f-97cd87de22ab)

Briefly, the `Edge.sources` property points to one or more `RetrievalSource` objects, which capture information about how a particular InformationResource served as a source from which knowledge expressed in an Edge, or data used to generate this knowledge, was retrieved. This includes the infores CURIE of the resource, the role it played (primary knowledge source, aggregator knowledge source, or supporting data source), and the Information Resources that were directly upstream in the retrieval chain. In addition, data providers can capture URLs of specific records in the Resource where the information reported in the Edge can be found.
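
To make the shape of this object concrete, here is a minimal Python sketch of a `RetrievalSource` record. It is an illustration only, not the schema itself: the role strings follow the wording used in this draft (the enumeration in the TRAPI schema may spell them differently), and the field name `source_record_urls` for the record URLs is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Role names as written in this draft; the schema's enum strings may differ.
RESOURCE_ROLES = {
    "primary knowledge source",
    "aggregator knowledge source",
    "supporting data source",
}


@dataclass
class RetrievalSource:
    """Illustrative stand-in for one entry of Edge.sources."""
    resource_id: str                      # infores CURIE of the resource
    resource_role: str                    # one of RESOURCE_ROLES
    upstream_resource_ids: List[str] = field(default_factory=list)
    source_record_urls: List[str] = field(default_factory=list)  # assumed name

    def __post_init__(self) -> None:
        if self.resource_role not in RESOURCE_ROLES:
            raise ValueError(f"unknown resource_role: {self.resource_role!r}")

    def to_dict(self) -> Dict[str, object]:
        """Serialize to a plain dict, omitting empty optional lists."""
        out: Dict[str, object] = {
            "resource_id": self.resource_id,
            "resource_role": self.resource_role,
        }
        if self.upstream_resource_ids:
            out["upstream_resource_ids"] = list(self.upstream_resource_ids)
        if self.source_record_urls:
            out["source_record_urls"] = list(self.source_record_urls)
        return out


# Example: MolePro acting as an aggregator that retrieved from ChEMBL.
molepro = RetrievalSource(
    resource_id="infores:molepro",
    resource_role="aggregator knowledge source",
    upstream_resource_ids=["infores:chembl"],
)
```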


## Implementation Guidance
A quick guide for implementers. Using the model described above:

1. All Edges MUST report **one and only one** Retrieval Source serving as the `primary knowledge source`.

2. All Edges MUST provide a list of any Retrieval Sources that served as `aggregator knowledge sources` by retrieving the knowledge expressed in the Edge from the primary source or another aggregator.

3. All Edges representing knowledge generated through analysis of data by a Translator Knowledge Provider (KP) SHOULD report any Retrieval Sources providing the data that they operated on as a `supporting data source`.

4. Values of the `RetrievalSource.resource_id` MUST be a CURIE from the InfoRes Catalog [here](https://github.com/biolink/biolink-model/blob/master/infores_catalog.yaml) (e.g. “infores:dgidb”, “infores:molepro”).
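
The two mechanically checkable rules above (1 and 4) could be tested with something like the following sketch. The function name `check_edge_sources` is hypothetical, and the role string follows the wording of this draft. The check on `resource_id` only looks at the `infores:` prefix; confirming that the CURIE actually appears in the InfoRes Catalog would require loading the catalog, which is not attempted here.

```python
from typing import Dict, List

PRIMARY = "primary knowledge source"  # role wording as used in this draft


def check_edge_sources(sources: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems: List[str] = []

    # Rule 1: one and only one primary knowledge source.
    primaries = [s for s in sources if s.get("resource_role") == PRIMARY]
    if len(primaries) != 1:
        problems.append(
            f"expected exactly one primary knowledge source, found {len(primaries)}"
        )

    # Rule 4 (prefix check only): resource_id must look like an infores CURIE.
    for s in sources:
        rid = s.get("resource_id", "")
        if not rid.startswith("infores:"):
            problems.append(f"resource_id {rid!r} is not an infores CURIE")

    return problems


example_sources = [
    {"resource_id": "infores:dgidb", "resource_role": PRIMARY},
    {
        "resource_id": "infores:molepro",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:dgidb"],
    },
]
problems = check_edge_sources(example_sources)
```

Rules 2 and 3 depend on what a resource actually did during retrieval, so they cannot be verified from the serialized Edge alone.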


## Data Examples

Below we provide JSON data examples illustrating two retrieval scenarios.

**Scenario 1**: Knowledge retrieval from a single external knowledge source
A single Edge originates in primary source KS1, and is retrieved through multiple aggregators ending with the UI. Along the way, ARA1 merges the two edges retrieved from KP1 and KP2.

![image](https://github.com/NCATSTranslator/ReasonerAPI/assets/5184212/39f08657-f4a5-4410-b2c4-244a9558ef4b)
*KS = an external Knowledge Source. KP = a Translator Knowledge Provider.  ARA = a Translator Automated Reasoning Agent, UI = the Translator User Interface.
Each arrow in the diagram above (R1-R5) represents the distinct retrieval of one edge.*

````
{
  "edges": {
    "subject": "RXCUI:1544384",
    "predicate": "biolink:treats",
    "object": "MONDO:0008383",
    "sources": [
      {
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:KS_1",
        "resource_role": "primary knowledge source"
      },
      { # R1
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:KP_1",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:KS_1"]
      },
      { # R2
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:KP_2",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:KS_1"]
      },
      { # R3, R4
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:ARA_1",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:KP_1", "infores:KP_2"]
      },
      { # R5
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:UI",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:ARA_1"]
      }
    ]
  }
}
````
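
The merge step performed by ARA1 in Scenario 1 can be sketched in Python as follows. This is one possible approach under stated assumptions, not a prescribed algorithm: the function name `merge_retrieved_edges` is hypothetical, sources are plain dicts keyed as in the example above, and the caller is assumed to know which resource each copy of the edge was directly retrieved from.

```python
from typing import Dict, List


def merge_retrieved_edges(
    retrieved: Dict[str, List[dict]], merger_id: str
) -> List[dict]:
    """Merge the `sources` lists of copies of one edge retrieved from several providers.

    `retrieved` maps the infores id of each directly queried provider to the
    `sources` list that came with its copy of the edge.
    """
    merged: Dict[str, dict] = {}
    for sources in retrieved.values():
        for src in sources:
            # Keep one entry per resource_id, in first-seen order.
            entry = merged.setdefault(
                src["resource_id"],
                {
                    "resource_id": src["resource_id"],
                    "resource_role": src["resource_role"],
                },
            )
            # Union the upstream ids reported for the same resource.
            for up in src.get("upstream_resource_ids", []):
                ups = entry.setdefault("upstream_resource_ids", [])
                if up not in ups:
                    ups.append(up)

    # The merging resource records itself as an aggregator whose direct
    # upstreams are the providers it retrieved the edge from.
    merged[merger_id] = {
        "resource_id": merger_id,
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": list(retrieved),
    }
    return list(merged.values())


ks1 = {"resource_id": "infores:KS_1", "resource_role": "primary knowledge source"}
from_kp1 = [ks1, {"resource_id": "infores:KP_1",
                  "resource_role": "aggregator knowledge source",
                  "upstream_resource_ids": ["infores:KS_1"]}]
from_kp2 = [ks1, {"resource_id": "infores:KP_2",
                  "resource_role": "aggregator knowledge source",
                  "upstream_resource_ids": ["infores:KS_1"]}]

merged_sources = merge_retrieved_edges(
    {"infores:KP_1": from_kp1, "infores:KP_2": from_kp2}, "infores:ARA_1"
)
```

With the inputs shown, `merged_sources` contains one entry each for KS_1, KP_1, KP_2 and ARA_1, matching the first four entries of the JSON example above.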

**Scenario 2:** Retrieval of knowledge generated by a KP from data
In this scenario, the knowledge expressed in the Edge being retrieved was originally generated by a KP based on analysis of data it retrieved from upstream data sources. This is often the case for KPs like ICEES, COHD, and Multiomics KP that generate Edges reporting statistical correlations between variables in clinical, environmental, or multiomics datasets.

In the scenario diagrammed below, data from two sources (DB1, DB2) is retrieved by KP1, where the data is analyzed to generate an Edge. This makes KP1 the "primary source" of the knowledge, and DB1 and DB2 "supporting data sources". ARA1 then retrieves this edge from KP1 and passes it along to the UI.

![image](https://github.com/NCATSTranslator/ReasonerAPI/assets/5184212/40cce738-1235-4ab3-8628-fca92e348761)
*DB = an external data source. KP = a Translator Knowledge Provider.  ARA = a Translator Automated Reasoning Agent, UI = the Translator User Interface.
Each arrow (R1-R5) represents a distinct retrieval event (grey arrows/text indicate the retrieval of *data* rather than knowledge).*

````
{
  "edges": {
    "id": "e21aa4542",
    "subject": "RXCUI:1544384",
    "predicate": "biolink:correlated_with",
    "object": "MONDO:0008383",
    "sources": [
      {
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:DB_1",
        "resource_role": "supporting data source"
      },
      {
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:DB_2",
        "resource_role": "supporting data source"
      },
      { # R1, R2
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:KP_1",
        "resource_role": "primary knowledge source",
        "upstream_resource_ids": ["infores:DB_1", "infores:DB_2"]
      },
      { # R3
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:ARA_1",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:KP_1"]
      },
      { # R4
        "type": "biolink:RetrievalSource",
        "resource_id": "infores:UI",
        "resource_role": "aggregator knowledge source",
        "upstream_resource_ids": ["infores:ARA_1"]
      }
    ]
  }
}
````
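
For debugging, a consumer may want to reconstruct the retrieval chain from a `sources` list like the one above. The Python sketch below walks `upstream_resource_ids` from the most downstream resource back to the resources with no upstream. The function name `retrieval_paths` is hypothetical, and the input is assumed to be a list of plain dicts using the key names from the example.

```python
from typing import Dict, List


def retrieval_paths(sources: List[dict], start_id: str) -> List[List[str]]:
    """List every path from `start_id` back to a resource with no upstream."""
    by_id: Dict[str, dict] = {s["resource_id"]: s for s in sources}
    paths: List[List[str]] = []

    def walk(rid: str, path: List[str]) -> None:
        path = path + [rid]
        upstream = by_id.get(rid, {}).get("upstream_resource_ids", [])
        if not upstream:
            paths.append(path)  # reached the start of a retrieval chain
            return
        for up in upstream:
            # Skip ids already on the path so that malformed, cyclic input
            # cannot recurse forever.
            if up not in path:
                walk(up, path)

    walk(start_id, [])
    return paths


scenario_2_sources = [
    {"resource_id": "infores:DB_1", "resource_role": "supporting data source"},
    {"resource_id": "infores:DB_2", "resource_role": "supporting data source"},
    {"resource_id": "infores:KP_1", "resource_role": "primary knowledge source",
     "upstream_resource_ids": ["infores:DB_1", "infores:DB_2"]},
    {"resource_id": "infores:ARA_1", "resource_role": "aggregator knowledge source",
     "upstream_resource_ids": ["infores:KP_1"]},
    {"resource_id": "infores:UI", "resource_role": "aggregator knowledge source",
     "upstream_resource_ids": ["infores:ARA_1"]},
]

paths = retrieval_paths(scenario_2_sources, "infores:UI")
for p in paths:
    print(" --retrieved_from--> ".join(p))
```

For the Scenario 2 data this yields two paths, one ending at each supporting data source, both passing through KP_1 as the primary knowledge source.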




The Attribute-based approach for representing source retrieval provenance (specified [here](https://docs.google.com/document/d/177sOmjTueIK4XKJ0GjxsARg909CaU71tReIehAp5DDo/edit#heading=h.9mu3cpnwwefy)) is now deprecated and replaced by a refactored model implemented in TRAPI 1.3. Documentation COMING HERE SOON.
