Phenodigm evidence reading and writing
The ETL process that extracts the Phenodigm data has changed over the last three years, as the way the data are published has changed too. In the first release, the ETL script sent requests to a remote SOLR instance to fetch the evidence in several steps. In the second release, the ETL script read the data directly from a JSON dump. In both cases, the process had to be run several times to ensure all the data were extracted.
The Phenodigm data are located in Google Cloud Storage at https://console.cloud.google.com/storage/browser/otar000-evidence_input/Phenodigm/?project=open-targets
Currently, we have access to all the SOLR data. Creating and running an instance of SOLR with the Phenodigm data is straightforward.
First, create a directory on your local computer to store the documents:
mkdir -p $HOME/Documents/data/phenodigm/mycores
Then, copy the latest SOLR index there:
cp -r phenodigm2_v20171129-0 $HOME/Documents/data/phenodigm/mycores/
Finally, start SOLR in a Docker container, mounting the directory that contains the core you have just created:
docker run -p 8983:8983 -v $HOME/Documents/data/phenodigm/mycores:/opt/solr/server/solr/mycores solr:5
From this point, you'll be able to query the data directly from SOLR, as in this simple example:
curl http://localhost:8983/solr/phenodigm/query -d '
{
  query:"*"
}'
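The same request can be issued programmatically. The sketch below (illustrative, not part of the actual ETL code) builds the JSON body that the curl example posts to SOLR's JSON Request API:

```python
import json

# Endpoint from the curl example above.
SOLR_URL = "http://localhost:8983/solr/phenodigm/query"

def build_query_body(query="*"):
    """Serialise a minimal body for SOLR's JSON Request API."""
    return json.dumps({"query": query})

# Post build_query_body() to SOLR_URL to reproduce the curl example.
body = build_query_body()
```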
The data model may change from release to release, which requires adapting the code and is time-consuming. To check which entities are represented in the index, execute:
http://localhost:8983/solr/phenodigm/select?q=*:*&wt=json&indent=true&rows=0&facet=on&facet.field=type
This will return the number of documents for each entity type:
{
"responseHeader":{
"status":0,
"QTime":43,
"params":{
"q":"*:*",
"facet.field":"type",
"indent":"true",
"rows":"0",
"wt":"json",
"facet":"on"}},
"response":{"numFound":6338905,"start":0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"type":[
"disease_model_summary",5751062,
"ontology_ontology",415103,
"gene",56175,
"mouse_model",40540,
"ontology",25582,
"disease",18245,
"disease_search",18245,
"disease_gene_summary",13953]},
"facet_dates":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}
}
In October 2018, we started connecting directly to the IMPC SOLR index to retrieve the data:
http://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=*:*&wt=json&indent=true&rows=0&facet=on&facet.field=type
{
"responseHeader":{
"status":0,
"QTime":90,
"params":{
"q":"*:*",
"facet.field":"type",
"indent":"true",
"rows":"0",
"wt":"json",
"facet":"on"}},
"response":{"numFound":7024687,"start":0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"type":[
"disease_model_summary",6073263,
"ontology_ontology",464149,
"gene",341121,
"mouse_model",42120,
"ontology",26625,
"gene_gene",26525,
"disease",18268,
"disease_search",18268,
"disease_gene_summary",14348]},
"facet_dates":{},
"facet_ranges":{}}}
This is an example query returning all the disease models for one disease (Carney complex, type 1) with a Phenodigm score in the range [50, 100].
http://localhost:8983/solr/phenodigm/select?q=disease_id:%22OMIM:160980%22%20AND%20%20disease_model_max_norm:[50%20TO%20100]&wt=json&indent=true&fq=type:disease_model_summary&start=0&rows=20
If you want the disease models for one specific gene and disease:
http://localhost:8983/solr/phenodigm/select?q=disease_id:%22OMIM:160980%22%20AND%20marker_symbol:%20%22Duox2%22%20AND%20%20disease_model_max_norm:[50%20TO%20100]&wt=json&indent=true&fq=type:disease_model_summary&start=0&rows=20
If you want to know how many models have a score greater than or equal to 50:
http://localhost:8983/solr/phenodigm/select?q=disease_model_max_norm:[50.0%20TO%20100]&wt=json&indent=true&fq=type:disease_model_summary&start=0&rows=20
If you want to get the gene documents and see how they are structured:
http://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=*:*&wt=json&indent=true&fq=type:gene&start=0&rows=20
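Hand-encoding these URLs is error-prone; the queries above are easier to maintain if the parameters are built programmatically. The helper below is a sketch (the function name and defaults are illustrative, not from the actual parser; note that SOLR's standard paging parameter is `rows`):

```python
from urllib.parse import urlencode

# Local SOLR endpoint from the examples above.
BASE = "http://localhost:8983/solr/phenodigm/select"

def disease_model_query(disease_id, min_score=50, max_score=100, rows=20):
    """Build a SOLR select URL for disease_model_summary documents
    matching one disease within a Phenodigm score range."""
    params = {
        "q": (f'disease_id:"{disease_id}" AND '
              f"disease_model_max_norm:[{min_score} TO {max_score}]"),
        "fq": "type:disease_model_summary",
        "wt": "json",
        "indent": "true",
        "start": 0,
        "rows": rows,
    }
    return BASE + "?" + urlencode(params)

url = disease_model_query("OMIM:160980")
```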
The first step before generating the evidence is to update the cache containing the mouse and human gene information. This is achieved by running the following commands:
cd evidence_datasource_parsers/
python CommandLine.py --phenodigm --update-cache
The cache will be updated directly in Google Cloud Storage.
The next step is to run the code that generates the evidence. This time there is no need for an extra command-line flag.
cd evidence_datasource_parsers/
python CommandLine.py --phenodigm
The script will iterate over all the documents in the SOLR index and transform them into evidence strings. Only associations with a Phenodigm score greater than or equal to 50 are kept.
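The score cut-off can be sketched as a simple filter over the retrieved documents (a simplified illustration; the real logic lives in the parser invoked by `CommandLine.py`):

```python
# Threshold stated above: evidence is kept only at or above this score.
PHENODIGM_SCORE_THRESHOLD = 50.0

def keep_evidence(doc):
    """Apply the Phenodigm score cut-off used when generating evidence."""
    return doc.get("disease_model_max_norm", 0.0) >= PHENODIGM_SCORE_THRESHOLD

# Toy disease_model_summary documents, shaped like the SOLR examples above.
docs = [
    {"disease_id": "OMIM:160980", "disease_model_max_norm": 83.2},
    {"disease_id": "OMIM:160980", "disease_model_max_norm": 41.7},
]
evidence_docs = [d for d in docs if keep_evidence(d)]
# only the first document passes the threshold
```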