
Phenodigm evidence reading and writing

Gautier Koscielny edited this page Oct 10, 2018 · 16 revisions

The ETL process for extracting the Phenodigm data has changed over the last three years, following changes in how the data are made available. In the first release, the ETL script sent requests to a remote SOLR instance to fetch the evidence in several steps. In the second release, the ETL script read the data directly from a JSON dump. In both cases, the process had to be run several times to ensure that all the data were extracted.

Where are the Phenodigm data?

The Phenodigm data are located in Google Cloud Storage at https://console.cloud.google.com/storage/browser/otar000-evidence_input/Phenodigm/?project=open-targets

Loading the Phenodigm data into SOLR

Currently, we have access to all the SOLR data. Creating and running an instance of SOLR with the Phenodigm data is straightforward.

First, create a directory on your local computer to store the documents.

mkdir -p $HOME/Documents/data/phenodigm/mycores

Then, copy the latest SOLR index there.

cp -r phenodigm2_v20171129-0 $HOME/Documents/data/phenodigm/mycores/

Finally, start SOLR on a machine by pointing to the SOLR core that you have created.

docker run -p 8983:8983 -v $HOME/Documents/data/phenodigm/mycores:/opt/solr/server/solr/mycores solr:5

From this point, you'll be able to query the data directly from SOLR, as in this simple example:

curl http://localhost:8983/solr/phenodigm/query -d '
{
  query:"*"
}'
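The same JSON request can be issued from Python. This is a minimal sketch that only builds the request object; the host, port, and core name are assumptions based on the docker command above:

```python
import json
import urllib.request

def build_solr_query(base_url, query):
    """Build a POST request for SOLR's JSON request API."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/query",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Mirrors the curl example above; send it with urllib.request.urlopen(req)
req = build_solr_query("http://localhost:8983/solr/phenodigm", "*:*")
```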

The data model may change from release to release, which requires adapting the code and is time-consuming. To check which entities are represented in the index, execute:

http://localhost:8983/solr/phenodigm/select?q=*:*&wt=json&indent=true&rows=0&facet=on&facet.field=type

This will return the number of documents per entity along with their names:

{
  "responseHeader":{
    "status":0,
    "QTime":43,
    "params":{
      "q":"*:*",
      "facet.field":"type",
      "indent":"true",
      "rows":"0",
      "wt":"json",
      "facet":"on"}},
  "response":{"numFound":6338905,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "type":[
        "disease_model_summary",5751062,
        "ontology_ontology",415103,
        "gene",56175,
        "mouse_model",40540,
        "ontology",25582,
        "disease",18245,
        "disease_search",18245,
        "disease_gene_summary",13953]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}
}
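The `type` array in `facet_fields` above is a flat list that alternates entity names and document counts. A small helper (the function name here is made up for illustration) can turn it into a dictionary for easier inspection:

```python
def facet_pairs(flat):
    """Convert SOLR's flat facet list [name, count, name, count, ...] to a dict."""
    return dict(zip(flat[0::2], flat[1::2]))

# Counts taken from the facet response above
types = facet_pairs([
    "disease_model_summary", 5751062,
    "ontology_ontology", 415103,
    "gene", 56175,
])
```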

In October 2018, we connected directly to the IMPC SOLR index to retrieve the data.

http://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=*:*&wt=json&indent=true&rows=0&facet=on&facet.field=type
{
  "responseHeader":{
    "status":0,
    "QTime":90,
    "params":{
      "q":"*:*",
      "facet.field":"type",
      "indent":"true",
      "rows":"0",
      "wt":"json",
      "facet":"on"}},
  "response":{"numFound":7024687,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "type":[
        "disease_model_summary",6073263,
        "ontology_ontology",464149,
        "gene",341121,
        "mouse_model",42120,
        "ontology",26625,
        "gene_gene",26525,
        "disease",18268,
        "disease_search",18268,
        "disease_gene_summary",14348]},
    "facet_dates":{},
    "facet_ranges":{}}}

Get all the disease models for all or one disease

This is an example query to return all the disease models for one disease (Carney Complex, type I) with a Phenodigm score in the range [50, 100].

http://localhost:8983/solr/phenodigm/select?q=disease_id:%22OMIM:160980%22%20AND%20%20disease_model_max_norm:[50%20TO%20100]&wt=json&indent=true&fq=type:disease_model_summary&start=0&rows=20

If you want the disease model for one specific gene and disease:

http://localhost:8983/solr/phenodigm/select?q=disease_id:%22OMIM:160980%22%20AND%20marker_symbol:%20%22Duox2%22%20AND%20%20disease_model_max_norm:[50%20TO%20100]&wt=json&indent=true&fq=type:disease_model_summary&start=0&rows=20

If you want to know how many models have a score greater than or equal to 50:

http://localhost:8983/solr/phenodigm/select?q=disease_model_max_norm:[50.0%20TO%20100]&wt=json&indent=true&fq=type:disease_model_summary&start=0&rows=20

If you want to get the gene documents and see how they are structured:

http://www.ebi.ac.uk/mi/impc/solr/phenodigm/select?q=*:*&wt=json&indent=true&fq=type:gene&start=0&rows=20
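Select URLs like the ones above can also be generated programmatically. This sketch builds the first disease-model query; the parameter names follow the standard SOLR select handler, while the helper name itself is hypothetical:

```python
from urllib.parse import urlencode

def disease_model_url(base, disease_id, min_score=50, start=0, rows=20):
    """Build a select URL for disease_model_summary docs above a score threshold."""
    params = {
        "q": 'disease_id:"%s" AND disease_model_max_norm:[%d TO 100]'
             % (disease_id, min_score),
        "fq": "type:disease_model_summary",
        "wt": "json",
        "start": start,
        "rows": rows,
    }
    return base + "/select?" + urlencode(params)

# Equivalent to the Carney Complex, type I example above
url = disease_model_url("http://localhost:8983/solr/phenodigm", "OMIM:160980")
```

Increasing `start` in steps of `rows` pages through the full result set.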

Updating the mouse and human gene cache

The first step before generating the evidence is to update the cache containing the mouse and human gene information. This is achieved by running the following command:

cd evidence_datasource_parsers/
python CommandLine.py --phenodigm --update-cache

The cache will be updated directly in Google Cloud Storage.

Generating Phenodigm evidence

The next step is to run the code that generates the evidence. This time there is no need for the extra --update-cache flag.

cd evidence_datasource_parsers/
python CommandLine.py --phenodigm

The script will go through the SOLR index, access all the documents, and transform them into evidence strings. Only models with a Phenodigm score greater than or equal to 50 are kept.
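That final filtering step can be sketched as follows. The score field name follows the disease_model_summary queries above; the document shapes are simplified, hypothetical examples rather than the parser's actual data structures:

```python
SCORE_THRESHOLD = 50.0  # minimum Phenodigm score, per the text above

def passes_threshold(doc):
    """Keep only disease_model_summary docs at or above the score threshold."""
    return doc.get("disease_model_max_norm", 0.0) >= SCORE_THRESHOLD

# Hypothetical documents illustrating the filter
docs = [
    {"disease_id": "OMIM:160980", "disease_model_max_norm": 76.2},
    {"disease_id": "OMIM:160980", "disease_model_max_norm": 41.0},
]
kept = [d for d in docs if passes_threshold(d)]
```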