This is the codebase for the PGxMine project, which uses text mining to identify papers for curation into PharmGKB. It is a Python 3 project that uses the Kindred relation classifier along with the BioText project to manage the download of PubMed/PMC and alignment with PubTator.
The data can be viewed through the Shiny app. It can be downloaded as TSV files at Zenodo.
To run a local instance of the PGxMine viewer, use the R Shiny code in shiny/, which also contains the installation instructions.
This project depends on Kindred, scispacy and snakemake. They can be installed by:
pip install -r requirements.txt
This project uses a variety of data sources.
A few need to be downloaded as below, apart from DrugBank, which needs to be downloaded manually.
- MeSH (only needed to update the drug list)
- DrugBank (download manually as account is required and name it drugbank.xml)
- PharmGKB (used for constructing the drug list and comparisons)
The prepareData.sh script downloads some of the data dependencies and runs some preprocessing to extract necessary data (such as gene name mappings). The commands that it runs are detailed below.
year=`date +"%Y"`
# Download MeSH, dbSNP, Entrez Gene metadata and PharmGKB drug info
sh downloadDataDependencies.sh
# Extract the gene names associated with rsIDs from dbSNP
python linkRSIDToGeneName.py --dbsnp <(zcat data/GCF_000001405.*.gz) --pubtator <(zcat data/bioconcepts2pubtatorcentral.gz) --outFile data/dbsnp_selected.tsv
# Create the drug list with mappings from MeSH IDs to PharmGKB IDs (with some filtering using DrugBank categories)
python createDrugList.py --meshC data/c$year.bin --meshD data/d$year.bin --drugbank drugbank.xml --pharmgkb data/drugs.tsv --outFile data/selected_chemicals.json
# Extract a mapping from Entrez Gene ID to name
zgrep -P "^9606\t" data/gene_info.gz | cut -f 2,3,10 -d $'\t' > data/gene_names.tsv
# Unzip the annotated training data of pharmacogenomics relations
gunzip -c annotations.variant_other.bioc.xml.gz > data/annotations.variant_other.bioc.xml
gunzip -c annotations.variant_star_rs.bioc.xml.gz > data/annotations.variant_star_rs.bioc.xml
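For readers less familiar with the zgrep/cut one-liner above, the gene name extraction step can be sketched in Python. This is only an illustrative equivalent, not part of the repository: the function names select_human_gene_fields and write_gene_names are hypothetical, and the sketch assumes gene_info.gz is a tab-separated file whose first field is the taxonomy ID, as the shell command implies.

```python
import gzip

HUMAN_TAX_ID = "9606"  # the taxonomy ID matched by the zgrep pattern "^9606\t"


def select_human_gene_fields(lines):
    """Yield fields 2, 3 and 10 (1-based) of rows whose first field is 9606.

    Hypothetical helper mirroring `zgrep -P "^9606\\t" | cut -f 2,3,10`.
    Unlike cut, rows with fewer than 10 fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == HUMAN_TAX_ID and len(fields) >= 10:
            yield fields[1], fields[2], fields[9]


def write_gene_names(in_path, out_path):
    """Stream a gzipped gene_info file and write the selected fields as TSV."""
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for row in select_human_gene_fields(src):
            dst.write("\t".join(row) + "\n")
```

Under these assumptions, write_gene_names("data/gene_info.gz", "data/gene_names.tsv") would produce the same kind of file as the shell command in prepareData.sh.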
There is an example input file in the test_data directory which contains a PubMed abstract in BioC format. The run_example.sh script does a full run extracting chemical/variant associations and is shown below with comments. The final output is three files: mini_unfiltered.tsv, mini_collated.tsv and mini_sentences.tsv. This is equivalent to the test run with snakemake shown below.
To run a small example of the pipeline using snakemake, run the command below. This runs on the data in the test_data directory. It is equivalent to the commands in the run_example.sh script which provides some comments on what each step does. Snakemake is useful for running on the larger datasets with the full run commands further down.
MODE=test snakemake --cores 1
To do a full run, you need to set up a local instance of BioText with the biocxml format. The command below runs Snakemake on the BioText documents. You must change BIOTEXT to point to the biocxml directory in your local instance of BioText. The run will take a while, so a cluster is recommended, using snakemake's cluster support.
MODE=full BIOTEXT=/path/to/biotext/biocxml snakemake --cores 1
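If the pipeline is launched from a Python script rather than a shell, the two invocations above can be wrapped as in the sketch below. It only sets the MODE and BIOTEXT environment variables and calls snakemake with --cores, exactly as in the commands shown. The helper names build_snakemake_env and run_pipeline are hypothetical and are not part of this repository.

```python
import os
import subprocess


def build_snakemake_env(mode, biotext=None, base_env=None):
    """Return an environment with MODE (and BIOTEXT for a full run) set.

    Hypothetical helper: `mode` is "test" or "full", as in the commands above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MODE"] = mode
    if mode == "full":
        if not biotext:
            raise ValueError("a full run needs the path to the BioText biocxml directory")
        env["BIOTEXT"] = biotext
    return env


def run_pipeline(mode, biotext=None, cores=1):
    """Run snakemake in the current directory with the prepared environment."""
    cmd = ["snakemake", "--cores", str(cores)]
    return subprocess.run(cmd, env=build_snakemake_env(mode, biotext), check=True)
```

For example, run_pipeline("test") corresponds to the small test run, and run_pipeline("full", "/path/to/biotext/biocxml") to the full run.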
Here is a summary of the main script files. The Snakefile manages the execution of these in the correct ordering.
- findPGxSentences.py: Identify star alleles then find sentences that mention a chemical and variant
- getRelevantMeSH.py: Extracts MeSH terms related to age groups that are used by additional analysis
- createKB.py: Train and apply a relation classifier to extract pharmacogenomic chemical/variant associations
- filterAndCollate.py: Filter the results to reduce false positives and collate the associations
- utils/__init__.py: Big functions for variant normalization and outputting the formatted sentences
- createDrugList.py: Creates the list of drugs and drug mappings from MeSH IDs to PharmGKB IDs with some filtering by categories
- linkRSIDToGeneName.py: Extracts gene names from dbSNP associated with rsIDs
- linkStarToRSID.py: Some rudimentary text mining to link star alleles with a specific rsID
- prepareForAnnotation.py: Select sentences and output to the standoff format to be annotated
- prCurve.py: Calculate PR curves for the classifiers
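To illustrate what a PR curve calculation involves, here is a minimal, generic sketch that computes precision and recall at each score threshold. It is not the code in prCurve.py: the function name precision_recall_points and its inputs (a list of classifier scores and a parallel list of 0/1 gold labels) are assumptions made for illustration.

```python
def precision_recall_points(scores, labels):
    """Return (threshold, precision, recall) tuples, highest threshold first.

    `scores` are classifier confidences and `labels` are 1 for a true
    relation and 0 otherwise. Hypothetical helper, not the project's code.
    """
    total_positives = sum(labels)
    # Rank predictions from most to least confident
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = 0
    predicted = 0
    for i, (score, label) in enumerate(ranked):
        predicted += 1
        true_pos += label
        # Emit a point only after every prediction sharing this score is counted
        is_last_of_score = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_score:
            precision = true_pos / predicted
            recall = true_pos / total_positives if total_positives else 0.0
            points.append((score, precision, recall))
    return points
```

Each tuple describes the classifier's behaviour if only predictions scoring at or above that threshold were kept, which is the trade-off a PR curve plots.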
The paper can be recompiled from the dataset using Bookdown. All text and code for stats/figures are in the paper/ directory.
Supplementary materials for the manuscript are found in supplementaryMaterials/.