GlycoSiteMiner is a literature mining-based pipeline for extracting glycosylation sites from PubMed abstracts. The code for the pipeline is made available as a Docker image that can be pulled using the following command.
docker pull glygen/glycositeminer:1.0.0
After pulling the image, create a data directory and set an environment variable that will contain the path to your data directory. In the example shown below, the directory "/data/glycositeminer/" is used as the data directory.
mkdir -p /data/glycositeminer/
export DATA_PATH=/data/glycositeminer/
To start a container from the image, run the following command
docker run -itd -v $DATA_PATH:/data --name running_glycositeminer glygen/glycositeminer:1.0.0
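To confirm that the container is up before proceeding, you can list it with the standard Docker CLI.
docker ps --filter name=running_glycositeminer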
Use the following command to download and unpack data used by the pipeline.
nohup docker exec -t running_glycositeminer python download-pipeline-data.py &
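Because the command above is started in the background with nohup, its output is written to "nohup.out" in the directory you ran it from; one way to follow progress is to tail that file.
tail -f nohup.out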
When the process started by the above command is done, you should see the following file counts (a quick way to spot-check these counts is shown after the list).
27,933 files under $DATA_PATH/llm_entities/*/llm.*.json
18,087 files under $DATA_PATH/medline_extracts/pmid.*.txt
16,954 files under $DATA_PATH/pubtator_extracts/pmid.*.txt
9,311 files under $DATA_PATH/medline_abstracts/pmid.*.txt
1,245 files under $DATA_PATH/confirmation/*.json
89 files under $DATA_PATH/glygen/*
13 files under $DATA_PATH/misc/*
1 file under $DATA_PATH/gene_info/*
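To spot-check these counts yourself, standard shell tools on the host are sufficient; for example (adjust the directory and file pattern for the other entries in the list above):
find $DATA_PATH/medline_extracts -name "pmid.*.txt" | wc -l
find $DATA_PATH/llm_entities -name "llm.*.json" | wc -l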
NOTE: If you wish to compile these dataset files from their original sources, instructions are given at the bottom of this README.
Next, run the following command to extract entities (site, glyco, gene, extragene, and species) from the abstracts and write them under "$DATA_PATH/entities/".
nohup docker exec -t running_glycositeminer python make-entities.py &
When the process started by the above command is done, you should see the following file counts
9,311 files under $DATA_PATH/entities/site.*
9,311 files under $DATA_PATH/entities/glyco.*
5,917 files under $DATA_PATH/entities/gene.*
4,593 files under $DATA_PATH/entities/extragene.*
7,260 files under $DATA_PATH/entities/species.*
The command given below will integrate the entities extracted in the previous step.
nohup docker exec -t running_glycositeminer python integrate-entities.py &
The command given below will create match-site files from the integrated entities.
nohup docker exec -t running_glycositeminer python make-match-sites.py &
The command given below will generate two files -- "$DATA_PATH/samples/samples_all.csv" containing all samples (used later for making predictions) and "$DATA_PATH/samples/samples_labeled.csv" containing the labeled samples used for cross validation and model training.
nohup docker exec -t running_glycositeminer python make-samples.py &
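For a quick look at the generated sample files (head and wc are standard tools; the column layout is whatever make-samples.py writes, not something defined by this example):
head -2 $DATA_PATH/samples/samples_all.csv
wc -l $DATA_PATH/samples/samples_all.csv $DATA_PATH/samples/samples_labeled.csv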
This step will run 10-fold cross validation using the samples in "$DATA_PATH/samples/samples_labeled.csv", and the output files will be under "$DATA_PATH/validation/". The file "recall_precision.json" contains recall and precision values for both SVM and MLP classifiers, and the confusion matrix values are in the file "confusion_matrix.json". There are also two PNG files, "roc.png" and "cm.png", showing the ROC curves and confusion matrix respectively.
nohup docker exec -t running_glycositeminer python run-cross-validation.py &
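Once cross validation finishes, the JSON outputs can be pretty-printed for a quick review, for example with the Python interpreter already inside the container (the "/data" path in the container corresponds to $DATA_PATH on the host; the structure of these files is defined by the pipeline, not by this example).
docker exec -t running_glycositeminer python -m json.tool /data/validation/recall_precision.json
docker exec -t running_glycositeminer python -m json.tool /data/validation/confusion_matrix.json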
As described in the paper, the commands given below are used to find an optimal threshold on the class probabilities that is suitable for our application. The output of the first command is saved in "$DATA_PATH/tuning/tuning.json", and the second command generates a PNG file "$DATA_PATH/tuning/balanced_accuracy.png" and a cutoffs file "$DATA_PATH/tuning/cutoff.json" which contains the cutoff values to be used when making predictions.
The cutoff values reported in the manuscript were obtained with these commands; the values from your own run may differ slightly.
nohup docker exec -t running_glycositeminer python tuning-step-1.py &
You need to wait until the process started by the above command is finished.
nohup docker exec -t running_glycositeminer python tuning-step-2.py &
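To see the cutoff values produced by your own run (these are the values used in the prediction step below):
cat $DATA_PATH/tuning/cutoff.json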
Using all the samples in "$DATA_PATH/samples/samples_labeled.csv", this step creates the final models for both SVM and MLP classifiers and saves them under "$DATA_PATH/models/".
docker exec -t running_glycositeminer python make-models.py
We can now apply the models to all samples in "$DATA_PATH/samples/samples_all.csv" to make predictions. The output of the command below is saved in "$DATA_PATH/predicted/predicted.csv". The script also outputs a stats file, "$DATA_PATH/predicted/stats.txt", giving the number of sites predicted for each species. Since your cutoff values (in "$DATA_PATH/tuning/cutoff.json") can be slightly different from those reported in the manuscript, the numbers in "$DATA_PATH/predicted/stats.txt" can also differ slightly from the manuscript.
docker exec -t running_glycositeminer python make-predictions.py
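To take a quick look at your prediction output and the per-species counts:
head -3 $DATA_PATH/predicted/predicted.csv
cat $DATA_PATH/predicted/stats.txt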
The following commands give the total number of sites that pass the LLM-based verification (rows flagged "llm_yes"), and the number of those sites that are not already in GlyGen (rows also flagged "in_glygen_no").
$ cat $DATA_PATH/predicted/predicted.csv | grep llm_yes |wc
1745 1745 157875
$ cat $DATA_PATH/predicted/predicted.csv | grep llm_yes |grep in_glygen_no |wc
1118 1118 100784
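Note that wc without options prints line, word, and character counts, so the first number in each output above is the number of matching rows; if you only want that count, "wc -l" can be used instead, for example:
cat $DATA_PATH/predicted/predicted.csv | grep llm_yes | wc -l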
To download and process original data, follow the instructions given below.
Run the following command and the downloaded files will be saved under "$DATA_PATH/glygen/".
nohup docker exec -t running_glycositeminer python download-glygen.py &
Run the following command and the downloaded files will be saved under "$DATA_PATH/gene_info/".
nohup docker exec -t running_glycositeminer python download-gene-info.py &
Run the following command to download the PubMed baseline XML files for 2024, which have file indexes from "1" to "1219". To find the start and end values of the baseline files, visit "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" and look at the first and last *.xml.gz files. For 2024, since the first and last files are "pubmed24n0001.xml.gz" and "pubmed24n1219.xml.gz", the start and end file indexes are "1" and "1219". The downloaded files will be saved under "$DATA_PATH/medline/".
nohup docker exec -t running_glycositeminer python download-medline.py -c baseline -y 24 -s 1 -e 1219 &
Similarly, for the PubMed update files, visit "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/"; there you will find that the first and last files are "pubmed24n1220.xml.gz" and "pubmed24n1600.xml.gz".
nohup docker exec -t running_glycositeminer python download-medline.py -c updatefiles -y 24 -s 1220 -e 1600 &
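If you prefer to determine the last file index programmatically rather than browsing the listing, something along these lines works (a sketch assuming curl and GNU grep/sort are available on the host, and that the listing keeps its current file naming pattern; swap "baseline" for "updatefiles" to get the last update file):
curl -s https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ | grep -o 'pubmed24n[0-9]*\.xml\.gz' | sort -u | tail -1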
Run the following command and the downloaded files will be saved under "$DATA_PATH/pubtator/".
nohup docker exec -t running_glycositeminer python download-pubtator.py &
Next, run the following command to parse the downloaded *.xml.gz files under $DATA_PATH/medline/ and create medline extract files under $DATA_PATH/medline_extracts/. The command runs a parallelization wrapper for another script named "extract-medline-data.py" and will spawn 10 "extract-medline-data.py" processes.
nohup docker exec -t running_glycositeminer python download.py -s wrap-extract-medline-data.py &
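One way to check whether the worker processes are still running is to list the container's processes from the host ("docker top" does not require ps inside the image):
docker top running_glycositeminer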
When the 10 "extract-medline-data.py" processes are done and many files have been created under $DATA_PATH/medline_extracts/, run the following command to parse the downloaded files under $DATA_PATH/pubtator/ and create pubtator extract files under $DATA_PATH/pubtator_extracts/.
nohup docker exec -t running_glycositeminer python download.py -s extract-pubtator-data.py &