This repo contains the companion code for the manuscript: "Pepsickle rapidly and accurately predicts proteasomal cleavage sites for improved neoantigen identification" which can be found here.
All fully trained models have been deployed as a separate software package with instructions for installation and use. This can be found at: https://github.com/pdxgx/pepsickle
1. Download the code in this repository:

```
git clone https://github.com/pdxgx/pepsickle-paper.git
cd ./pepsickle-paper
```
2. Set up and install the necessary libraries (also requires Python 3 and MySQL to be installed). For the full list of Python requirements, see requirements.txt:

```
pip install -r requirements.txt
```
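If you prefer to keep the pinned requirements isolated from your system Python, a virtual environment works well. This is a sketch of our suggestion, not part of the paper's instructions; the environment name is arbitrary:

```shell
# Optional: install the requirements inside an isolated virtual
# environment so pinned versions do not conflict with system packages.
python3 -m venv pepsickle-env
. pepsickle-env/bin/activate
pip install -r requirements.txt
```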
3. Download the requisite datasets too large for repo upload.

Enter the ./data/raw/database_pulls directory and download the IEDB static data dump needed for analysis:

```
cd ./data/raw/database_pulls
wget http://www.iedb.org/downloader.php?file_name=doc/iedb_public.sql.gz
```

NOTE: Paper analysis was performed on data pulled June 29th, 2020. For identical reproduction, subset all extracted data, including IEDB data, to entries on or before that date.

The IEDB database ERD can also be found here.

The AntiJen database query feature is currently not working and has yet to be repaired; however, the processed data from previously working queries can be found here.

For comparing training samples to the human proteome background, download the human proteome from UniProt:

```
cd ../
wget https://www.uniprot.org/uniprot/?query=proteome:UP000005640%20reviewed:yes#
```
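The reproduction note above calls for subsetting extracted entries to the pull date. A minimal sketch of such a filter, assuming a CSV export with an ISO-format date in the third column (the file name and column index here are illustrative, not taken from the pipeline):

```shell
# Hypothetical example: keep the header plus any row dated on or before
# the paper's pull date. ISO dates (YYYY-MM-DD) compare correctly as
# strings. "extracted_data.csv" and column 3 are assumptions; adapt
# them to the actual extracted tables.
CUTOFF="2020-06-29"
awk -F',' -v cutoff="$CUTOFF" 'NR==1 || $3 <= cutoff' \
    extracted_data.csv > extracted_data_subset.csv
```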
4. Return to the main directory and edit the MASTER.sh script to include your MySQL user name and password, using nano or a text editor of your choice:

```
nano MASTER.sh
```

```
#### SETUP
## set working directory to base dir of project
cd /path/to/pepsickle-paper

## set temp environmental vars for mysql use
export MYSQL_USER=[USER]
export MYSQL_PWD=[PASSWORD]
```
5. Create the output directories expected by the pipeline. The following loop creates each directory listed in directory_list.txt:

```
while read -r d; do
  mkdir -p "$d"
done < directory_list.txt
```
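Before launching the pipeline, a quick sanity check (our suggestion, not part of the original instructions) that every listed directory now exists can save a failed run partway through:

```shell
# Print any directory from directory_list.txt that is still missing;
# no output means everything is in place.
while read -r d; do
  [ -d "$d" ] || echo "missing: $d"
done < directory_list.txt
```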
6. Run the following command to iterate through the data retrieval and processing steps. This script runs through the primary analysis and model training pipeline. Alternative models and options mentioned in the manuscript are also available but commented out for streamlining. Some steps are slow and are annotated as such in comment lines.

```
bash run_analysis.sh
```

NOTE: The validation analysis steps at the end of run_analysis.sh require the installation of pepsickle. To install pepsickle using the weights generated by this pipeline, follow step 7. For testing the deployed pepsickle tool on the included validation data, simply follow the installation steps on the pepsickle repo page.
7. Install pepsickle. For assessing performance on validation data, the deployed tool framework is used. To install pepsickle using newly trained model weights instead of those built in by default, change out of the pepsickle-paper directory and then follow these steps (note the relative path assumes pepsickle is cloned alongside pepsickle-paper):

```
git clone https://github.com/pdxgx/pepsickle
cd ./pepsickle/pepsickle
cp ../../pepsickle-paper/data/model_weights/model.joblib .
cd ../
pip install .
```
This will replace the default model.joblib file, containing pre-trained weights, with the weights trained through the rest of the pepsickle-paper pipeline.
This project aggregates data from a variety of databases, including:

Data from peer-reviewed literature was also aggregated. More details on paper-specific data can be found in the main text (Table X).