scripts for data pulls and analysis related to the pepsickle proteasomal cleavage paper

pdxgx/pepsickle-paper

Pepsickle paper repository

This repo contains the companion code for the manuscript "Pepsickle rapidly and accurately predicts proteasomal cleavage sites for improved neoantigen identification", which can be found here.

Trained model availability and use:

All fully trained models have been deployed as a separate software package with instructions for installation and use. This can be found at: https://github.com/pdxgx/pepsickle

How to run analysis in Linux/Unix:

  1. Download the code in this repository:

    git clone https://github.com/pdxgx/pepsickle-paper.git
    cd ./pepsickle-paper
    
  2. Set up and install the necessary libraries (Python 3 and MySQL must also be installed). For the full list of Python requirements, see requirements.txt:

    pip install -r requirements.txt
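
Before installing, it can help to confirm that the prerequisites named in this step are actually on your PATH. The helper below is a minimal sketch, not part of the repository: the function name `missing_tools` and the tool list in the usage comment are assumptions made for illustration.

```shell
# Print the name of every requested tool that cannot be found on PATH.
# missing_tools is a hypothetical helper, not a script shipped with this repo.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example (tool list is an assumption based on the steps in this README):
#   missing_tools python3 pip mysql git wget
```

An empty result means every listed tool was found; any printed name should be installed before continuing.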
    
  3. Download the requisite datasets that are too large to upload to the repo:

    Enter the ./data/raw/database_pulls directory and download the IEDB static data dump needed for analysis:

    NOTE: Paper analysis was performed on data pulled June 29, 2020. For identical reproduction, subset all extracted data, including IEDB data, to entries on or before that date.

    cd ./data/raw/database_pulls
    wget "http://www.iedb.org/downloader.php?file_name=doc/iedb_public.sql.gz"
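
Because the dump is a large download, an interrupted transfer can leave a truncated archive. The sketch below checks the archive with gzip's built-in integrity test before anything tries to load it. The function name `verify_gzip`, the saved file name, and the database name in the commented import line are assumptions; the pipeline script may already perform the import itself.

```shell
# Succeed only if the file exists and passes gzip's integrity test.
# verify_gzip is a hypothetical helper, not part of this repository.
verify_gzip() {
    [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}

# Example (file name assumed; wget may save the dump under a different name):
#   verify_gzip iedb_public.sql.gz || echo "archive is missing or corrupt" >&2
#
# A manual import could then look like this, assuming an existing database
# named iedb_public (an assumed name) and MYSQL_USER/MYSQL_PWD set as in step 4:
#   gunzip -c iedb_public.sql.gz | mysql -u "$MYSQL_USER" iedb_public
```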
    

    The IEDB database ERD can also be found here.

    The AntiJen database query feature is currently not working and has yet to be repaired; however, the processed data from previously working queries can be found here.

    For comparing training samples to the human proteome background, please download the human proteome from UniProt.

    cd ../
    wget "https://www.uniprot.org/uniprot/?query=proteome:UP000005640%20reviewed:yes#"
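
A web query URL can return an HTML page rather than sequence data, so it may be worth checking what was actually saved. The sketch below assumes the proteome is expected in FASTA format (an assumption, since this README does not state the format) and tests whether the first non-empty line is a FASTA header. The function name `looks_like_fasta` and the file name in the usage comment are hypothetical.

```shell
# Succeed if the first non-empty line of the file starts with '>',
# which is how a FASTA header line begins.
# looks_like_fasta is a hypothetical helper, not part of this repository.
looks_like_fasta() {
    [ -f "$1" ] || return 1
    # awk prints the first line that has at least one field, then stops
    first_line="$(awk 'NF { print; exit }' "$1")"
    case "$first_line" in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Example (file name is an assumption):
#   looks_like_fasta human_proteome.fasta || echo "download is not FASTA" >&2
```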
    
  4. Return to the main directory and edit the MASTER.sh script to include your MySQL username and password, using nano or a text editor of your choice.

    nano MASTER.sh

     #### SETUP
     ## set working directory to base dir of project
     cd /path/to/pepsickle-paper
     ## set temp environmental vars for mysql use
     export MYSQL_USER=[USER]
     export MYSQL_PWD=[PASSWORD]
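
If the placeholders are left unedited or the variables are never exported, later database steps may fail with unhelpful errors. The sketch below fails early when either credential variable is unset or empty. The function name `require_mysql_env` is an assumption and is not something the pipeline script is known to contain.

```shell
# Return non-zero with a clear message if MYSQL_USER or MYSQL_PWD is
# unset or empty. require_mysql_env is a hypothetical helper.
require_mysql_env() {
    for name in MYSQL_USER MYSQL_PWD; do
        # eval reads the variable whose name is held in $name; the names
        # come from the fixed list above, so no untrusted input is evaluated
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf 'error: %s is not set\n' "$name" >&2
            return 1
        fi
    done
}

# Example: stop before running the pipeline if credentials are missing
#   require_mysql_env || exit 1
```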
    
  5. Create output directories that are expected by the pipeline. The following directories are needed for proper pipeline output:

    # create each directory listed in directory_list.txt (one path per line)
    while IFS= read -r d; do
         mkdir -p "$d"
    done < directory_list.txt
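
The same idea can be wrapped in a small function that also confirms each directory exists afterwards. This is a sketch under the assumption that directory_list.txt holds one path per line; the function name `make_dirs_from_list` is hypothetical.

```shell
# Create every directory named in a list file (one path per line) and
# report any that still do not exist afterwards.
# make_dirs_from_list is a hypothetical helper, not part of this repository.
make_dirs_from_list() {
    list_file="$1"
    status=0
    # the "|| [ -n "$dir" ]" clause handles a final line with no newline
    while IFS= read -r dir || [ -n "$dir" ]; do
        # skip blank lines
        [ -n "$dir" ] || continue
        mkdir -p -- "$dir"
        if [ ! -d "$dir" ]; then
            printf 'missing: %s\n' "$dir" >&2
            status=1
        fi
    done < "$list_file"
    return "$status"
}

# Example: make_dirs_from_list directory_list.txt
```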
    
  6. Run the following command to iterate through data retrieval and processing steps:

    This script runs through the primary analysis and model training pipeline. Alternative models and options mentioned in the manuscript are also available but commented out for streamlining. Some steps are slow and annotated as such in comment lines.

    bash run_analysis.sh

    NOTE: The validation analysis steps at the end of run_analysis.sh require the installation of pepsickle. To install pepsickle using the weights generated by this pipeline, follow step 7. For testing the deployed pepsickle tool on the included validation data, simply follow the installation steps on the pepsickle repo page.

  7. Install pepsickle. For assessing performance on validation data, the deployed tool framework is used. To install pepsickle using newly trained model weights instead of those built in by default, change out of the pepsickle-paper directory and then follow these steps:

    git clone https://github.com/pdxgx/pepsickle
    cd ./pepsickle/pepsickle
    cp ../../pepsickle-paper/data/model_weights/model.joblib .
    cd ../
    pip install .
    

    This will replace the default model.joblib file containing pre-trained weights with those trained through the rest of the pepsickle-paper pipeline.
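
To confirm that the installed package really uses the newly trained weights, the two files can be compared byte for byte. This is a minimal sketch; the function name `same_file_contents` is hypothetical, and the paths in the usage comment assume both repositories were cloned side by side in the same parent directory.

```shell
# Succeed only if both files exist and have identical contents.
# same_file_contents is a hypothetical helper, not part of either repository.
same_file_contents() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Example, run from the parent directory of both clones (layout assumed):
#   same_file_contents ./pepsickle-paper/data/model_weights/model.joblib \
#                      ./pepsickle/pepsickle/model.joblib \
#       || echo "weights were not replaced" >&2
```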

Information on data sources:

This project aggregates data from a variety of databases, including:

Data from peer-reviewed literature was also aggregated. More details on paper-specific data can be found in the main text (Table X).
