Pepsickle paper repository

This repo contains the companion code for the manuscript: "Pepsickle rapidly and accurately predicts proteasomal cleavage sites for improved neoantigen identification" which can be found here.

Trained model availability and use:

All fully trained models have been deployed as a separate software package with instructions for installation and use. This can be found at: https://github.com/pdxgx/pepsickle

How to run analysis in Linux/Unix:

  1. Download the code in this repository:

    git clone https://github.com/pdxgx/pepsickle-paper.git
    cd ./pepsickle-paper
    
  2. Set up and install the necessary libraries (this also requires Python 3 and MySQL to be installed). For the full list of Python requirements, see requirements.txt:

    pip install -r requirements.txt
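
Before installing, it can help to confirm that the tools this README relies on are actually on your PATH. The following is a minimal sketch, not part of the repository: `check_prereqs` is a hypothetical helper, and the executable names (`python3`, `pip`, `mysql`) are assumptions that may differ on your system.

```shell
# Hypothetical helper (not part of this repo): report which commands are missing.
# Prints "missing: NAME" for each absent command and returns non-zero if any is absent.
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them for your system (e.g. pip3).
check_prereqs python3 pip mysql git wget || echo "Install the tools listed above before continuing."
```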
    
  3. Download the requisite datasets that are too large to upload to the repo:

    Enter the ./data/raw/database_pulls directory and download the IEDB static data dump needed for analysis:

    NOTE: Paper analysis was performed on data pulled June 29, 2020. For identical reproduction, subset all extracted data, including IEDB data, to entries on or before that date.

    cd ./data/raw/database_pulls
    wget "http://www.iedb.org/downloader.php?file_name=doc/iedb_public.sql.gz"
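
Because the dump is large, an interrupted download is easy to miss. A quick integrity check along the following lines may help. This is a sketch only: `is_valid_gzip` is a hypothetical helper, and the local file name `iedb_public.sql.gz` is an assumption, since wget may save the file under a different name.

```shell
# Hypothetical helper: succeed only if the file is non-empty and passes gzip's self-test.
is_valid_gzip() {
    [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Assumed local file name; change it to whatever wget actually saved.
dump="iedb_public.sql.gz"
if is_valid_gzip "$dump"; then
    echo "$dump looks intact"
else
    echo "$dump is missing, empty, or not a valid gzip archive"
fi
```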
    

    The IEDB database ERD can also be found here.

    The AntiJen database query feature is currently not working and has yet to be repaired; however, the processed data from previously working queries can be found here.

    For comparing training samples to the human proteome background, please download the human proteome from UniProt.

    cd ../
    wget "https://www.uniprot.org/uniprot/?query=proteome:UP000005640%20reviewed:yes#"
    
  4. Return to the main directory and edit the MASTER.sh script to include your MySQL username and password, using nano or a text editor of your choice.

    nano MASTER.sh

     #### SETUP
     ## set working directory to base dir of project
     cd /path/to/pepsickle-paper
     ## set temp environmental vars for mysql use
     export MYSQL_USER=[USER]
     export MYSQL_PWD=[PASSWORD]
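
A small guard can catch the case where the credentials were never filled in. The sketch below is an illustration rather than part of the pipeline: `require_env` is a hypothetical helper that only checks that `MYSQL_USER` and `MYSQL_PWD` are non-empty in the current shell; it does not test whether they are valid for your MySQL server.

```shell
# Hypothetical helper: fail and name the first listed variable that is unset or empty.
require_env() {
    for name in "$@"; do
        # Indirect lookup via eval keeps this portable to plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name" >&2
            return 1
        fi
    done
}

if require_env MYSQL_USER MYSQL_PWD; then
    echo "MySQL credentials are set"
else
    echo "Set MYSQL_USER and MYSQL_PWD before running the pipeline"
fi
```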
    
  5. Create the output directories expected by the pipeline. The following loop creates every directory listed in directory_list.txt:

    while IFS= read -r d; do
        mkdir -p "$d"
    done < directory_list.txt
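
To confirm that every expected directory is in place before starting the pipeline, a check like the one below can list whatever is still missing. It is a sketch under the assumption that directory_list.txt holds one path per line, relative to the repository root; `missing_dirs` is a hypothetical helper name.

```shell
# Hypothetical helper: print each directory named in the list file that does not exist.
# Blank lines are skipped; a final line without a trailing newline is still read.
missing_dirs() {
    while IFS= read -r d || [ -n "$d" ]; do
        [ -z "$d" ] && continue
        [ -d "$d" ] || echo "$d"
    done < "$1"
}

# Run from the repository root; no output means nothing is missing.
if [ -f directory_list.txt ]; then
    missing_dirs directory_list.txt
fi
```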
    
  6. Run the following command to iterate through the data retrieval and processing steps:

    This script runs through the primary analysis and model training pipeline. Alternative models and options mentioned in the manuscript are also available but commented out for streamlining. Some steps are slow and annotated as such in comment lines.

    bash run_analysis.sh

    NOTE: The validation analysis steps at the end of run_analysis.sh require the installation of pepsickle. To install pepsickle using the weights generated by this pipeline, follow step 7. For testing the deployed pepsickle tool on the included validation data, simply follow the installation steps on the pepsickle repo page.

  7. Install pepsickle. For assessing performance on validation data, the deployed tool framework is used. To install pepsickle using newly trained model weights instead of those built in by default, change out of the pepsickle-paper directory and then follow these steps:

    git clone https://github.com/pdxgx/pepsickle
    cd ./pepsickle/pepsickle
    cp ../../pepsickle-paper/data/model_weights/model.joblib .
    cd ../
    pip install .
    

    This will replace the default model.joblib file containing pre-trained weights with those trained through the rest of the pepsickle-paper pipeline.
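
One way to confirm that the copy took effect is to compare the two files byte for byte. The sketch below assumes that pepsickle and pepsickle-paper were cloned side by side in the same parent directory and that it is run from that parent directory; `weights_match` is a hypothetical helper. It only compares files in the source trees, not the copy that pip places in your Python environment.

```shell
# Hypothetical helper: succeed only if both files exist and are byte-for-byte identical.
weights_match() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Paths assume the two repositories are sibling directories.
trained="pepsickle-paper/data/model_weights/model.joblib"
packaged="pepsickle/pepsickle/model.joblib"
if weights_match "$trained" "$packaged"; then
    echo "pepsickle source tree contains the newly trained weights"
else
    echo "weights differ or a file is missing"
fi
```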

Information on data sources:

This project aggregates data from a variety of databases, including:

Data from peer-reviewed literature were also aggregated. More details on paper-specific data can be found in the main text (Table X).