REcognition of Cluster-transition Determining Sites (RECDS)

RECDS evaluates the contribution of a specific amino acid site on the HA protein to the whole history of antigenic evolution. In RECDS, we rank all HA sites by contribution scores derived from a forest of Gradient Boosting Classifiers trained on various sequence-based and structure-based features.

The RECDS scripts in the code directory, including those for feature extraction, model training and tree coloring, are copyrighted by the Aiping Wu lab.

Important: The project uses the Python 2.7 interpreter. Before running the pipeline, edit the Python script files and set the paths to match your computing environment and third-party software installation.

The repo contains the following:

1. Raw data used in this project

The antigenic clustering and raw sequences used in RECDS can be found in the dataset/ directory.

H3N2

H3N2AntigenicCluster.strains: The strain names of A/H3N2 used by RECDS

H3N2/source/: Seven antigenic clusters derived from Smith’s studies including raw HA1 protein sequences.

H3N2/nonredundant/: The HA1 protein sequences after redundancy removal and sequence alignment.

Koel, B.F., et al. Substitutions near the receptor binding site determine major antigenic change during influenza virus evolution. Science 2013;342(6161):976-979.

H1N1

H1N1AntigenicCluster.strains: The strain names of A/H1N1 used by RECDS

H1N1/source/: Seven antigenic clusters derived from our previous studies including raw HA1 protein sequences.

H1N1/nonredundant/: The HA1 protein sequences after redundancy removal and sequence alignment.

Liu, M., et al. Antigenic Patterns and Evolution of the Human Influenza A (H1N1) Virus. Sci Rep 2015;5:14171.
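
The nonredundant/ directories are described as the result of removing redundant HA1 sequences. A minimal sketch of that kind of de-duplication follows. It assumes FASTA input and keeps the first strain seen for each distinct sequence, which may differ from the rule actually used to build the dataset. The function names and the strain names in the example are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def remove_redundant(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen, kept = set(), []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


# Hypothetical strain names and toy sequences, for illustration only.
example = """>A/strain1
QKLPGNDNS
>A/strain2
QKLPGNDNS
>A/strain3
QKLPGSDNS
""".splitlines()

unique = remove_redundant(read_fasta(example))
print([name for name, _ in unique])
```

In this toy input, strain2 duplicates strain1 and is dropped.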

2. Pipeline and result of the project

The pipeline uses the Python 2.7 interpreter. The source code of the RECDS pipeline can be found in code/sourceCode/script/. The major steps are a) feature extraction, b) GBC model training and c) phylogenetic tree coloring. The scripts they depend on can be found in code/sourceCode/python/ and tools/.

a) Feature extraction:

The features include the PIMA score and other metrics calculated from the predicted 3D structure.

PIMA score

Modify 1_PIMAscore.py and set the “seqin” path to the file containing the sequences of the viruses named in dData.
dir/dData lists the virus pair names; an example of its format is in InputH3N2Example.
dir/features/dPIMA is the output feature file.

./1_PIMAscore.py dir/dData dir/features/dPIMA
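
Because the features describe virus pairs, a natural first step is to list the positions at which two aligned HA1 sequences differ. The sketch below does only that. It does not compute the PIMA score itself, and the function name and toy sequences are hypothetical rather than taken from 1_PIMAscore.py.

```python
def substitutions(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for each differing aligned site.

    Positions are 1-based. Sites with a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = []
    for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a != b and a != "-" and b != "-":
            diffs.append((pos, a, b))
    return diffs


# Toy aligned fragments, not real HA1 sequences.
print(substitutions("QKLPGNDNS", "QKIPGSD-S"))
```

A substitution score such as PIMA could then be looked up for each (residue_a, residue_b) pair returned.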

MODELER structure

Modify 2_1_run_Modeler.py and set the paths, such as runModeller, fastaDir and outDir.
seqName lists the virus names; an example of its format is in InputH3N2Example.
dir/Modeler is the output location set as outDir in 2_1_run_Modeler.py.
dir/newModeler holds the renamed Modeller structure files.

./2_1_run_Modeler.py dir/seqName # build modeler structure
./2_2_get_pdb.py dir/seqName dir/Modeler dir/newModeler # rename modeler model name

Relative solvent accessibility

Modify 3_1_runDssp.py and set the paths: outdir for the dssp results and Modeler for dir/newModeler.
dir/seqMap contains the virus names; an example of its format is in InputH3N2Example.
Modify 3_2_get_dssp.py and set the dsspfile path to the outdir used for the dssp results in 3_1_runDssp.py.
dir/features/dDssp is the output feature file.

./3_1_runDssp.py dir/seqMap
./3_2_get_dssp.py dir/dData dir/features/dDssp
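
Relative solvent accessibility is usually obtained by dividing the absolute accessible surface area reported by dssp by a maximum value for that residue type. The sketch below shows the arithmetic only. The three maximum values are illustrative, the cap at 1.0 is a common convention rather than something stated for 3_2_get_dssp.py, and the function name is hypothetical.

```python
# Example maximum accessible surface areas in square angstroms.
# ASSUMPTION: illustrative values for three residue types only; substitute
# the complete published scale that matches your analysis.
MAX_ASA = {"A": 106.0, "G": 84.0, "K": 205.0}


def relative_accessibility(residue, absolute_asa, max_asa=MAX_ASA):
    """Return absolute ASA divided by the residue-type maximum, capped at 1.0."""
    if residue not in max_asa:
        raise KeyError("no maximum ASA for residue %r" % residue)
    return min(absolute_asa / max_asa[residue], 1.0)


print(relative_accessibility("A", 53.0))
```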

Protein stability change

Modify 4_1_run_ddG.py and set the “seqin” path to the file containing the sequences of the viruses named in dData.
dir/dSEF is the output location for the ddG results.
Modify 4_2_get_ddG.py and set the ddGDir path to the outdir used for the SEF results in 4_1_run_ddG.py.
dir/features/dSEF is the output feature file.

./4_1_run_ddG.py dir/dData dir/dSEF
./4_2_get_ddG.py dir/dData dir/features/dSEF

Half-sphere exposure (HSE)-up

Modify 5_1_runHSE.py and set the paths: outdir for the HSE results and ModelDir for dir/newModeler.
Modify 5_2_get_HSE.py and set the hsefile path to the outdir used for the HSE results in 5_1_runHSE.py.
dir/features/dHSE is the output feature file.

./5_1_runHSE.py dir/dData
./5_2_get_HSE.py dir/dData dir/features/dHSE
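
For orientation, HSE-up for a residue is the number of other Cα atoms that lie within a fixed radius of its Cα atom and on the side of the sphere pointed to by its Cα→Cβ vector. The sketch below implements that definition on plain coordinate tuples. The 13 Å default radius is an assumption (the value used by 5_1_runHSE.py is not stated here), and the function name and toy coordinates are hypothetical.

```python
import math


def hse_up(ca_coords, cb_coords, index, radius=13.0):
    """Count CA atoms within `radius` of residue `index` that lie in the
    half-sphere pointing along that residue's CA->CB vector."""
    ca = ca_coords[index]
    cb = cb_coords[index]
    direction = [cb[k] - ca[k] for k in range(3)]
    count = 0
    for j, other in enumerate(ca_coords):
        if j == index:
            continue
        offset = [other[k] - ca[k] for k in range(3)]
        distance = math.sqrt(sum(c * c for c in offset))
        if distance > radius:
            continue
        # A positive dot product means the neighbour is on the CB side.
        if sum(offset[k] * direction[k] for k in range(3)) > 0:
            count += 1
    return count


# Toy coordinates in angstroms; residue 0 has its side chain along +z.
ca = [(0, 0, 0), (0, 0, 5), (0, 0, -5), (0, 3, 4), (0, 0, 20)]
cb = [(0, 0, 1.5), (0, 0, 6.5), (0, 0, -6.5), (0, 3, 5.5), (0, 0, 21.5)]
print(hse_up(ca, cb, 0))
```

In the toy data, two neighbours sit on the Cβ side within the radius, one sits on the opposite side, and one is too far away to count.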

Energy changes

Modify 6_1_run_Amber.py and set the paths.
10000 is the name of the Modeller model.
dir/amber/10000 is the output directory.
dir/newModeler is the Modeller structure directory.
Modify 6_2_get_Amber.py and set the amberfile path to the outdir used for the Amber results in 6_1_run_Amber.py.
dir/features/dAmber is the output feature file.

# Van der Waals energy and the non-polar solvation energy
./6_1_run_Amber.py 10000 dir/amber/10000 dir/newModeler
./6_2_get_Amber.py dir/dData dir/features/dAmber

Merge six different features

Modify 7_merge_feature.py and set the featureDir path to the directory containing the six features above.
dir/dFeatures is the merged output.

./7_merge_feature.py dir/dFeatures
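
The merge step joins the per-feature outputs into one table. A minimal sketch of such a join follows. It assumes each feature file holds tab-separated `key<TAB>value` lines, which is a hypothetical format and not necessarily the one read by 7_merge_feature.py, and it keeps only the keys present in every feature.

```python
def parse_feature(lines):
    """Parse 'key<TAB>value' lines into a dict of floats."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split("\t")
        table[key] = float(value)
    return table


def merge_features(named_tables):
    """Join feature tables on their shared keys.

    named_tables: list of (feature_name, dict) in the desired column order.
    Returns (header, rows) with rows sorted by key.
    """
    shared = set(named_tables[0][1])
    for _, table in named_tables[1:]:
        shared &= set(table)
    header = ["key"] + [name for name, _ in named_tables]
    rows = [[key] + [table[key] for _, table in named_tables]
            for key in sorted(shared)]
    return header, rows


# Toy values; the keys stand in for virus pair identifiers.
pima = parse_feature(["pair1\t2.0", "pair2\t4.0"])
dssp = parse_feature(["pair1\t0.35", "pair2\t0.80", "pair3\t0.10"])
header, rows = merge_features([("dPIMA", pima), ("dDssp", dssp)])
print(header)
print(rows)
```

Dropping keys that are missing from any feature avoids rows with empty cells; whether the original script drops or fills such rows is not documented here.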

b) Train GBC model

Use the features extracted in the previous steps to train the Gradient Boosting Classifier model.

9_cross_train.py and siteImportance_diffComb.py are similar to 8_siteImportance.py; they are our experimental scripts.
dir/dFeatures is the features file for antigenically different virus pairs.
dir/sFeatures is the features file for antigenically similar virus pairs.
dir/train/allImportance is the score output file.

# calculate contribution score for every amino acid position
./8_siteImportance.py dir/dFeatures dir/sFeatures dir/train/allImportance
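
To illustrate how a fitted Gradient Boosting Classifier exposes importances, the sketch below trains scikit-learn's `GradientBoostingClassifier` on synthetic data and prints `feature_importances_`. It is not the procedure in 8_siteImportance.py: the data are random, the labels are hypothetical, the feature names are labels chosen for illustration, and the way RECDS converts model output into per-site contribution scores is not reproduced.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for the merged features: 200 virus pairs x 6 features.
X = rng.normal(size=(200, 6))
# Hypothetical labels: 1 = antigenically different pair, 0 = similar pair.
# Only the first feature carries signal in this toy data.
y = (X[:, 0] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# One non-negative importance per feature, normalized to sum to 1.
importances = model.feature_importances_
feature_names = ["PIMA", "RSA", "ddG", "HSE-up", "vdW", "nonpolar"]
for name, score in zip(feature_names, importances):
    print("%-9s %.3f" % (name, score))
```

Because the toy labels depend only on the first column, nearly all of the importance is assigned to it.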

c) Custom-made phylogenetic analysis.

Build the phylogenetic tree, then plot and color it according to the amino acid at each candidate position.

The phylogenetic trees of A/H3N2 and A/H1N1 were built under a GTR substitution model using RAxML (version 8.2.9). The GAMMA model of rate heterogeneity and other model parameters were estimated by RAxML.
The dedicated phylogenetic visualization R package ggtree (v1.6.1) was used to make the plot.
The result can be found in colorTree.zip.
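
Coloring a tree by the amino acid at a candidate position comes down to a strain-to-color mapping. The sketch below builds such a mapping in Python from aligned sequences. The function names, strain names and palette are hypothetical, and the ggtree plotting itself is not shown.

```python
def residue_at(aligned_seqs, position):
    """Map strain name -> residue at a 1-based alignment position."""
    return dict((name, seq[position - 1]) for name, seq in aligned_seqs.items())


def assign_colors(residues, palette):
    """Give each distinct residue a color, in sorted residue order."""
    distinct = sorted(set(residues.values()))
    if len(distinct) > len(palette):
        raise ValueError("palette has too few colors")
    color_of = dict(zip(distinct, palette))
    return dict((name, color_of[res]) for name, res in residues.items())


# Toy aligned fragments keyed by hypothetical strain names.
seqs = {"strainA": "QKL", "strainB": "QNL", "strainC": "QKL"}
colors = assign_colors(residue_at(seqs, 2), ["red", "blue"])
print(sorted(colors.items()))
```

The resulting mapping could be written to a table and joined to the tree tips in the plotting tool.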

3. Third-party software installation:

Most programs in code/sourceCode/ were developed in the Aiping Wu lab, and permission to use them is granted here. Some programs and databases, however, were developed by third-party groups. If you experience installation problems, please check the third-party websites.

Install python dependencies

Python modules pandas, biopython and scikit-learn are required. Installation using pip is recommended.

pip install pandas biopython scikit-learn

Install Modeller

Homology modelling of the protein 3D structure is done with Modeller. Download and installation instructions can be found at https://salilab.org/modeller/. Modeller is written in Python and C and can be imported as a Python module.

Install dssp

Calculating the relative solvent accessibility (RSA) of protein residues requires dssp. Download and installation instructions can be found at https://swift.cmbi.umcn.nl/gv/dssp/. Alternatively, the precompiled package can be installed on a Linux machine:

sudo apt-get install dssp

Install SEF

We calculated ∆∆G, the change in protein stability upon a single point mutation, using the statistical energy function (SEF). The software was developed by Xiong Peng, who modified the application program interfaces (APIs) of the original program for our use. The input is therefore simple: it needs only the protein structure and the mutation information (wild type, position, mutant type). We have obtained his approval to release his modified software with our work; it can be downloaded at https://drive.google.com/open?id=1Hls89AV6r5DNMXyDfC-qv4YhxkW_iaov. If you use SEF, please cite:

Xiong, P., et al., Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat Commun, 2014. 5: p. 5330
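
The mutation information needed by SEF (wild type, position, mutant type) can be held as a simple triple. The sketch below parses the common one-letter notation, for example `K145N`, into that triple and checks it against a sequence. The notation, the function names and the toy sequence are assumptions; the input format that SEF itself expects is not documented here.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(text):
    """Parse a string such as 'K145N' into (wild_type, position, mutant)."""
    match = _MUTATION.match(text.strip().upper())
    if match is None:
        raise ValueError("not a point mutation: %r" % text)
    wild, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if wild not in AMINO_ACIDS or mutant not in AMINO_ACIDS:
        raise ValueError("unknown amino acid in %r" % text)
    return wild, position, mutant


def matches_sequence(sequence, mutation):
    """Check that the wild-type residue sits at the 1-based position."""
    wild, position, _ = mutation
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild


mutation = parse_mutation("K2N")
print(mutation, matches_sequence("QKL", mutation))
```

Checking the wild-type residue against the modelled sequence before running the energy calculation catches off-by-one numbering errors early.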

Install Amber

Fill out the form from http://ambermd.org/GetAmber.php and click Download.

tar -jxvf AmberTools14.tar.bz2
rm AmberTools14.tar.bz2
cd amber14
echo "export AMBERHOME=$yourpath/amber14" >> ~/.bashrc
source ~/.bashrc
sudo apt-get install csh flex gfortran g++ xorg-dev zlib1g-dev libbz2-dev patch python-tk python-matplotlib
./configure gnu
source amber.sh
make install
make test  # which will run tests and will report successes or failures
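
After installing the third-party tools, a quick check that they are reachable can save a failed pipeline run. The sketch below reports executables that are not on PATH and environment variables that are unset. It uses `shutil.which`, so it needs Python 3.3 or later rather than the Python 2.7 used by the pipeline. The executable name `mkdssp` is an assumption (your installation may call it `dssp`), and the function name is hypothetical.

```python
import os
import shutil


def missing_tools(executables, env_vars):
    """Return the executables not found on PATH and the unset variables.

    Unset or empty environment variables are reported with a leading '$'.
    """
    missing = [name for name in executables if shutil.which(name) is None]
    missing += ["$" + var for var in env_vars if not os.environ.get(var)]
    return missing


# ASSUMPTION: 'mkdssp' is the dssp executable name on this machine.
print(missing_tools(["mkdssp"], ["AMBERHOME"]))
```

An empty list means everything requested was found.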
