RECDS evaluates the contribution of a specific amino acid site on the HA protein over the whole history of antigenic evolution.
In RECDS, we ranked all of the HA sites by contribution scores derived from a forest of Gradient Boosting Classifiers trained on various sequence-based and structure-based features.
Copyright of the RECDS scripts in the code directory, including those for extracting features, training the model, and coloring trees, is reserved by the Aiping Wu lab.
Important: The project uses the Python 2.7 interpreter. Before you run the pipeline, modify the Python script files and set the paths to match your computing environment and third-party software installation.
The repo contains the following:
The antigenic clustering and raw sequences used in RECDS can be found in the dataset/ directory.
- H3N2AntigenicCluster.strains: The strain names of A/H3N2 used by RECDS.
- H3N2/source/: Seven antigenic clusters derived from Smith’s studies, including raw HA1 protein sequences.
- H3N2/nonredundant/: The HA1 protein sequences after redundancy removal and sequence alignment.
Koel, B.F., et al. Substitutions near the receptor binding site determine major antigenic change during influenza virus evolution. Science 2013;342(6161):976-979.
- H1N1AntigenicCluster.strains: The strain names of A/H1N1 used by RECDS.
- H1N1/source/: Seven antigenic clusters derived from our previous studies, including raw HA1 protein sequences.
- H1N1/nonredundant/: The HA1 protein sequences after redundancy removal and sequence alignment.
Liu, M., et al. Antigenic Patterns and Evolution of the Human Influenza A (H1N1) Virus. Sci Rep 2015;5:14171.
The pipeline uses the Python 2.7 interpreter. Source code of the RECDS pipeline can be found in code/sourceCode/script/. The major steps include a) feature extraction, b) GBC model training, and c) phylogenetic tree coloring. The dependency scripts can be found in code/sourceCode/python/ and tools/.
The extracted features include the PIMA score and other metrics calculated from the predicted 3D structure.
- Modify 1_PIMAscore.py and set the “seqin” path, which contains the virus sequences named in dData.
- dir/dData lists the virus pair names; an example of its format can be found in InputH3N2Example.
- dir/features/dPIMA is the output feature file.
./1_PIMAscore.py dir/dData dir/features/dPIMA
- Modify 2_1_run_Modeler.py and set the paths, such as runModeller, fastaDir, and outDir.
- seqName lists the virus names; an example of its format can be found in InputH3N2Example.
- dir/Modeler is the output location set as outDir in 2_1_run_Modeler.py.
- dir/newModeler is the output location for the renamed Modeller structures.
./2_1_run_Modeler.py dir/seqName # build modeler structure
./2_2_get_pdb.py dir/seqName dir/Modeler dir/newModeler # rename modeler model name
- Modify 3_1_runDssp.py and set the paths: outdir for the DSSP results and Modeler for dir/newModeler. dir/seqMap contains the virus names; an example of its format can be found in InputH3N2Example.
- Modify 3_2_get_dssp.py and set the dsspfile path to the outdir for DSSP results used in 3_1_runDssp.py.
- dir/features/dDssp is the output feature file.
./3_1_runDssp.py dir/seqMap
./3_2_get_dssp.py dir/dData dir/features/dDssp
- Modify 4_1_run_ddG.py and set the “seqin” path, which contains the virus sequences named in dData.
- dir/dSEF is the output file for ddG.
- Modify 4_2_get_ddG.py and set the ddGDir path to the outdir for SEF results used in 4_1_run_ddG.py.
- dir/features/dSEF is the output feature file.
./4_1_run_ddG.py dir/dData dir/dSEF
./4_2_get_ddG.py dir/dData dir/features/dSEF
- Modify 5_1_runHSE.py and set the paths: outdir for the HSE results and ModelDir for dir/newModeler.
- Modify 5_2_get_HSE.py and set the hsefile path to the outdir for HSE results used in 5_1_runHSE.py.
- dir/features/dHSE is the output feature file.
./5_1_runHSE.py dir/dData
./5_2_get_HSE.py dir/dData dir/features/dHSE
- Modify 6_1_run_Amber.py and set the paths.
- 10000 is the Modeller model name.
- dir/amber/10000 is the output directory.
- dir/newModeler is the Modeller structure directory.
- Modify 6_2_get_Amber.py and set the amberfile path to the outdir for Amber results used in 6_1_run_Amber.py.
- dir/features/dAmber is the output feature file.
# Van der Waals energy and the non-polar solvation energy
./6_1_run_Amber.py 10000 dir/amber/10000 dir/newModeler
./6_2_get_Amber.py dir/dData dir/features/dAmber
- Modify 7_merge_feature.py and set the featureDir path, which contains the six features above.
- dir/dFeatures is the output directory.
./7_merge_feature.py dir/dFeatures
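The feature-extraction commands above have to run in order, since later steps read the outputs of earlier ones. The following is only a convenience sketch, not part of the RECDS code: it collects the commands exactly as listed above and either prints or runs them. The function names are hypothetical, and it assumes that the scripts are run from code/sourceCode/script/ after their internal paths have been edited, and that the Amber output directory is named after the model (as in dir/amber/10000).

```python
import subprocess


def feature_commands(d="dir", model="10000"):
    """Return the feature-extraction commands in the order listed above.

    d and model are placeholders taken from the examples in this README.
    """
    f = d + "/features"
    return [
        ["./1_PIMAscore.py", d + "/dData", f + "/dPIMA"],
        ["./2_1_run_Modeler.py", d + "/seqName"],
        ["./2_2_get_pdb.py", d + "/seqName", d + "/Modeler", d + "/newModeler"],
        ["./3_1_runDssp.py", d + "/seqMap"],
        ["./3_2_get_dssp.py", d + "/dData", f + "/dDssp"],
        ["./4_1_run_ddG.py", d + "/dData", d + "/dSEF"],
        ["./4_2_get_ddG.py", d + "/dData", f + "/dSEF"],
        ["./5_1_runHSE.py", d + "/dData"],
        ["./5_2_get_HSE.py", d + "/dData", f + "/dHSE"],
        # Assumption: the Amber output directory is dir/amber/<model name>.
        ["./6_1_run_Amber.py", model, d + "/amber/" + model, d + "/newModeler"],
        ["./6_2_get_Amber.py", d + "/dData", f + "/dAmber"],
        ["./7_merge_feature.py", d + "/dFeatures"],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or run it and stop at the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)
```

Calling `run_all(feature_commands())` only prints the commands; pass `dry_run=False` once every script has been configured for your environment.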
Use the features extracted in the previous steps to train the Gradient Boosting Classifier model.
- 9_cross_train.py and siteImportance_diffComb.py are similar to 8_siteImportance.py; they are our experiment scripts.
- dir/dFeatures is the feature file for antigenically different virus pairs.
- dir/sFeatures is the feature file for antigenically similar virus pairs.
- dir/train/allImportance is the score output file.
# calculate contribution score for every amino acid position
./8_siteImportance.py dir/dFeatures dir/sFeatures dir/train/allImportance
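To illustrate the general idea behind a contribution score, here is a minimal sketch that trains one scikit-learn GradientBoostingClassifier on feature rows for antigenically different and similar virus pairs and ranks the feature columns by importance. It is not the 8_siteImportance.py implementation: RECDS uses a forest of classifiers, and how feature columns map to HA sites is not described here. The function name, the array inputs, and the column names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def rank_features(d_features, s_features, names, seed=0):
    """Rank feature columns by importance in a single GBC.

    d_features: rows for antigenically different pairs (label 1)
    s_features: rows for antigenically similar pairs (label 0)
    names: one label per feature column (hypothetical naming)
    """
    X = np.vstack([d_features, s_features])
    y = np.array([1] * len(d_features) + [0] * len(s_features))
    clf = GradientBoostingClassifier(random_state=seed)
    clf.fit(X, y)
    importances = clf.feature_importances_
    # Sort column indices from most to least important.
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]
```

Averaging such rankings over several classifiers trained on different subsets would be one way to approximate a forest, but that choice is an assumption and not taken from the RECDS scripts.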
Build the phylogenetic tree, then plot and color it according to the amino acid type at the candidate positions.
- The phylogenetic trees of A/H3N2 and A/H1N1 were built under a GTR substitution model using RAxML (version 8.2.9). The GAMMA model of rate heterogeneity and other model parameters were estimated by RAxML.
- The dedicated phylogenetic visualization R package ggtree (v1.6.1) was used to make the plot.
- The result can be found in colorTree.zip.
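Coloring a tree by amino acid type comes down to grouping strains by the residue they carry at a candidate position. The sketch below shows only that grouping step in Python; it is not the ggtree code used for the published plots. The function name, the dictionary input, and the palette are hypothetical.

```python
def color_by_residue(aligned, position, palette):
    """Assign a color to each strain by its residue at one position.

    aligned: dict mapping strain name to aligned HA1 sequence
    position: 1-based column in the alignment
    palette: list of color names, reused cyclically if too short
    """
    groups = {}
    for strain, seq in aligned.items():
        residue = seq[position - 1]
        groups.setdefault(residue, []).append(strain)
    colors = {}
    # Sort residues so the color assignment is reproducible.
    for i, residue in enumerate(sorted(groups)):
        for strain in groups[residue]:
            colors[strain] = palette[i % len(palette)]
    return colors
```

The resulting strain-to-color mapping could then be written to a table and joined to the tree tips in whichever plotting tool is used.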
While most of the programs in code/sourceCode/ were developed in the Aiping Wu lab and are released here with permission for use, some programs and databases were developed by third-party groups. If you experience installation problems, please check the third-party websites.
The Python modules pandas, biopython, and scikit-learn are required. Installation using pip is recommended.
pip install pandas biopython scikit-learn
Homology modelling of protein 3D structures is performed with Modeller. Download and installation instructions can be found at https://salilab.org/modeller/. Modeller is written in Python and C and can be imported as a Python module.
The calculation of relative solvent accessibility (RSA) for protein residues requires dssp. Download and installation instructions can be found at https://swift.cmbi.umcn.nl/gv/dssp/. Alternatively, the pre-compiled package can be installed on a Linux machine:
sudo apt-get install dssp
We calculated ∆∆G, which denotes the change in protein stability upon a single point mutation, using the statistical energy function (SEF). This software was developed by Xiong Peng, who modified the application program interfaces (APIs) of the original program for our use. The input is therefore simple: it needs only the protein structure and the mutation information (wild type, position, mutant type). We have obtained his approval to release his modified software with our work; it can be downloaded at https://drive.google.com/open?id=1Hls89AV6r5DNMXyDfC-qv4YhxkW_iaov. If you use SEF, please cite the following:
Xiong, P., et al., Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat Commun, 2014. 5: p. 5330
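The mutation information that SEF needs (wild type, position, mutant type) can be derived by comparing two aligned sequences column by column. The following is a minimal sketch of that comparison, not the code in 4_1_run_ddG.py; the function name is hypothetical, and it assumes that both sequences come from the same alignment and that gap columns should be skipped.

```python
def point_mutations(wild, mutant):
    """List (wild_type, position, mutant_type) for each differing column.

    Positions are 1-based alignment columns. Columns with a gap ('-')
    in either sequence are skipped (an assumption of this sketch).
    """
    if len(wild) != len(mutant):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for pos, (w, m) in enumerate(zip(wild, mutant), start=1):
        if w != m and w != "-" and m != "-":
            mutations.append((w, pos, m))
    return mutations
```

Note that an alignment column is not necessarily the residue number in the structure file, so a renumbering step may be needed before passing positions to a structure-based tool.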
To install AmberTools, fill out the form at http://ambermd.org/GetAmber.php and click Download.
tar -jxvf AmberTools14.tar.bz2
rm AmberTools14.tar.bz2
cd amber14
echo "export AMBERHOME=$yourpath/amber14" >> ~/.bashrc
source ~/.bashrc
sudo apt-get install csh flex gfortran g++ xorg-dev zlib1g-dev libbz2-dev patch python-tk python-matplotlib
./configure gnu
source amber.sh
make install
make test # which will run tests and will report successes or failures
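Because the pipeline scripts depend on paths and third-party tools in your environment, a quick check before running them can save time. The sketch below is a hypothetical helper, not part of RECDS: it reports environment variables that are unset (AMBERHOME by default, as exported above) and executables that are missing from PATH. The executable names you pass in are your own choice, since they differ between installations.

```python
import os


def find_executable(name, path=None):
    """Return the full path of `name` on PATH, or None if not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    for folder in path.split(os.pathsep):
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_environment(executables, env_vars=("AMBERHOME",), environ=None):
    """Return a list of human-readable problems; empty if none found."""
    if environ is None:
        environ = os.environ
    problems = []
    for var in env_vars:
        if not environ.get(var):
            problems.append("environment variable not set: " + var)
    for exe in executables:
        if find_executable(exe) is None:
            problems.append("executable not found on PATH: " + exe)
    return problems
```

For example, `check_environment(["dssp"])` would flag a missing dssp binary, assuming that is the name under which it is installed on your machine.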