Scripts belonging to Team 1 Functional Annotation Group from GATECH BIOL 7210 Computational Genomics course.

Hanuphant/Functional_Annotation_Pipeline

Functional Annotation

Functional annotation is the process of attaching biological information to gene or protein sequences. At the most basic level, the sequence alignment tool BLAST is used to find similar sequences, and genes or proteins are annotated based on those similarities. Annotation describes a gene product through ontologies covering its molecular function, biological role, subcellular location, etc.

The two main approaches used to functionally annotate genes are:

  1. Ab-initio - intrinsic, uses gene/protein characteristics. Ab-initio tools rely on common features or functions shared by proteins. They use HMMs, Support Vector Machines (SVMs), and other ML methods such as neural networks to carry out the annotation.

  2. Homology based - based on comparison with known sequence characteristics. These tools are built on the assumption that homologous proteins have similar functions and shared ancestry. Techniques used include BLAST, alignment sorting, etc.

Final Pipeline

Clustering

Clustering is a data mining approach that groups data components (usually represented as numeric vectors) based on how similar or different they are. Clustering genes by functional keyword association can provide direct information about the nature of genes and their functional association.

Tool used for clustering - USearch

USearch is a sequence analysis tool that claims to be faster than BLAST search algorithms. It uses quick, heuristic methods to look rapidly for good hits. Its UCLUST algorithm is used to cluster sequences because:

  • Clustering the sequences saves computational time.

  • Only the centroid sequence of each cluster is BLASTed against the database, with the cluster radius as the hyperparameter.

Installation and command used:

    wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz

    gunzip usearch11.0.667_i86linux32.gz

    chmod +x usearch11.0.667_i86linux32

    <path of the USearch download> -cluster_fast <input_fasta_file> -id <percent_identity_threshold> -centroids <centroid_fasta> -uc <centroid_index>
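Since only centroids are searched against the databases, each centroid's annotation has to be carried back to the other members of its cluster. The following is a minimal Python sketch of reading the `-uc` output for that purpose. It assumes the standard tab-separated UC layout (record type in the first field, query label in the ninth, target label in the tenth); the function name and the example labels are hypothetical and not part of the pipeline.

```python
from collections import defaultdict


def clusters_from_uc(lines):
    """Map each centroid label to the labels of all sequences in its cluster."""
    clusters = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # Seed record: the query sequence is itself a centroid.
            clusters[query].append(query)
        elif record_type == "H":
            # Hit record: the query joins the cluster of the target centroid.
            clusters[target].append(query)
        # "C" records only summarise clusters, so they are skipped here.
    return dict(clusters)


# Hypothetical UC lines for illustration only.
example = [
    "S\t0\t300\t*\t*\t*\t*\t*\tgeneA\t*",
    "H\t0\t298\t98.7\t+\t0\t0\t298M\tgeneB\tgeneA",
    "S\t1\t250\t*\t*\t*\t*\t*\tgeneC\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tgeneA\t*",
]
clusters = clusters_from_uc(example)
for centroid, members in clusters.items():
    print(centroid, members)
```

With a real run, the lines would come from the file given to `-uc`, for example `clusters_from_uc(open("<centroid_index>"))`.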

Tools based on features to be annotated

Homology based tools

  1. eggNOG-mapper v2:
  • Assumes orthologous groups (OGs) are more useful for annotation

  • Aligns predicted protein sequences (homology) using multiple databases

  • Leverages “pre-computed phylogenies” for each OG to infer relationships

  • Sources multiple ontology sites to annotate

Installation and Commands:

Installation:

conda install -c bioconda eggnog-mapper
export PATH=/home/user/eggnog-mapper:/home/user/eggnog-mapper/eggnogmapper/bin:"$PATH"

download_eggnog_data.py

Command:

emapper.py -i <inputAminoAcid.faa> -o outputFileName --decorate_gff yes

  2. CARD-RGI - Comprehensive Antibiotic Resistance Database–Resistance Gene Identifier
  • “Resistome” prediction

  • Based on homology and SNP models

  • Annotates via Antibiotic Resistance Ontologies (AROs)

Installation:

    git clone https://github.com/arpcard/rgi

    conda env create -f rgi/conda_env.yml

    conda activate rgi

Commands:

Download CARD database

$ wget https://card.mcmaster.ca/latest/data

Extract CARD database

$ tar -xvf data ./card.json

Load CARD database for rgi

$ rgi load --card_json ./card.json

Run rgi on CARD database

$ rgi main -i <assembly_fasta> -o <card_output> -t <contig/protein>
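Once `rgi main` has finished, its tab-delimited report can be summarised per isolate. The sketch below tallies hits by detection category in Python. It assumes the report has a header row with a `Cut_Off` column (the column name, the helper name, and the example rows are assumptions for illustration, so check them against your own `<card_output>` file).

```python
import csv
import io
from collections import Counter


def count_by_cutoff(handle, column="Cut_Off"):
    """Count rows of a tab-delimited RGI report per value of `column`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


# Hypothetical report content for illustration only.
example = io.StringIO(
    "ORF_ID\tCut_Off\tBest_Hit_ARO\n"
    "orf_1\tPerfect\tgeneX\n"
    "orf_2\tStrict\tgeneY\n"
    "orf_3\tStrict\tgeneZ\n"
)
counts = count_by_cutoff(example)
print(dict(counts))
```

For a real report, pass an open file handle instead of the `io.StringIO` example.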

  3. Virulence Factor Database (VFDB)

  • Annotates virulence factors, which are genes that encode proteins that enable/enhance a pathogen’s ability to colonise, proliferate, and cause damage in their host(s).

  • The VFDB is a comprehensive online resource that contains curated information about pathogenic bacteria’s virulence factors.

  • VFDB provides a core dataset (as well as a full dataset) containing the representative, experimentally verified VF genes, available for download at http://www.mgc.ac.cn/VFs/download.htm.

  • Using the core dataset file, one can construct a VFDB database using ncbi-blast’s makeblastdb tool.

  • We used ncbi-blast’s blastn tool to blast the contigs_uniq_consensus.fna files generated by the Gene Prediction group. We set outfmt to 6 so that the output could be converted into a GFF3 file using genomeGTFtools’ blast2gff.py script (available at https://github.com/wrf/genomeGTFtools).

Installation and Commands:

  1. Install the latest version of ncbi-blast (available at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download), unzip the file, and then add the bin folder to your path.

  2. Run the following command using makeblastdb to generate a VFDB database from the core dataset:

gunzip VFDB_setA_nt.fas.gz

makeblastdb -in VFDB_setA_nt.fas -dbtype nucl -out VFDB_database

  3. Run blastn using the generated VFDB database:

blastn -db VFDB_database -query sample_file_path -num_threads 2 -evalue 1e-10 -outfmt 6 -max_target_seqs 1 -out outfile.txt

  4. Convert the BLAST output with blast2gff.py. The outfile.txt is formatted so that the script can generate a GFF3 file. The result does not include the ‘##gff-version 3’ header, so we added it afterwards.

./blast2gff.py -b outfile.txt > outfile.gff
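The tabular file written by blastn can also be inspected directly before conversion. Below is a minimal Python sketch that counts forward and reverse-strand hits in an outfmt 6 file. It assumes the default column order (subject start and end in the ninth and tenth columns) and treats a hit as reverse when the subject start is greater than the subject end; the function name and example rows are hypothetical.

```python
def count_strands(lines):
    """Return (forward, reverse) hit counts from BLAST outfmt 6 lines."""
    forward = reverse = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sstart, send = int(fields[8]), int(fields[9])
        if sstart <= send:
            forward += 1
        else:
            # blastn reports minus-strand alignments with sstart > send.
            reverse += 1
    return forward, reverse


# Hypothetical outfmt 6 rows for illustration only.
example = [
    "contig1\tVF0001\t99.1\t450\t4\t0\t10\t459\t1\t450\t1e-100\t800",
    "contig2\tVF0002\t97.5\t300\t7\t1\t5\t304\t300\t1\t1e-80\t520",
    "contig3\tVF0003\t98.0\t200\t4\t0\t1\t200\t50\t249\t1e-60\t350",
]
forward, reverse = count_strands(example)
print(forward, reverse)
```

For a real sample, the input would be the lines of `outfile.txt`.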

Ab-initio based functional Annotation: TMHMM

  • Designed specifically for the annotation of transmembrane helices (about 90 percent efficiency).

  • Striking advantage of this method: it can model helix length.

  • Algorithm: finds the most probable topology using a hidden Markov model.

  • Pros: Performs better than TOPPRED and ALOM at both topology prediction and discrimination between membrane proteins and soluble proteins. False predictions are almost nil.

  • Cons: The main type of error made by TMHMM is predicting signal peptides as transmembrane helices, which happens in 69% of the G+ve (Gram-positive) bacteria.

Installation and commands

Installation: tar -xvf <source file>

The executables (tmhmm and decodeanhmm) are present in the unpacked bin directory.

Command:

./tmhmm -<output format> <input file> > <output file>
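When TMHMM is run with the short output format, each protein is reported on one tab-separated line of `key=value` fields, which makes it easy to count predicted membrane proteins. The Python sketch below assumes that layout (including a `PredHel` field holding the number of predicted helices); the function names and example lines are hypothetical.

```python
def parse_tmhmm_short(line):
    """Parse one line of TMHMM short output into a dict of its fields."""
    fields = line.rstrip("\n").split("\t")
    record = {"id": fields[0]}
    for item in fields[1:]:
        key, _, value = item.partition("=")
        record[key] = value
    return record


def count_membrane_proteins(lines):
    """Count proteins with at least one predicted transmembrane helix."""
    total = 0
    for line in lines:
        if line.strip() and int(parse_tmhmm_short(line)["PredHel"]) > 0:
            total += 1
    return total


# Hypothetical short-format lines for illustration only.
example = [
    "prot1\tlen=471\tExpAA=159.47\tFirst60=0.01\tPredHel=7\tTopology=o76-98i",
    "prot2\tlen=120\tExpAA=0.02\tFirst60=0.00\tPredHel=0\tTopology=o",
    "prot3\tlen=250\tExpAA=22.10\tFirst60=20.50\tPredHel=1\tTopology=i7-29o",
]
membrane_count = count_membrane_proteins(example)
print(membrane_count)
```

Given the signal peptide caveat noted above, a single predicted helix near the N-terminus (as in the third example line) may deserve a second look against the SignalP results.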

SignalP

SignalP versions 1 through 4 can only predict Sec-translocated SPs cleaved by SPase I.

  • Version 5.0 is based on deep neural networks.

  • SignalP 5.0 distinguishes three types of signal peptides in prokaryotes: Sec substrates cleaved by SPase I (Sec/SPI), Sec substrates cleaved by SPase II (Sec/SPII), and Tat substrates cleaved by SPase I (Tat/SPI).

  • SignalP 5.0 cannot identify Tat substrates cleaved by SPase II.

Installation and Command:

Installation : tar -xvf <source file>

The executable is present in the untarred folder.

Command :

signalp -fasta <input .faa file> -org gram- -format short -prefix <outputdirectory> -gff3
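The GFF3 file produced with `-gff3` can be summarised by feature type to see how many signal peptides of each kind were predicted. The following Python sketch counts the third column of any GFF3 file; the feature type names and sequence IDs in the example are illustrative placeholders, not confirmed SignalP output.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by type (third column), skipping comment lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Hypothetical GFF3 lines for illustration only.
example = [
    "##gff-version 3",
    "prot1\tSignalP-5.0\tsignal_peptide\t1\t22\t0.98\t.\t.\t.",
    "prot2\tSignalP-5.0\tlipoprotein_signal_peptide\t1\t18\t0.91\t.\t.\t.",
    "prot3\tSignalP-5.0\tsignal_peptide\t1\t25\t0.87\t.\t.\t.",
]
feature_counts = count_feature_types(example)
print(dict(feature_counts))
```

Running this over each isolate's GFF3 file and averaging the totals gives a per-isolate prediction count.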

Annotation of non-coding regions: Piler-CR:

  • Piler-CR is software designed to detect and analyse Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR). It can detect CRISPR arrays in a sequence file in seconds with both high sensitivity and high specificity.

  • Piler-CR will give an output of the CRISPR arrays that it detects as well as the number of copies of sequences found in each array.

Installation and Commands:

  1. Piler-CR is available to download at https://www.drive5.com/pilercr/. Download the source code, unzip the file, and then add the bin folder to your path.

  2. We stored Piler-CR’s output into a text file in which we removed the first 75 lines to allow the use of Chris Brown’s CRISPRFileToGFF.pl script to generate a GFF3 file. His script is available at http://bioanalysis.otago.ac.nz/CRISPRgff3/. It doesn’t add a ‘##gff-version 3’ header, however, so that was included afterwards.

Piler-CR Commands:

-   pilercr -in filepath -out outfile.txt
    
-   sed -i 1,75d outfile.txt
    
-   perl ./CRISPRFileToGFF.pl -in outfile.txt -out outfile.gff
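Because the converted file lacks the ‘##gff-version 3’ header, it has to be added as a final step. A minimal Python sketch of that step is shown below; it is one possible way to do it, not necessarily how it was done in this pipeline, and the function name and example feature line are hypothetical.

```python
GFF_HEADER = "##gff-version 3"


def with_gff_header(text):
    """Return GFF text that starts with the version header, adding it once."""
    if text.startswith(GFF_HEADER):
        return text
    return GFF_HEADER + "\n" + text


# Hypothetical converter output for illustration only.
body = "contig1\tpilercr\trepeat_region\t100\t500\t.\t.\t.\tID=array1\n"
fixed = with_gff_header(body)
print(fixed, end="")
```

The check at the top makes the step safe to rerun, since a file that already has the header is returned unchanged. The same helper applies to the GFF file produced from the VFDB BLAST results.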

Results and graphs

  1. Clustering

  2. eggNOG

  3. CARD-RGI

  4. VFDB

For each sample, there was an average of ~73 forward hits and ~78 reverse hits (~150 total).

  5. SignalP

The average number of predictions made by SignalP is around 330 across all 50 isolates.

Sample GFF output file:

  6. TMHMM

The average number of membrane protein annotations across the 50 isolates is around 3250.

  7. Piler-CR

There were 2 arrays found in each sample, ranging from 10-13 copies in Array 1 and 8-13 copies in Array 2.

References:

  • Kalvari I, Nawrocki EP, Ontiveros-Palacios N, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 2021;49(D1):D192-D200. doi:10.1093/nar/gkaa1047

  • Nawrocki EP. Annotating functional RNAs in genomes using Infernal. Methods Mol Biol. 2014;1097:163-197. doi:10.1007/978-1-62703-709-9_9

  • Stav, S., Atilho, R.M., Mirihana Arachchilage, G. et al. Genome-wide discovery of structured noncoding RNAs in bacteria. BMC Microbiol 19, 66 (2019). https://doi.org/10.1186/s12866-019-1433-7

  • Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32(1):11-16. Published 2004 Jan 2. doi:10.1093/nar/gkh152.

  • Bland C, Ramsey TL, Sabree F, et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics. 2007;8:209. Published 2007 Jun 18. doi:10.1186/1471-2105-8-209.

  • Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100-3108. doi:10.1093/nar/gkm160.

  • Carnielli, C. M., Winck, F. V., & Paes Leme, A. F. (2015). Functional annotation and biological interpretation of proteomics data. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1854(1), 46-54. https://doi.org/10.1016/j.bbapap.2014.10.019

  • Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Molecular Biology and Evolution, 38(12), 5825-5829. https://doi.org/10.1093/molbev/msab293

  • Lal Gupta, C., Kumar Tiwari, R., & Cytryn, E. (2020). Platforms for elucidating antibiotic resistance in single genomes and complex metagenomes. Environment International, 138, 105667. https://doi.org/10.1016/j.envint.2020.105667

  • Alcock, B. P., Raphenya, A. R., Lau, T. T. Y., Tsang, K. K., Bouchard, M., Edalatmand, A., Huynh, W., Nguyen, A.-L. V., Cheng, A. A., Liu, S., Min, S. Y., Miroshnichenko, A., Tran, H.-K., Werfalli, R. E., Nasir, J. A., Oloni, M., Speicher, D. J., Florescu, A., Singh, B., . . . McArthur, A. G. (2019). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, 48(D1), D517-D525. https://doi.org/10.1093/nar/gkz935

  • Liu, B., Zheng, D., Zhou, S., Chen, L., & Yang, J. (2021). VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Research, 50(D1), D912-D917. https://doi.org/10.1093/nar/gkab1107

  • Almagro Armenteros, J.J., Tsirigos, K.D., Sønderby, C.K. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol 37, 420–423 (2019)

  • Teufel, F., Almagro Armenteros, J.J., Johansen, A.R. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol (2022).

  • Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80

  • Lukas Käll, Anders Krogh, Erik L.L Sonnhammer, A Combined Transmembrane Topology and Signal Peptide Prediction Method, Journal of Molecular Biology, Volume 338, Issue 5, 2004, Pages 1027-1036, ISSN 0022-2836, https://doi.org/10.1016/j.jmb.2004.03.016.

  • Timothy Nugent, David T Jones. Transmembrane protein topology prediction using support vector machines, BMC Bioinformatics. 2009; 10: 159. doi: 10.1186/1471-2105-10-159

  • Edgar, R.C. PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8, 18 (2007). https://doi.org/10.1186/1471-2105-8-18.

  • Bo Liu, Dandan Zheng, Qi Jin, Lihong Chen, Jian Yang, VFDB 2019: a comparative pathogenomic platform with an interactive web interface, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D687–D692, https://doi.org/10.1093/nar/gky1080

  • National Center for Biotechnology Information (NCBI)[Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – [cited 2022 Mar 31]. Available from: https://www.ncbi.nlm.nih.gov/.

  • Mills DB, Francis WR, Vargas S, et al. The last common ancestor of animals lacked the HIF pathway and respired in low-oxygen environments. Elife. 2018;7:e31176. Published 2018 Feb 6. doi:10.7554/eLife.31176.

  • Brown, Chris (University of Otago). March 30, 2022. “CRISPRFileToGFF.pl.” Perl. http://bioanalysis.otago.ac.nz/CRISPRgff3/.
