Scripts belonging to Team 1 Functional Annotation Group from GATECH BIOL 7210 Computational Genomics course.


Functional Annotation

Functional annotation is the process of attaching biological information to gene or protein sequences. At its most basic, a sequence alignment tool such as BLAST is used to find similar sequences, and genes or proteins are annotated based on those similarities. The annotation describes a gene product in ontology terms: its molecular function, biological role, subcellular location, etc.

The two main approaches used to functionally annotate genes are:

  1. Ab-initio - intrinsic, uses gene/protein characteristics. Ab-initio tools rely on common features and functions shared by proteins. They use HMMs, Support Vector Machines (SVMs), and other ML methods such as neural networks to carry out the annotation.

  2. Homology based - based on comparison with known sequence characteristics. These tools are built on the assumption that homologous proteins have similar functions and shared ancestry. Techniques used include BLAST, alignment sorting, etc.

Final Pipeline


Clustering is a data mining approach that groups data components (usually represented as numeric vectors) based on how similar or different they are. Clustering genes by functional keyword association can provide direct information about the nature of genes and their functional association.

Tool used for clustering: USearch

USearch is a sequence analysis tool that claims to be faster than BLAST search algorithms. It uses fast heuristic methods to look for good hits rapidly. Its UCLUST algorithm is used to cluster sequences because:

  • Clustering the sequences first saves computational time.

  • Only the centroid sequence of each cluster is BLASTed against the database; the cluster radius (the identity threshold) is a hyperparameter.

Installation and command used:

    wget <USearch download URL>

    tar -xvzpf <downloaded USearch archive>

    <path of the USearch download> -cluster_fast <input_fasta_file> -id <percent_identity_threshold> -centroids <centroid_fasta> -uc <centroid_index>
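
To check how much the clustering reduced the input, the file written by -uc can be summarised. The sketch below is only an illustration and not part of our pipeline: the function name uc_summary is hypothetical, and it assumes the tab-separated UC layout in which the first column is the record type (S for a centroid, H for a sequence assigned to an existing cluster).

```shell
# Hypothetical helper: summarise a USearch .uc file.
# Assumes tab-separated UC records whose first column is the record type:
#   S = centroid (one per cluster), H = sequence assigned to a cluster.
uc_summary() {
  awk -F '\t' '
    $1 == "S" { clusters++ }
    $1 == "H" { members++ }
    END { printf "clusters=%d members=%d\n", clusters, members }
  ' "$1"
}
```

It would be called as `uc_summary <centroid_index>`; the number of clusters is also the number of centroid sequences that still need to be annotated.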

Tools based on features to be annotated

Homology based tools

  1. eggNOG-mapper v2:
  • Assumes that orthologous groups (OGs) are more useful for annotation

  • Aligns predicted protein sequences (homology) against multiple databases

  • Leverages “pre-computed phylogenies” for each OG to infer relationships

  • Draws on multiple ontology sources to annotate

Installation and Commands:


conda install -c bioconda eggnog-mapper
export PATH=/home/user/eggnog-mapper:/home/user/eggnog-mapper/eggnogmapper/bin:"$PATH"

Command: emapper.py -i <inputAminoAcid.faa> -o outputFileName --decorate_gff yes

  2. CARD-RGI - Comprehensive Antibiotic Resistance Database, Resistance Gene Identifier
  • “Resistome” prediction

  • Based on homology and SNP models

  • Annotates via Antibiotic Resistance Ontology (ARO) terms

Installation:

    git clone <RGI repository URL>

    conda env create -f rgi/conda_env.yml

    conda activate rgi


Download CARD database

$ wget <CARD data download URL>

Extract CARD database

$ tar -xvf data ./card.json

Load CARD database for rgi

$ rgi load --card_json ./card.json

Run rgi on CARD database

$ rgi main -i <assembly_fasta> -o <card_output> -t <contig/protein>

Virulence Factor Database (VFDB)

  • Annotates virulence factors, which are genes that encode proteins that enable/enhance a pathogen’s ability to colonise, proliferate, and cause damage in their host(s).

  • The VFDB is a comprehensive online resource that contains curated information about pathogenic bacteria’s virulence factors.

  • The VFDB provides a core dataset (as well as a full dataset) for download; the core dataset contains the representative, experimentally verified VF genes.

  • Using the core dataset file, one can construct a VFDB database using ncbi-blast’s makeblastdb tool.

  • We used ncbi-blast’s blastn tool to BLAST the contigs_uniq_consensus.fna files generated by the Gene Prediction group. We set outfmt to 6 (tabular) so that the results could be converted into a GFF3 file using genomeGFTools’ script.

Installation and Commands:

  1. Install the latest version of ncbi-blast: download it, unzip the file, and then add the bin folder to your path.

  2. Run the following command using makeblastdb to generate a VFDB database from the core dataset:

gunzip VFDB_setA_nt.fas.gz
makeblastdb -in VFDB_setA_nt.fas -dbtype nucl -out VFDB_database

  3. Run blastn using the generated VFDB database:

blastn -db VFDB_database -query sample_file_path -num_threads 2 -evalue 1e-10 -outfmt 6 -max_target_seqs 1 -out outfile.txt

  4. Convert outfile.txt to GFF3. The outfile.txt is formatted in such a way that the script is able to generate a GFF3 formatted file. The script does not write the ‘##gff-version 3’ header, however, so it was included afterwards.

./ outfile.txt -b outfile.gff
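
Adding the missing header by hand is easy to forget when there are many samples. As a minimal sketch of that step, the function below prepends the version line only when it is not already there. The name add_gff3_header is hypothetical, and this is one possible approach rather than the exact commands we ran.

```shell
# Hypothetical helper: make sure a GFF file starts with the GFF3 version line.
add_gff3_header() {
  gff="$1"
  # Do nothing if the header is already the first line.
  if [ "$(head -n 1 "$gff")" = "##gff-version 3" ]; then
    return 0
  fi
  # Write header + original content to a temporary file, then replace the original.
  { printf '##gff-version 3\n'; cat "$gff"; } > "$gff.tmp" && mv "$gff.tmp" "$gff"
}
```

For example, `add_gff3_header outfile.gff`. Because the function checks the first line before writing, running it twice on the same file does not duplicate the header.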

Ab-initio based functional Annotation: TMHMM

  • Designed specifically for the annotation of transmembrane helices (about 90 percent efficiency).

  • Striking advantage of this method: it can model helix length.

  • Algorithm: finds the most probable topology using a hidden Markov model.

  • Pros: Performs better than TOPPRED and ALOM at both topology prediction and discrimination between membrane proteins and soluble proteins. The rate of false predictions is close to zero.

  • Cons: The main type of error made by TMHMM is to predict signal peptides as transmembrane peptides (in 69% of the Gram-positive bacteria).

Installation and commands

Installation: tar -xvf <source file>

The executables (tmhmm and decodeanhmm) are in the unpacked bin directory.


./tmhmm -<output format> <input file> > <output file>
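
A quick way to count how many proteins TMHMM flagged is to look for a non-zero helix count in its output. The sketch below is an assumption-laden illustration: it assumes TMHMM was run with the short output format, where each protein is one line containing a PredHel=<n> field, and the function name count_tm_proteins is hypothetical.

```shell
# Hypothetical helper: count proteins with at least one predicted TM helix.
# Assumes TMHMM short-format output, one protein per line with a PredHel=<n> field.
count_tm_proteins() {
  awk '
    match($0, /PredHel=[0-9]+/) {
      # Skip the 8-character "PredHel=" prefix to get the helix count.
      n = substr($0, RSTART + 8, RLENGTH - 8) + 0
      if (n > 0) hits++
    }
    END { print hits + 0 }
  ' "$1"
}
```

Usage would be `count_tm_proteins <output file>`, which prints a single number for that isolate.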


SignalP versions 1 through 4 can only predict Sec-translocated signal peptides (SPs) cleaved by signal peptidase I (SPase I).

  • Version 5.0 is based on deep neural networks.

  • SignalP 5.0 distinguishes three types of signal peptides in prokaryotes: Sec substrates cleaved by SPase I (Sec/SPI), Sec substrates cleaved by SPase II (Sec/SPII), and Tat substrates cleaved by SPase I (Tat/SPI).

  • SignalP 5.0 cannot identify Tat substrates cleaved by SPase II.

Installation and Command:

Installation : tar -xvf <source file>

The executable is present in the untarred folder.

Command :

signalp -fasta <input .faa file> -org gram- -format short -prefix <outputdirectory> -gff3
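
To see how the predictions split across the three signal peptide types, the short-format summary can be tallied per class. This is a sketch under stated assumptions, not our exact procedure: it assumes the summary is tab-separated, that lines starting with "#" are headers, and that the second column holds the predicted class with OTHER meaning no signal peptide. The function name count_signal_peptides is hypothetical.

```shell
# Hypothetical helper: tally SignalP predictions by signal peptide type.
# Assumes a tab-separated short-format summary: "#" header lines, then
# one protein per line with the predicted class in column 2 (OTHER = no SP).
count_signal_peptides() {
  awk -F '\t' '
    /^#/ { next }
    $2 != "OTHER" && $2 != "" { count[$2]++ }
    END { for (cls in count) printf "%s\t%d\n", cls, count[cls] }
  ' "$1" | sort
}
```

Called on the summary file that SignalP writes for a sample, it prints one line per predicted class with its count, sorted by class name.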

Annotation of non-coding regions: Piler-CR:

  • Piler-CR is software designed to detect and analyse Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs). It can detect CRISPR arrays in a sequence file in seconds with both high sensitivity and high specificity.

  • Piler-CR will give an output of the CRISPR arrays that it detects as well as the number of copies of sequences found in each array.

Installation and Commands:

  1. Download the Piler-CR source code, unzip the file, and then add the bin folder to your path.

  2. We stored Piler-CR’s output in a text file and removed its first 75 lines so that Chris Brown’s script could generate a GFF3 file from it. The script doesn’t add a ‘##gff-version 3’ header, however, so that was included afterwards.

Piler-CR Commands:

-   pilercr -in filepath -out outfile.txt
-   sed -i 1,75d outfile.txt
-   perl ./ -in outfile.txt -out outfile.gff

Results and graphs

  1. Clustering

  2. eggNOG


  3. VFDB

For each sample, there was an average of ~73 forward hits and ~78 reverse hits (~150 total).

  4. SignalP

SignalP made around 330 predictions per isolate on average across all 50 isolates.

Sample GFF output file:

  5. TMHMM

TMHMM produced around 3250 membrane protein annotations per isolate on average across the 50 isolates.

  6. Piler-CR

Two arrays were found in each sample, with 10-13 copies in Array 1 and 8-13 copies in Array 2.


