GitHub - omics-lab/VirusTaxo: VirusTaxo: Taxonomic classification of viruses from sequence

VirusTaxo: Taxonomic classification of viruses from metagenomic contigs

1. Running VirusTaxo

Requirements

python >= 3.8
Linux

Installation

Clone the repository

git clone https://github.com/omics-lab/VirusTaxo

Create Python Virtual Environment

cd VirusTaxo
python3 -m venv environment
source ./environment/bin/activate

Install Python Packages

pip install -r requirements.txt

2. Predict virus taxonomy using prebuilt database

Step-1: Download the latest prebuilt database of VirusTaxo

gdown "https://drive.google.com/uc?id=1UWwtBZSmVNeuqGE9_u6RVTt1y2U0HLM3"

# Extract db files
tar -xvzf database.v3_2024.tar.gz

Database files:

Database file	Count	Description
Family_database.pkl	242 Family	k-mer database for Family level prediction
Genus_database.pkl	1933 Genus	k-mer database for Genus level prediction
Species_database.pkl	8528 Species	k-mer database for Species level prediction
sequences.fasta	12613 genome	Complete genome sequences used to build database
metadata.csv	12613 accession	Metadata associated with the dataset used to build database
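
Before running a prediction, it can help to confirm that the extracted directory holds the three pickle files listed above. The following is a minimal sketch, not part of VirusTaxo; the helper name `missing_db_files` is hypothetical, and only the file names come from the table.

```python
from pathlib import Path

# File names as listed in the database table above.
REQUIRED_DB_FILES = (
    "Family_database.pkl",
    "Genus_database.pkl",
    "Species_database.pkl",
)


def missing_db_files(database_path):
    """Return the required database files that are absent from database_path."""
    root = Path(database_path)
    return [name for name in REQUIRED_DB_FILES if not (root / name).is_file()]
```

An empty list from `missing_db_files("/PathToDatabase/")` suggests the directory is ready to pass as `--database_path`.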

Step-2: Assemble the metagenomic contigs from your metavirome or metagenomic library

Perform de novo assembly using MEGAHIT:

# paired-end; -o names the output directory, and MEGAHIT writes the contigs to megahit_out/final.contigs.fa
megahit -1 file_R1.fq -2 file_R2.fq --min-contig-len 500 -o megahit_out

Step-3: Predict taxonomy using `predict.py`

Usage

python3 predict.py -h

usage: predict.py [-h] --database_path DATABASE_PATH --seq SEQ [--output_csv OUTPUT_CSV]
               [--entropy ENTROPY] [--enrichment ENRICHMENT]
               [--enrichment_spp ENRICHMENT_SPP]

options:
  -h, --help            show this help message and exit
  --database_path DATABASE_PATH
                        Absolute or relative path containing three database files
                        (Family_database.pkl, Genus_database.pkl and Species_database.pkl)
  --seq SEQ             Absolute or relative path of the input fasta sequence file
  --output_csv OUTPUT_CSV
                        Path to save the output CSV file (default:
                        VirusTaxo_predictions.csv)
  --entropy ENTROPY     Entropy threshold; entropy range is [0-1] (default: 0.5)
  --enrichment ENRICHMENT
                        Enrichment score threshold for Genus and Family; enrichment range
                        is [0-1] (default: 0.05)
  --enrichment_spp ENRICHMENT_SPP
                        Enrichment score threshold for Species; enrichment range is [0-1]
                        (default: 0.8)

Run with an example fasta file

# --database_path: directory containing the three database files
# --seq: query fasta file
python3 predict.py \
   --database_path /PathToDatabase/ \
   --seq test.fasta
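
If predictions are launched from a Python pipeline rather than a shell, the same call can be assembled as an argument list. This is a sketch under the assumption that it runs from the VirusTaxo directory; the function name `build_predict_command` is hypothetical, and only the flags shown in the help text above are used.

```python
import subprocess
import sys


def build_predict_command(database_path, seq, output_csv=None,
                          entropy=None, enrichment=None, enrichment_spp=None):
    """Assemble the predict.py argument list from the documented options."""
    cmd = [sys.executable, "predict.py",
           "--database_path", str(database_path),
           "--seq", str(seq)]
    optional = {
        "--output_csv": output_csv,
        "--entropy": entropy,
        "--enrichment": enrichment,
        "--enrichment_spp": enrichment_spp,
    }
    for flag, value in optional.items():
        if value is not None:  # unset options keep predict.py's defaults
            cmd += [flag, str(value)]
    return cmd


command = build_predict_command("/PathToDatabase/", "test.fasta", entropy=0.3)
print(" ".join(command))
# To actually run the prediction:
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.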

3. Interpretation of output

Example output

Accession    Query_Seq_Length  Family           Family_Entropy  Family_Enrichment  Genus           Genus_Entropy  Genus_Enrichment  Species                   Species_Entropy  Species_Enrichment  Valid
AC_000001.1  33034             Adenoviridae     0.0             0.786              Mastadenovirus  0.0            0.782             Ovine mastadenovirus A    0.0              0.745               Yes
AC_000002.1  34446             Adenoviridae     0.0             0.845              Mastadenovirus  0.0            0.842             Bovine mastadenovirus B   0.0              0.811               Yes
AC_000011.1  36519             Adenoviridae     -0.0            0.57               Mastadenovirus  -0.0           0.563             Human mastadenovirus E    -0.0             0.353               Yes
AC_000189.1  34094             Adenoviridae     -0.0            0.81               Mastadenovirus  -0.0           0.803             Porcine mastadenovirus A  -0.0             0.742               Yes
NC_000852.5  330611            Phycodnaviridae  -0.0            0.121              Chlorovirus     -0.0           0.115             Unclassified              -0.0             0.014               Yes
NC_000855.1  11158             Unclassified     0.0             0.011              Unclassified    0.0            0.006             Unclassified              0.09             0.002               Yes
NC_000867.1  10079             Unclassified     -0.0            0.018              Unclassified    -0.0           0.01              Unclassified              0.089            0.002               Yes
NC_000899.1  45063             Adenoviridae     0.0             0.85               Aviadenovirus   0.0            0.844             Fowl aviadenovirus D      0.0              0.772               Yes
NC_000939.2  4415              Tombusviridae    -0.0            0.061              Aureusvirus     -0.0           0.055             Aureusvirus dioscoreae    -0.0             0.053               Yes
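
Output like the example above can be post-processed with the standard `csv` module. The sketch below assumes the output file is comma-separated and uses the column names shown in the example header; the function name `classified_rows` is hypothetical. It keeps rows whose Valid column is Yes and that carry a label at the chosen rank.

```python
import csv
import io


def classified_rows(handle, rank="Family"):
    """Return rows marked Valid == 'Yes' that have a label at the given rank."""
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if row["Valid"].strip() == "Yes" and row[rank].strip() != "Unclassified"
    ]


# Three rows abridged from the example output (fewer columns for brevity).
sample = io.StringIO(
    "Accession,Family,Genus,Genus_Enrichment,Valid\n"
    "AC_000001.1,Adenoviridae,Mastadenovirus,0.782,Yes\n"
    "NC_000852.5,Phycodnaviridae,Chlorovirus,0.115,Yes\n"
    "NC_000855.1,Unclassified,Unclassified,0.006,Yes\n"
)
for row in classified_rows(sample, rank="Genus"):
    print(row["Accession"], row["Genus"], row["Genus_Enrichment"])
```

For a real run, replace `sample` with an open file handle, for example `open("VirusTaxo_predictions.csv", newline="")`.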

In the output file
- Unclassified: the Entropy is above its cutoff (default 0.5) or the Enrichment is below its cutoff (default 0.05 for Family and Genus; default 0.80 for Species prediction).
- Lower Entropy (such as <= 0.5) indicates higher prediction certainty. You can decrease the Entropy cutoff for more stringent predictions.
- Higher Enrichment_Score (such as >= 0.8) indicates higher prediction certainty. You can increase the Enrichment_Score cutoff for more stringent predictions. Enrichment_Score is the total number of k-mers mapped to the genus (or to the family or species at the other ranks) divided by the total number of k-mers in the query sequence.
- The Valid column shows Yes if the prediction aligns with known taxonomic ranks; otherwise, it shows No. In rare cases a prediction conflicts with the known taxonomic ranks. It is generally recommended to exclude rows marked as No unless you have verified the taxonomic assignment and are confident in its accuracy.
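
The two scores described above can be illustrated with a toy example. This is a sketch, not VirusTaxo's implementation: the `kmer_to_taxon` dictionary is a hypothetical stand-in for the pickle databases, and the entropy formula (Shannon entropy of the k-mer hits across taxa, scaled to [0, 1]) is an assumption, since the text only states the range. Only the enrichment ratio follows the definition given in the list.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def score_query(seq, kmer_to_taxon, k):
    """Return (best_taxon, enrichment, entropy) for one query sequence.

    kmer_to_taxon is a hypothetical dict mapping a database k-mer to one taxon.
    """
    total = max(len(seq) - k + 1, 0)
    hits = Counter(kmer_to_taxon[km] for km in kmers(seq, k) if km in kmer_to_taxon)
    if total == 0 or not hits:
        return None, 0.0, 0.0
    best, best_count = hits.most_common(1)[0]
    # Enrichment: k-mers mapped to the best taxon / all k-mers in the query.
    enrichment = best_count / total
    mapped = sum(hits.values())
    # Assumed definition: Shannon entropy of the hits, normalized to [0, 1].
    entropy = -sum((c / mapped) * math.log(c / mapped) for c in hits.values())
    if len(hits) > 1:
        entropy /= math.log(len(hits))
    return best, enrichment, entropy


def classify(seq, kmer_to_taxon, k=16, entropy_cutoff=0.5, enrichment_cutoff=0.05):
    """Apply the documented default cutoffs to one query."""
    taxon, enrichment, entropy = score_query(seq, kmer_to_taxon, k)
    if taxon is None or entropy > entropy_cutoff or enrichment < enrichment_cutoff:
        return "Unclassified"
    return taxon


# Toy database with k = 4; real databases use longer k-mers such as 16.
toy_db = {"ACGT": "GenusA", "CGTA": "GenusA", "GTAC": "GenusB"}
print(score_query("ACGTACGT", toy_db, 4))
print(classify("ACGTACGT", toy_db, k=4))
```

In the toy query, hits are split between two taxa, so the assumed entropy is high and the sequence is left Unclassified despite an enrichment well above 0.05. This mirrors how the two cutoffs act together.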

4. Prediction accuracy of VirusTaxo

To check accuracy, 12,613 complete virus genomes were used. In 5-fold cross-validation, 80% of the sequences were randomly chosen to create the database, and the other 20% were used to calculate the accuracy shown in the table below:

Taxonomic Rank	Accuracy	Unclassified	Enrichment cutoff	Entropy cutoff	k-mer
Family	99.99%	53%	>=0.05	<=0.50	16
Genus	97.26%	57%	>=0.05	<=0.50	16
Species	85.32%	98.5%	>=0.80	<=0.50	16
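
The 80%/20% partitioning used in 5-fold cross-validation can be sketched as follows. The text does not describe the authors' exact splitting procedure, so the shuffling, the seed and the function name `five_fold_splits` are assumptions made for illustration.

```python
import random


def five_fold_splits(accessions, n_folds=5, seed=0):
    """Yield (build, held_out) accession lists, one pair per fold."""
    shuffled = list(accessions)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled accessions into n_folds roughly equal groups.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        build = [acc for j, fold in enumerate(folds) if j != i for acc in fold]
        yield build, held_out


for build, held_out in five_fold_splits([f"acc{i}" for i in range(10)]):
    print(len(build), "for the database,", len(held_out), "held out")
```

Each accession is held out exactly once, so every genome contributes to the accuracy estimate without ever being scored against a database that contains it.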

5. Build custom database

Prepare a metadata file in csv format. The metadata file must contain columns named Accession, Family, Genus and Species. Example of metadata file is here:
The sequence Accession must match the metadata Accession. Example of input fasta file is here
The metadata.csv must be within the database directory during prediction.
Building database:

# --meta: provide your metadata file
# --seq: provide your fasta file
python3 build.py \
   --meta ./Dataset/metadata.csv \
   --seq ./Dataset/seq1k.fasta \
   --k 16 \
   --saving_path /path/

Parameters
- meta: Absolute or relative path of metadata file.
- seq: Absolute or relative path of fasta sequence file.
- k: The length of k-mer.
- saving_path: Path to save database pickle files for Family, Genus and Species.
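
The metadata requirements above can be checked before building. The sketch below is not part of VirusTaxo; the function names are hypothetical, and it assumes the accession is the first whitespace-delimited token of each FASTA header line.

```python
import csv

# Column names required by the text above.
REQUIRED_COLUMNS = ("Accession", "Family", "Genus", "Species")


def fasta_accessions(lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def check_inputs(meta_lines, fasta_lines):
    """Return (missing_columns, fasta_accessions_without_metadata)."""
    reader = csv.DictReader(meta_lines)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return missing, set()
    meta_accessions = {row["Accession"].strip() for row in reader}
    return [], fasta_accessions(fasta_lines) - meta_accessions
```

Both arguments can be open file handles, for example `check_inputs(open("./Dataset/metadata.csv", newline=""), open("./Dataset/seq1k.fasta"))`. Two empty results suggest the inputs are consistent.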

6. Method limitation and interpretation

VirusTaxo's database is built on known virus genomes and designed to predict taxonomy of virus sequences.
Non-viral sequences may be misclassified as viral due to random k-mer matches in VirusTaxo predictions. To minimize the likelihood of such misclassifications, it is recommended to apply a higher Enrichment cutoff. This helps ensure that only sequences with stronger evidence of being viral are retained. Additionally,
- Filter out sequences by mapping to host reference genomes before using VirusTaxo. This helps remove host-derived sequences, improving the accuracy of viral predictions and reducing potential false positives.
- If your sample contains non-viral sequences, it is recommended to filter them out by using tools like BLAST or DeepVirFinder, for example `dvf.py -i contig.fasta -o ./`.

7. Version history

VirusTaxo	Database	Data Date	Sequences	Download
v2 Family, Genus, Species prediction	database.v3_2024	Jan21_2024	DNA=9384 & RNA=9067	here
v1 Genus prediction	database.v2_2024	Jan21_2024	DNA=9384 & RNA=9067	here
v1 Genus prediction	database.v1_2022	Apr27_2022	DNA=4421 & RNA=2529	here
Used in manuscript	database.v1_2022	Apr27_2022	DNA=4421 & RNA=2529	here

8. Contact

Rashedul Islam, PhD ([email protected])

9. Citation

Rajan Saha Raju, Abdullah Al Nahid, Preonath Chondrow Dev, Rashedul Islam. VirusTaxo: Taxonomic classification of viruses from the genome sequence using k-mer enrichment. Genomics, Volume 114, Issue 4, July 2022.
Rashedul Islam, Rajan Saha Raju, Nazia Tasnim, Istiak Hossain Shihab, Maruf Ahmed Bhuiyan, Yusha Araf, Tofazzal Islam. Choice of assemblers has a critical impact on de novo assembly of SARS-CoV-2 genome and characterizing variants. Briefings in Bioinformatics, Volume 22, Issue 5, bbab102, September 2021.

10. License

This project is licensed under the MIT license.
