- python >= 3.8
- Linux
- Clone the repository
git clone https://github.com/omics-lab/VirusTaxo
- Create Python Virtual Environment
cd VirusTaxo
python3 -m venv environment
source ./environment/bin/activate
- Install Python Packages
pip install -r requirements.txt
Step-1: Download the latest prebuilt database of VirusTaxo
gdown "https://drive.google.com/uc?id=1UWwtBZSmVNeuqGE9_u6RVTt1y2U0HLM3"
# Extract db files
tar -xvzf database.v3_2024.tar.gz
- Database files:
Database file | Count | Description |
---|---|---|
Family_database.pkl | 242 Family | k-mer database for Family level prediction |
Genus_database.pkl | 1933 Genus | k-mer database for Genus level prediction |
Species_database.pkl | 8528 Species | k-mer database for Species level prediction |
sequences.fasta | 12613 genome | Complete genome sequences used to build database |
metadata.csv | 12613 accession | Metadata associated with the dataset used to build database |
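Before running a prediction, it can help to confirm that the extracted directory really contains the three pickle files listed above. The following is a minimal sketch; the helper name `missing_database_files` and the directory `./database.v3_2024` are assumptions for illustration, not part of VirusTaxo.

```python
from pathlib import Path

# File names taken from the table above; the helper itself is hypothetical.
REQUIRED_DB_FILES = (
    "Family_database.pkl",
    "Genus_database.pkl",
    "Species_database.pkl",
)


def missing_database_files(database_path):
    """Return the required database files that are absent from database_path."""
    directory = Path(database_path)
    return [name for name in REQUIRED_DB_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    # Assumed extraction directory; adjust to wherever the archive was unpacked.
    missing = missing_database_files("./database.v3_2024")
    print("Missing files:", missing or "none")
```

An empty list means the directory can be passed to `--database_path`.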
- Perform de novo assembly using MEGAHIT:
# paired-end (MEGAHIT treats -o as an output directory; the contigs are written to final.contigs.fa inside it)
megahit -1 file_R1.fq -2 file_R2.fq --min-contig-len 500 -o contig.fasta
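To sanity-check an assembly before classification, the contig lengths can be inspected against the 500 bp minimum used in the command above. This is a small sketch with a hand-written FASTA reader; the function names are hypothetical and the record ID is assumed to be the first word of each header line.

```python
def read_fasta_lengths(lines):
    """Map each FASTA record ID (first word of the header) to its sequence length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def contigs_at_least(lengths, min_len=500):
    """Return the IDs of contigs whose length is at least min_len."""
    return [name for name, n in lengths.items() if n >= min_len]


# In-memory example; a real run would pass an open contig file instead.
example = [">contig_1 flag=1", "ACGT" * 200, ">contig_2", "ACGT" * 50]
print(contigs_at_least(read_fasta_lengths(example)))
```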
- Usage
python3 predict.py -h
usage: predict.py [-h] --database_path DATABASE_PATH --seq SEQ [--output_csv OUTPUT_CSV]
[--entropy ENTROPY] [--enrichment ENRICHMENT]
[--enrichment_spp ENRICHMENT_SPP]
options:
-h, --help show this help message and exit
--database_path DATABASE_PATH
Absolute or relative path containing three database files
(Family_database.pkl, Genus_database.pkl and Species_database.pkl)
--seq SEQ Absolute or relative path of the input fasta sequence file
--output_csv OUTPUT_CSV
Path to save the output CSV file (default:
VirusTaxo_predictions.csv)
--entropy ENTROPY Entropy threshold; entropy range is [0-1] (default: 0.5)
--enrichment ENRICHMENT
Enrichment score threshold for Genus and Family; enrichment range
is [0-1] (default: 0.05)
--enrichment_spp ENRICHMENT_SPP
Enrichment score threshold for Species; enrichment range is [0-1]
(default: 0.8)
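The help text above can be summarized as an `argparse` definition. The sketch below is a reconstruction from the help output only, not the actual code in `predict.py`; it is meant to make the required arguments and the default thresholds easy to see.

```python
import argparse


def make_parser():
    """Hypothetical reconstruction of the predict.py interface from its help text."""
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("--database_path", required=True,
                        help="Path containing the three database .pkl files")
    parser.add_argument("--seq", required=True,
                        help="Path of the input fasta sequence file")
    parser.add_argument("--output_csv", default="VirusTaxo_predictions.csv",
                        help="Path to save the output CSV file")
    parser.add_argument("--entropy", type=float, default=0.5,
                        help="Entropy threshold, range 0 to 1")
    parser.add_argument("--enrichment", type=float, default=0.05,
                        help="Enrichment threshold for Genus and Family")
    parser.add_argument("--enrichment_spp", type=float, default=0.8,
                        help="Enrichment threshold for Species")
    return parser


args = make_parser().parse_args(["--database_path", "db", "--seq", "test.fasta"])
print(args.entropy, args.enrichment, args.enrichment_spp)
```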
- Run with an example fasta file
# --database_path: directory containing the database files
# --seq: query fasta file
python3 predict.py \
--database_path /PathToDatabase/ \
--seq test.fasta
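When many samples need to be classified, the same command can be driven from Python. This is a sketch under the assumption that it runs from the repository root, where `predict.py` lives; the wrapper functions are hypothetical, while the flags are the ones listed in the help text.

```python
import subprocess
import sys


def build_predict_command(database_path, seq, output_csv="VirusTaxo_predictions.csv",
                          entropy=0.5, enrichment=0.05, enrichment_spp=0.8):
    """Assemble the predict.py argument list using the flags from the help text."""
    return [
        sys.executable, "predict.py",
        "--database_path", str(database_path),
        "--seq", str(seq),
        "--output_csv", str(output_csv),
        "--entropy", str(entropy),
        "--enrichment", str(enrichment),
        "--enrichment_spp", str(enrichment_spp),
    ]


def run_prediction(**kwargs):
    """Run predict.py; raises subprocess.CalledProcessError if it exits with an error."""
    subprocess.run(build_predict_command(**kwargs), check=True)


# Only prints the command; call run_prediction(...) to execute it.
print(" ".join(build_predict_command("/PathToDatabase/", "test.fasta")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.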
- Example output
Accession | Query_Seq_Length | Family | Family_Entropy | Family_Enrichment | Genus | Genus_Entropy | Genus_Enrichment | Species | Species_Entropy | Species_Enrichment | Valid |
---|---|---|---|---|---|---|---|---|---|---|---|
AC_000001.1 | 33034 | Adenoviridae | 0.0 | 0.786 | Mastadenovirus | 0.0 | 0.782 | Ovine mastadenovirus A | 0.0 | 0.745 | Yes |
AC_000002.1 | 34446 | Adenoviridae | 0.0 | 0.845 | Mastadenovirus | 0.0 | 0.842 | Bovine mastadenovirus B | 0.0 | 0.811 | Yes |
AC_000011.1 | 36519 | Adenoviridae | -0.0 | 0.57 | Mastadenovirus | -0.0 | 0.563 | Human mastadenovirus E | -0.0 | 0.353 | Yes |
AC_000189.1 | 34094 | Adenoviridae | -0.0 | 0.81 | Mastadenovirus | -0.0 | 0.803 | Porcine mastadenovirus A | -0.0 | 0.742 | Yes |
NC_000852.5 | 330611 | Phycodnaviridae | -0.0 | 0.121 | Chlorovirus | -0.0 | 0.115 | Unclassified | -0.0 | 0.014 | Yes |
NC_000855.1 | 11158 | Unclassified | 0.0 | 0.011 | Unclassified | 0.0 | 0.006 | Unclassified | 0.09 | 0.002 | Yes |
NC_000867.1 | 10079 | Unclassified | -0.0 | 0.018 | Unclassified | -0.0 | 0.01 | Unclassified | 0.089 | 0.002 | Yes |
NC_000899.1 | 45063 | Adenoviridae | 0.0 | 0.85 | Aviadenovirus | 0.0 | 0.844 | Fowl aviadenovirus D | 0.0 | 0.772 | Yes |
NC_000939.2 | 4415 | Tombusviridae | -0.0 | 0.061 | Aureusvirus | -0.0 | 0.055 | Aureusvirus dioscoreae | -0.0 | 0.053 | Yes |
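A quick way to get an overview of a result file is to count predictions per family. The sketch below assumes the output CSV is comma-separated and uses the column names shown in the example (`Family`, `Valid`); the function name is hypothetical and the data is a three-column in-memory excerpt rather than a real output file.

```python
import csv
import io
from collections import Counter


def family_counts(csv_file):
    """Count predictions per Family, skipping rows whose Valid column is not 'Yes'."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["Valid"].strip() == "Yes":
            counts[row["Family"].strip()] += 1
    return counts


# Stand-in for open("VirusTaxo_predictions.csv", newline="").
example = io.StringIO(
    "Accession,Family,Valid\n"
    "AC_000001.1,Adenoviridae,Yes\n"
    "NC_000852.5,Phycodnaviridae,Yes\n"
    "NC_000855.1,Unclassified,Yes\n"
    "NC_000899.1,Adenoviridae,Yes\n"
)
print(family_counts(example).most_common())
```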
- In the output file:
  - `Unclassified`: assigned when `Entropy` (default >= 0.5) or `Enrichment` (default <= 0.05 for Family and Genus; default <= 0.80 for Species prediction) falls outside the cutoff.
  - Lower `Entropy` (such as <= 0.5) indicates higher prediction certainty. You can decrease the `Entropy` cutoff for more confident predictions.
  - Higher `Enrichment_Score` (such as >= 0.8) indicates higher prediction certainty. You can increase the `Enrichment_Score` cutoff for more confident predictions. `Enrichment_Score` is the total number of k-mers mapped to the genera divided by the total number of k-mers in the query sequence.
  - The `Valid` column shows `Yes` if the prediction aligns with known taxonomic ranks; otherwise, it shows `No`. In rare cases a prediction can conflict with the known taxonomic ranks. It is generally recommended to exclude rows marked as `No` unless you have verified the taxonomic assignment and are confident in its accuracy.
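The two scores described above can be illustrated with a toy calculation. In this sketch the enrichment follows the definition given here (k-mers mapped to the winning taxon divided by all k-mers in the query). How VirusTaxo computes `Entropy` is not described in this README, so a normalized Shannon entropy over the per-taxon hit counts is used as a stand-in. The `kmer_to_taxon` dictionary, the function names, and the handling of values exactly at the cutoff are all assumptions.

```python
import math
from collections import Counter


def kmers(seq, k=16):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def score_query(seq, kmer_to_taxon, k=16):
    """Return (best_taxon, enrichment, entropy) for one query sequence.

    kmer_to_taxon is a hypothetical dict mapping a k-mer to a single taxon name.
    """
    total = 0
    hits = Counter()
    for kmer in kmers(seq, k):
        total += 1
        taxon = kmer_to_taxon.get(kmer)
        if taxon is not None:
            hits[taxon] += 1
    if total == 0 or not hits:
        return None, 0.0, 0.0
    best, best_hits = hits.most_common(1)[0]
    enrichment = best_hits / total
    # Assumed entropy: Shannon entropy of the hit distribution, scaled to [0, 1].
    n = sum(hits.values())
    entropy = -sum((c / n) * math.log(c / n) for c in hits.values())
    if len(hits) > 1:
        entropy /= math.log(len(hits))
    return best, enrichment, entropy


def classify(best, enrichment, entropy, enrichment_cutoff=0.05, entropy_cutoff=0.5):
    """Low enrichment or high entropy gives 'Unclassified' (boundary handling assumed)."""
    if best is None or enrichment < enrichment_cutoff or entropy > entropy_cutoff:
        return "Unclassified"
    return best


# Toy example with k=3 so the k-mers are easy to read.
toy_db = {"ACG": "TaxonA", "CGT": "TaxonA", "GTA": "TaxonB"}
print(classify(*score_query("ACGTAC", toy_db, k=3)))
```

In the toy example the hits are split between two taxa, so the entropy is high and the query is reported as `Unclassified` even though half of its k-mers map to one taxon.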
- To check accuracy, 12,613 complete virus genomes were used. In 5-fold cross-validation, 80% of the sequences were randomly chosen to create the database, and the other 20% were used to calculate the accuracy shown in the table below:
Taxonomic Rank | Accuracy | Unclassified | Enrichment cutoff | Entropy cutoff | k-mer |
---|---|---|---|---|---|
Family | 99.99% | 53% | >=0.05 | <=0.50 | 16 |
Genus | 97.26% | 57% | >=0.05 | <=0.50 | 16 |
Species | 85.32% | 98.5% | >=0.80 | <=0.50 | 16 |
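The 80%/20% partitioning used in 5-fold cross-validation can be sketched as follows. This is an illustration, not the evaluation script behind the table: the function names and the fixed seed are hypothetical, and computing accuracy only over sequences that were not left `Unclassified` is an assumption suggested by the separate Unclassified column.

```python
import random


def five_fold_splits(accessions, n_folds=5, seed=0):
    """Yield (train, test) accession lists; each test fold holds about 1/n_folds of the data."""
    shuffled = list(accessions)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [acc for j, fold in enumerate(folds) if j != i for acc in fold]
        yield train, test


def accuracy(predicted, truth):
    """Fraction of classified predictions that match the known label (assumed definition)."""
    scored = [(p, t) for p, t in zip(predicted, truth) if p != "Unclassified"]
    if not scored:
        return 0.0
    return sum(p == t for p, t in scored) / len(scored)


accessions = [f"ACC{i}" for i in range(100)]  # placeholder IDs
for train, test in five_fold_splits(accessions):
    print(len(train), len(test))
```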
- Prepare a metadata file in `csv` format. The metadata file must contain columns named `Accession`, `Family`, `Genus` and `Species`. Example of metadata file is here:
- The sequence `Accession` must match the metadata `Accession`. Example of input fasta file is here
- The `metadata.csv` must be within the database directory during prediction.
- Building database:
# --meta: provide your metadata file
# --seq: provide your fasta file
python3 build.py \
--meta ./Dataset/metadata.csv \
--seq ./Dataset/seq1k.fasta \
--k 16 \
--saving_path /path/
- Parameters
  - `meta`: Absolute or relative path of metadata file.
  - `seq`: Absolute or relative path of fasta sequence file.
  - `k`: The length of k-mer.
  - `saving_path`: Path to save database pickle files for Family, Genus and Species.
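Before building a custom database, it may be worth checking that the metadata file has the required columns and that every FASTA record has a matching metadata row. The sketch below is a hypothetical helper, not part of `build.py`; it assumes the accession is the first whitespace-delimited word of each FASTA header.

```python
import csv

REQUIRED_COLUMNS = ("Accession", "Family", "Genus", "Species")


def check_metadata(metadata_lines, fasta_lines):
    """Return a list of problems; an empty list means the inputs look consistent."""
    reader = csv.DictReader(metadata_lines)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["metadata is missing columns: " + ", ".join(missing)]
    known = {row["Accession"].strip() for row in reader}
    problems = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            if accession not in known:
                problems.append("FASTA record without metadata: " + accession)
    return problems


# With real files (paths taken from the build command):
#   with open("./Dataset/metadata.csv", newline="") as meta, \
#           open("./Dataset/seq1k.fasta") as seqs:
#       print(check_metadata(meta, seqs))
meta = ["Accession,Family,Genus,Species", "NC_1,FamA,GenA,SpA"]
fasta = [">NC_1 example", "ACGT", ">NC_2", "ACGT"]
print(check_metadata(meta, fasta))
```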
- VirusTaxo's database is built on known virus genomes and is designed to predict the taxonomy of virus sequences.
- Non-viral sequences may be misclassified as viral due to random k-mer matches in VirusTaxo predictions. To minimize the likelihood of such misclassifications, it is recommended to apply a higher `Enrichment` cutoff. This helps ensure that only sequences with stronger evidence of being viral are retained. Additionally:
  - Filter out sequences by mapping to host reference genomes before using VirusTaxo. This helps remove host-derived sequences, improving the accuracy of viral predictions and reducing potential false positives.
  - If your sample contains non-viral sequences, it is recommended to filter them out by using tools like blast or DeepVirFinder (`dvf.py -i contig.fasta -o ./`).
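A stricter enrichment cutoff can also be applied to an existing result file without rerunning the prediction. The following sketch assumes rows have already been read into dictionaries keyed by the output column names; the function name and the threshold of 0.2 are arbitrary illustrations, not recommended values.

```python
def passes_stricter_cutoff(row, min_family_enrichment=0.2):
    """Keep a row only if it is classified and its Family enrichment is high enough."""
    if row["Family"] == "Unclassified":
        return False
    return float(row["Family_Enrichment"]) >= min_family_enrichment


# Rows as csv.DictReader would return them (values taken from the example output).
rows = [
    {"Accession": "AC_000001.1", "Family": "Adenoviridae", "Family_Enrichment": "0.786"},
    {"Accession": "NC_000939.2", "Family": "Tombusviridae", "Family_Enrichment": "0.061"},
    {"Accession": "NC_000855.1", "Family": "Unclassified", "Family_Enrichment": "0.011"},
]
kept = [r["Accession"] for r in rows if passes_stricter_cutoff(r)]
print(kept)  # ['AC_000001.1']
```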
VirusTaxo | Database | Data Date | Sequences | Download |
---|---|---|---|---|
v2 Family, Genus, Species prediction | database.v3_2024 | Jan21_2024 | DNA=9384 & RNA=9067 | here |
v1 Genus prediction | database.v2_2024 | Jan21_2024 | DNA=9384 & RNA=9067 | here |
v1 Genus prediction | database.v1_2022 | Apr27_2022 | DNA=4421 & RNA=2529 | here |
Used in manuscript | database.v1_2022 | Apr27_2022 | DNA=4421 & RNA=2529 | here |
Rashedul Islam, PhD ([email protected])
- Rajan Saha Raju, Abdullah Al Nahid, Preonath Chondrow Dev, Rashedul Islam. VirusTaxo: Taxonomic classification of viruses from the genome sequence using k-mer enrichment. Genomics, Volume 114, Issue 4, July 2022.
- Rashedul Islam, Rajan Saha Raju, Nazia Tasnim, Istiak Hossain Shihab, Maruf Ahmed Bhuiyan, Yusha Araf, Tofazzal Islam. Choice of assemblers has a critical impact on de novo assembly of SARS-CoV-2 genome and characterizing variants. Briefings in Bioinformatics, Volume 22, Issue 5, bbab102, September 2021.
This project is licensed under the MIT license.