The code and database described here allow you to assign a taxonomic label to your gut virus based on the UHGV taxonomy. This is useful for determining novelty relative to the database, identifying characteristics of the nearest viral group, and finding other phylogenetically related viruses in the database.
Install the program using git and pip (add `--user` if you don't have root access):
pip install git+https://github.com/snayfach/UHGV.git
Install external dependencies using conda:
conda install -c bioconda prodigal-gv diamond blast -y
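
Before running the tool, it can help to confirm that the external programs are actually on your `PATH`. The sketch below is not part of UHGV-tools. It assumes the conda packages provide executables named `prodigal-gv`, `diamond`, `blastn`, and `makeblastdb` (the last two from the `blast` package), and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Executable names assumed to be provided by the conda packages above
# (blastn and makeblastdb are assumed to ship with the "blast" package).
REQUIRED_TOOLS = ["prodigal-gv", "diamond", "blastn", "makeblastdb"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All external dependencies found.")
```

If anything is reported as missing, activating the conda environment used for the install is usually the first thing to check.
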
View available modules:
uhgv-tools -h
Download and unpack the latest database:
uhgv-tools download .
UHGV-tools: download
[1/5] Checking latest version of database...
[2/5] Downloading 'uhgv-db'...
[3/5] Extracting 'uhgv-db'...
[4/5] Building BLASTN database...
[5/5] Building DIAMOND database...
Run time: 121.6 seconds
Peak mem: 1.58 GB
View command-line usage for the `classify` module:
uhgv-tools classify -h
usage: uhgv-class.py [-h] -i PATH -o PATH -d PATH [-t THREADS] [-c]
options:
  -h, --help  show this help message and exit
required arguments:
  -i PATH     Path to nucleotide seqs
  -o PATH     Path to output directory
  -d PATH     Path to database directory
  -t THREADS  Number of threads to run program with (1)
  --continue  Continue where program left off
  --quiet     Suppress logging messages
Download a test dataset of 5 phages from Nishijima et al. using wget:
wget https://raw.githubusercontent.com/snayfach/UHGV/main/example/viral_sequences.fna -O viral_sequences.fna
Classify sequences, replacing `</path/to/uhgv-db>` as appropriate:
uhgv-tools classify -i viral_sequences.fna -o output -d </path/to/uhgv-db> -t 10
UHGV-tools v0.0.1: classify
[1/10] Reading input sequences
[2/10] Reading database sequences
[3/10] Estimating ANI with blastn
[4/10] Identifying genes using prodigal-gv
[5/10] Performing self alignment
[6/10] Aligning proteins to database
[7/10] Calculating amino acid similarity scores
[8/10] Finding top database hits
[9/10] Performing phylogenetic assignment
[10/10] Writing output file(s)
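
To classify several FASTA files in one go, the command above can be assembled from Python. This is a minimal sketch that only uses the `-i`, `-o`, `-d`, and `-t` flags shown in the usage message. The function name `build_classify_command` and the example file names are hypothetical.

```python
import shlex


def build_classify_command(fasta, out_dir, db_dir, threads=1):
    """Return the uhgv-tools classify command as an argument list."""
    return [
        "uhgv-tools", "classify",
        "-i", str(fasta),
        "-o", str(out_dir),
        "-d", str(db_dir),
        "-t", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical input files; replace these with your own paths.
    for fasta in ["sample_a.fna", "sample_b.fna"]:
        out_dir = fasta.rsplit(".", 1)[0] + "_output"
        cmd = build_classify_command(fasta, out_dir, "/path/to/uhgv-db", threads=10)
        # Only print the command here; to execute it, pass `cmd` to
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Writing each input to its own output directory keeps the result files of different runs from overwriting one another.
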
There are two main output files:
- `output/classify_summary.tsv`: information related to classification
- `output/taxon_info.tsv`: details about the classified taxa (e.g., lifestyle, genome size, host)
Here are the field definitions and example values for `classify_summary.tsv`:
Field | Description | Example |
---|---|---|
genome_id | user genome identifier | 0008_k141_99927 |
genome_length | length in bp | 96989 |
genome_num_genes | count of CDS | 106 |
taxon_id | UHGV taxon identifier | vSUBGEN-22354 |
class_method | nucleotide or protein based classification | protein |
class_rank | lowest classified rank | subgenus |
ani_reference | nearest reference based on ANI | UHGV-0030436 |
ani_identity | nucleotide identity | 93.65 |
ani_query_af | % of query covered | 86.6 |
ani_target_af | % of target covered | 83.58 |
ani_taxonomy | taxonomy of reference genome | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988 |
aai_reference | nearest reference based on AAI | UHGV-0030436 |
aai_shared_genes | number of proteins aligned | 93 |
aai_identity | amino acid identity | 89.33 |
aai_score | normalized, cumulative bitscore | 82.57 |
aai_taxonomy | taxonomy of reference genome | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988 |
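
As a rough illustration of how these fields might be consumed downstream, the sketch below reads tab-separated rows and splits a taxonomy string into its ranks. It assumes the file has a header row whose column names match the field names in the table. The embedded example row reuses the example values above with only a subset of columns, and the helper names `split_taxonomy` and `read_summary` are hypothetical.

```python
import csv
import io

# One example row built from the example values in the table above
# (only a subset of columns is included to keep the sketch short).
EXAMPLE_TSV = (
    "genome_id\tgenome_length\ttaxon_id\tclass_method\tclass_rank\t"
    "ani_identity\taai_identity\tani_taxonomy\n"
    "0008_k141_99927\t96989\tvSUBGEN-22354\tprotein\tsubgenus\t"
    "93.65\t89.33\t"
    "vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988\n"
)


def split_taxonomy(taxonomy):
    """Map each rank prefix (e.g. 'vFAM') to its full taxon identifier."""
    ranks = {}
    for taxon in taxonomy.split(";"):
        taxon = taxon.strip()
        if taxon:
            ranks[taxon.split("-", 1)[0]] = taxon
    return ranks


def read_summary(handle):
    """Read tab-separated rows as dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_summary(io.StringIO(EXAMPLE_TSV))
for row in rows:
    lineage = split_taxonomy(row["ani_taxonomy"])
    print(row["genome_id"], row["class_rank"], lineage.get("vGENUS"))
```

For a real run, the `io.StringIO` object would be replaced with `open("output/classify_summary.tsv", newline="")`.
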
Here are the field definitions and example values for `taxon_info.tsv`:
Field | Description | Example |
---|---|---|
genome_id | user genome identifier | 0008_k141_99927 |
taxon_id | UHGV taxon identifier | vSUBGEN-22354 |
taxon_lineage | UHGV taxon lineage | vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354 |
host_lineage | Consensus GTDB host lineage | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella (100.0) |
ictv_lineage | Consensus ICTV taxon lineage | r__Duplodnaviria;k__Heunggongvirae;p__Uroviricota;c__Caudoviricetes;o__Crassvirales;f__Beta-crassviridae (100.0) |
lifestyle | Consensus virus lifestyle | virulent (100.0) |
genome_length_median | median genome length of viruses in lineage | 100566.0 |
genome_length_iqr | interquartile range of genome length | 100566.0 - 100566.0 |
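
Several of these fields pack more than one value into a single string. The sketch below shows one way to unpack them, based only on the layout of the example values above. Treating the number in parentheses as a support value for the consensus is an assumption, and the helper names `parse_consensus`, `parse_lineage`, and `parse_iqr` are hypothetical.

```python
import re

# Pattern for "<value> (<number>)", the layout seen in the example values above.
CONSENSUS_RE = re.compile(r"^(?P<value>.*?)\s*\((?P<support>[0-9.]+)\)\s*$")


def parse_consensus(field):
    """Split 'virulent (100.0)' into ('virulent', 100.0).

    Returns (field, None) when no trailing number in parentheses is found.
    """
    match = CONSENSUS_RE.match(field)
    if match is None:
        return field.strip(), None
    return match.group("value"), float(match.group("support"))


def parse_lineage(lineage):
    """Map rank prefixes such as 'g' to names, e.g. {'g': 'Prevotella'}."""
    ranks = {}
    for part in lineage.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep:
            ranks[prefix] = name
    return ranks


def parse_iqr(field):
    """Split '100566.0 - 100566.0' into a (low, high) tuple of floats."""
    low, high = field.split(" - ")
    return float(low), float(high)


host, support = parse_consensus(
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
    "f__Bacteroidaceae;g__Prevotella (100.0)"
)
print(parse_lineage(host)["g"], support)
print(parse_consensus("virulent (100.0)"))
print(parse_iqr("100566.0 - 100566.0"))
```
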