Skip to content

Latest commit

 

History

History
106 lines (83 loc) · 4.64 KB

CLASSIFY.md

File metadata and controls

106 lines (83 loc) · 4.64 KB

Taxonomic classification of human gut viruses

The code and database described here will allow you to obtain a taxonomic label for your gut virus based on the UHGV taxonomy. This is useful to determine novelty relative to database, identify characteristics of the nearest viral group, and allow you to identify other phylogenetically related viruses in the database.

Installation

Install program using git and pip (add --user if you don't have root access):
pip install git+https://github.com/snayfach/UHGV.git

Install external dependencies using conda:
conda install -c bioconda prodigal-gv diamond blast -y

View available modules:
uhgv-tools -h

Download and unpack the latest database:
uhgv-tools download .

UHGV-tools: download
[1/5] Checking latest version of database...
[2/5] Downloading 'uhgv-db'...
[3/5] Extracting 'uhgv-db'...
[4/5] Building BLASTN database...
[5/5] Building DIAMOND database...
Run time: 121.6 seconds
Peak mem: 1.58 GB

View command line usage for classify module:
uhgv-tools classify -h

usage: uhgv-class.py [-h] -i PATH -o PATH -d PATH [-t THREADS] [-c]

options: -h, --help show this help message and exit

required arguments: -i PATH Path to nucleotide seqs
-o PATH Path to output directory
-d PATH Path to database directory
-t THREADS Number of threads to run program with (1)
--continue Continue where program left off
--quiet Suppress logging messages

Example usage

Download a test dataset of 5 phages from Nishijima et al. using wget:
wget https://raw.githubusercontent.com/snayfach/UHGV/main/example/viral_sequences.fna -O viral_sequences.fna

Classify sequences, replacing </path/to/uhgv-db> as appropriate:
uhgv-tools classify -i viral_sequences.fna -o output -d </path/to/uhgv-db> -t 10

UHGV-tools v0.0.1: classify
[1/10] Reading input sequences
[2/10] Reading database sequences
[3/10] Estimating ANI with blastn
[4/10] Identifying genes using prodigal-gv
[5/10] Performing self alignment
[6/10] Aligning proteins to database
[7/10] Calculating amino acid similarity scores
[8/10] Finding top database hits
[9/10] Performing phylogenetic assignment
[10/10] Writing output file(s)

There are two main output files:

  • output/classify_summary.tsv: information related to classification
  • output/taxon_info.tsv: details about the classified taxa (ex: lifestyle, genome size, host)

Here are field definitions and example values for classify_summary.tsv:

Field Description Example
genome_id user genome identifier 0008_k141_99927
genome_length length in bp 96989
genome_num_genes count of CDS 106
taxon_id UHGV taxon identifier vSUBGEN-22354
class_method nucleotide or protein based classification protein
class_rank lowest classified rank subgenus
ani_reference nearest reference based on ANI UHGV-0030436
ani_identity nucleotide identity 93.65
ani_query_af % of query covered 86.6
ani_target_af % of target covered 83.58
ani_taxonomy taxonomy of reference genome vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988
aai_reference nearest reference based on AAI UHGV-0030436
aai_shared_genes number of proteins aligned 93
aai_identity amino acid identity 89.33
aai_score normalized, cumulative bitscore 82.57
aai_taxonomy taxonomy of reference genome vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354;vOTU-000988

Here field definitions and example values for taxon_info.tsv:

Field Description Example
genome_id user genome identifier 0008_k141_99927
taxon_id UHGV taxon identifier vSUBGEN-22354
taxon_lineage UHGV taxon lineage vFAM-00050;vSUBFAM-00057;vGENUS-00180;vSUBGEN-22354
host_lineage Consensus GTDB host lineage d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella (100.0)
ictv_lineage Consensus ICTV taxon lineage r__Duplodnaviria;k__Heunggongvirae;p__Uroviricota;c__Caudoviricetes;o__Crassvirales;f__Beta-crassviridae (100.0)
lifestyle Consensus virus lifestyle virulent (100.0)
genome_length_median median genome length of viruses in lineage 100566.0
genome_length_iqr interquartile range of genome length 100566.0 - 100566.0