Outputs
- Virus identification
- Taxonomy classification
- Host prediction
- Lifestyle prediction
- Protein annotation
- End to end task
- Contamination
- vOTU grouping
- Phylogenetic tree
- Special usage of outputs
All outputs are written to the PATH_TO_OUT/final_prediction/ folder. The number of files depends on the --task you run.
final_prediction
├── phamer_prediction.tsv
└── phamer_supplementary
    ├── all_predicted_contigs.fa || DNA sequences longer than --length
    ├── all_predicted_protein.fa || Proteins predicted by prodigal-gv
    ├── gene_annotation.tsv || protein annotation based on blastp
    ├── predicted_virus.fa || Viral DNA sequences
    ├── predicted_virus_protein.fa || Viral proteins
    ├── alignment_results.tab || blastp results against db
    └── uncertain_sequences_for_contamination_task.fa || please run contamination task
The main output phamer_prediction.tsv is a tab-separated (TSV) file with six fields:
Accession Length Pred Proportion PhaMerScore PhaMerConfidence
example_0 29445 virus 0.1 1.0 lower than reject threshold
example_2 5971 virus 0.86 1.0 high-confidence
- Accession: the accession or the name of the input contigs.
- Length: the length of input contigs.
- Pred: virus or non-virus.
- Proportion: the proportion of the proteins that can be aligned to the virus database (from 0 to 1).
- PhaMerScore: the prediction score given by the deep learning model.
- PhaMerConfidence: the confidence of the prediction, determined by both Proportion and PhaMerScore. One of:
  - high-confidence
  - medium-confidence
  - low-confidence
  - lower than reject threshold (according to the --reject parameter, default: 0.1).
For viruses labeled low-confidence or lower than reject threshold, we recommend running --task contamination to check their sequence quality.
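The recommendation above can be turned into a small filter. The following is a minimal sketch using only the Python standard library; the inline sample mirrors the two example rows shown earlier, and the helper name `needs_contamination_check` is hypothetical, not part of PhaBOX.

```python
import csv
import io

# Inline sample with the same columns as phamer_prediction.tsv (tab-separated).
# In practice, open the real file instead of using this string.
SAMPLE = (
    "Accession\tLength\tPred\tProportion\tPhaMerScore\tPhaMerConfidence\n"
    "example_0\t29445\tvirus\t0.1\t1.0\tlower than reject threshold\n"
    "example_2\t5971\tvirus\t0.86\t1.0\thigh-confidence\n"
)

# Confidence labels for which the contamination task is recommended.
RECHECK = {"low-confidence", "lower than reject threshold"}


def needs_contamination_check(handle):
    """Return accessions predicted as virus whose confidence label is in RECHECK."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Accession"]
        for row in reader
        if row["Pred"] == "virus" and row["PhaMerConfidence"] in RECHECK
    ]


to_recheck = needs_contamination_check(io.StringIO(SAMPLE))
print(to_recheck)
```

For a real run, pass `open("phamer_prediction.tsv", newline="")` instead of the in-memory sample.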
final_prediction
├── phagcn_prediction.tsv
└── phagcn_supplementary
    ├── all_predicted_contigs.fa || DNA sequences longer than --length
    ├── all_predicted_protein.fa || Proteins predicted by prodigal-gv
    ├── alignment_results.tab || blastp results against db
    ├── gene_annotation.tsv || protein annotation based on blastp
    ├── phagcn_network_edges.tsv || network file for Cytoscape
    └── phagcn_network_nodes.tsv || network file for Cytoscape
The main output phagcn_prediction.tsv is a tab-separated (TSV) file with six fields:
Accession Length Lineage PhaGCNScore Genus GenusCluster
example_0 29445 superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes 1.00;1.00;1.00;1.00;1.00;1.00;0.58;0.58 - singleton
example_103 11376 superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes;genus:Jasminevirus 1.00;1.00;1.00;1.00;1.00;1.00;1.00 Jasminevirus known_genus
- Accession: the accession or the name of the input contigs.
- Length: the length of input contigs.
- Lineage: the predicted taxonomy lineage (NCBI version) of the contig. Ranks are separated by ';'.
- PhaGCNScore: the prediction score for each rank in the lineage. Scores are separated by ';'.
- Genus: the genus-level name of the contig ('-' means unknown).
- GenusCluster: if the Genus is '-', the program will assign a genus-level grouping result: group_idx (idx = 1, 2, 3, ...) or singleton. This can be viewed as genus-level OTUs based on the average shared protein identities between sequences.
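The Lineage and PhaGCNScore fields are both ';'-separated strings, so they are easy to split after loading. The sketch below parses the values of the example_103 row shown above; `parse_lineage` is a hypothetical helper name. In the example rows the number of scores does not always equal the number of named ranks, so the scores are kept as a separate list rather than paired with ranks.

```python
def parse_lineage(lineage):
    """Split 'rank:name;rank:name;...' into a dict of rank -> name (input order kept)."""
    ranks = {}
    for item in lineage.split(";"):
        rank, _, name = item.partition(":")
        ranks[rank] = name
    return ranks


# Values copied from the example_103 row.
lineage = (
    "superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;"
    "phylum:Uroviricota;class:Caudoviricetes;genus:Jasminevirus"
)
score_field = "1.00;1.00;1.00;1.00;1.00;1.00;1.00"

ranks = parse_lineage(lineage)
scores = [float(s) for s in score_field.split(";")]

# '-' mirrors the convention of the Genus column for an unknown genus.
genus = ranks.get("genus", "-")
print(genus, ranks["class"], scores)
```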
final_prediction
├── cherry_prediction.tsv
└── cherry_supplementary
    ├── all_predicted_contigs.fa || DNA sequences longer than --length
    ├── all_predicted_protein.fa || Proteins predicted by prodigal-gv
    ├── alignment_results.tab || blastp results against db
    ├── gene_annotation.tsv || protein annotation based on blastp
    ├── cherry_network_edges.tsv || network file for Cytoscape
    └── cherry_network_nodes.tsv || network file for Cytoscape
The main output cherry_prediction.tsv is a tab-separated (TSV) file with seven fields:
Accession Length Host CHERRYScore Method Host_NCBI_lineage Host_GTDB_lineage
example_95 16220 species:Streptomyces sp. PanSC9 0.91 CRISPR-based (DB) d__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp. PanSC9 d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp900105245
example_98 13996 species:Salinispora arenicola 0.91 CRISPR-based (DB) d__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Micromonosporales;f__Micromonosporaceae;g__Salinispora;s__Salinispora arenicola Not found
- Accession: the accession or the name of the input contigs.
- Length: the length of input contigs.
- Host: the predicted host (NCBI taxonomy) of the contigs. '-' means unknown host.
- CHERRYScore: the predicted score from the model.
- Method:
  - CRISPR-based (MAG): CRISPR alignment results from the provided MAGs (if any).
  - CRISPR-based (DB): CRISPR alignment results from the database.
  - AAI-based: host predicted based on virus similarity.
- Host_NCBI_lineage: full taxonomy lineage based on NCBI Taxonomy.
- Host_GTDB_lineage: full taxonomy lineage based on GTDB Taxonomy.
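The Host and lineage columns pack several values into one string, and the example rows show two special values: '-' for an unknown host and "Not found" for a missing GTDB lineage. A minimal sketch of how these could be unpacked is given below; `parse_host` and `parse_gtdb` are hypothetical helper names, and the inputs are copied from the example rows above.

```python
def parse_host(host):
    """Split 'rank:name' into (rank, name); return None for an unknown host ('-')."""
    if host == "-":
        return None
    rank, _, name = host.partition(":")
    return rank, name


def parse_gtdb(lineage):
    """Map GTDB-style prefixes (d, p, c, o, f, g, s) to names; 'Not found' gives {}."""
    if lineage == "Not found":
        return {}
    return dict(item.split("__", 1) for item in lineage.split(";"))


host = parse_host("species:Streptomyces sp. PanSC9")
gtdb = parse_gtdb(
    "d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Streptomycetales;"
    "f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp900105245"
)
print(host, gtdb["g"], parse_gtdb("Not found"))
```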
final_prediction
├── phatyp_prediction.tsv
└── phatyp_supplementary
    ├── all_predicted_contigs.fa || DNA sequences longer than --length
    ├── all_predicted_protein.fa || Proteins predicted by prodigal-gv
    ├── alignment_results.tab || blastp results against db
    └── gene_annotation.tsv || protein annotation based on blastp
The main output phatyp_prediction.tsv is a tab-separated (TSV) file with four fields:
Accession Length TYPE PhaTYPScore
example_0 29445 virulent 1.0
example_2 5971 temperate 1.0
- Accession: the accession or the name of the input contigs.
- Length: the length of input contigs.
- TYPE: virulent or temperate (virus).
- PhaTYPScore: the prediction score given by the deep learning model.
Please note that running the end_to_end, phamer, phagcn, phatyp, or cherry task will automatically run phavip. The output files are the same, but the supplementary files are written into the supplementary folder of the corresponding task.
final_prediction
├── phavip_prediction.tsv
└── phavip_supplementary
    ├── all_predicted_contigs.fa || DNA sequences longer than --length
    ├── all_predicted_protein.fa || Proteins predicted by prodigal-gv
    ├── alignment_results.tab || blastp results against db
    └── gene_annotation.tsv || protein annotation based on blastp
The main outputs are phavip_prediction.tsv and gene_annotation.tsv.
phavip_prediction.tsv is a tab-separated (TSV) file with five fields:
Accession Length Protein_num Annotated_num Annotation_rate
example_0 29445 210 20 0.10
example_1 10965 100 0 0.00
example_2 5971 35 30 0.86
- Accession: the accession or the name of the input contigs.
- Length: the length of input contigs.
- Protein_num: total number of predicted proteins.
- Annotated_num: number of proteins that have significant alignments.
- Annotation_rate: fraction of proteins that have annotations (Annotated_num / Protein_num, from 0 to 1).
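As a worked check of the Annotation_rate column, the three example rows above can be recomputed as Annotated_num divided by Protein_num, rounded to two decimals. This is a sketch of the arithmetic only; the helper name `annotation_rate` and the handling of contigs with zero predicted proteins are assumptions, not PhaBOX code.

```python
# (Accession, Protein_num, Annotated_num) copied from the example rows.
rows = [
    ("example_0", 210, 20),
    ("example_1", 100, 0),
    ("example_2", 35, 30),
]


def annotation_rate(protein_num, annotated_num):
    """Fraction of predicted proteins with a significant alignment, rounded to 2 decimals."""
    if protein_num == 0:
        # Assumption: report 0.0 when no proteins were predicted.
        return 0.0
    return round(annotated_num / protein_num, 2)


rates = {acc: annotation_rate(total, hit) for acc, total, hit in rows}
print(rates)
```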
gene_annotation.tsv is a tab-separated (TSV) file with nine fields:
Genome ORF Start End Strand GC Annotation pident coverage
example_0 example_0_1 1 72 -1 0.375 hypothetical protein (no hit) 0.00 0.00
example_0 example_0_2 74 1048 -1 0.55 DNA methyltransferase 40.70 1.00
example_0 example_0_3 1045 3228 -1 0.477 DNA methylase 45.30 1.00
- Genome: the accession or the name of the input contigs.
- ORF: the ID of the translated protein.
- Start: start position on the genome.
- End: end position on the genome.
- Strand: forward (1) or reverse (-1).
- GC: GC content.
- Annotation: the annotation of the proteins.
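  Please note that there are two kinds of hypothetical protein:
  - hypothetical protein (no hit): the protein has no alignment to the reference database.
  - hypothetical protein: the protein has an alignment, but the matched reference is annotated as "hypothetical protein".
- pident: percent identity of the blastp alignment (0.00 when there is no hit).
- coverage: coverage of the blastp alignment (0.00 when there is no hit).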
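The two kinds of hypothetical protein can be told apart by an exact string match on the Annotation column. The sketch below counts the three cases for the example rows shown above, using only the standard library; `annotation_kind` and the category labels are hypothetical names chosen for illustration.

```python
import csv
import io
from collections import Counter

# Inline sample with the same columns as gene_annotation.tsv (tab-separated).
SAMPLE = (
    "Genome\tORF\tStart\tEnd\tStrand\tGC\tAnnotation\tpident\tcoverage\n"
    "example_0\texample_0_1\t1\t72\t-1\t0.375\thypothetical protein (no hit)\t0.00\t0.00\n"
    "example_0\texample_0_2\t74\t1048\t-1\t0.55\tDNA methyltransferase\t40.70\t1.00\n"
    "example_0\texample_0_3\t1045\t3228\t-1\t0.477\tDNA methylase\t45.30\t1.00\n"
)


def annotation_kind(annotation):
    """Classify an Annotation value as 'no_hit', 'hypothetical', or 'annotated'."""
    if annotation == "hypothetical protein (no hit)":
        return "no_hit"
    if annotation == "hypothetical protein":
        return "hypothetical"
    return "annotated"


counts = Counter(
    annotation_kind(row["Annotation"])
    for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
)
print(dict(counts))
```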
final_prediction
├── final_prediction_summary.tsv
├── phamer_supplementary
│   ├── all_predicted_contigs.fa
│   ├── all_predicted_protein.fa
│   ├── gene_annotation.tsv || outputs of phavip
│   ├── predicted_virus.fa
│   ├── predicted_virus_protein.fa
│   ├── alignment_results.tab
│   └── uncertain_sequences_for_contamination_task.fa || please run contamination task
├── phagcn_supplementary
│   ├── phagcn_network_edges.tsv
│   └── phagcn_network_nodes.tsv
├── cherry_supplementary
│   ├── cherry_network_edges.tsv
│   └── cherry_network_nodes.tsv
└── phatyp_supplementary
In end-to-end mode, in addition to the aforementioned xxx_prediction.tsv files, a final_prediction_summary.tsv is generated by merging the outputs of all subprograms.
In addition, contigs predicted as non-virus are not passed to the downstream taxonomy, host, and lifestyle prediction tasks.
final_prediction
├── contamination_prediction.tsv
└── contamination_supplementary
    ├── proviruses.fa || potential proviruses
    ├── low_quality_virus.fa || low quality viruses
    ├── medium_quality_virus.fa || medium quality viruses
    ├── high_quality_virus.fa || high quality viruses
    ├── candidate_provirus.tsv || information about the proviruses
    └── marker_gene_from_contamination_search.tsv || marker gene annotation
The main output contamination_prediction.tsv is a tab-separated (TSV) file with nine fields:
Accession Length Total_genes Viral_genes Prokaryotic_genes Kmer_freq Contamination Provirus Pure_viral
example_270 6617 6 2 0 1.0 0 No High quality
example_271 17630 28 9 0 1.0 0 No High quality
- Accession: the accession or the name of the input contigs.
- Length: the length of input contigs.
- Total_genes: number of genes in the contigs (predicted by prodigal-gv)
- Viral_genes: number of viral marker genes
- Prokaryotic_genes: number of prokaryotic marker genes
- Kmer_freq: average frequency of 20-mers.
  - This value estimates the copy number of the genes; for 99.9% of viruses, Kmer_freq is below 1.25.
- Contamination:
- Provirus: whether the sequence is a provirus.
- Pure_viral: High quality, Medium quality, or Low quality.
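One way to use the Kmer_freq rule of thumb is to flag rows at or above 1.25 for manual inspection. The following is a minimal sketch with the standard library; the constant `KMER_FREQ_CUTOFF` and the helper `unusual_kmer_freq` are hypothetical names, and treating 1.25 as a hard cutoff is an assumption drawn from the description above rather than a documented PhaBOX rule.

```python
import csv
import io

# Inline sample with the same columns as contamination_prediction.tsv (tab-separated).
SAMPLE = (
    "Accession\tLength\tTotal_genes\tViral_genes\tProkaryotic_genes\t"
    "Kmer_freq\tContamination\tProvirus\tPure_viral\n"
    "example_270\t6617\t6\t2\t0\t1.0\t0\tNo\tHigh quality\n"
    "example_271\t17630\t28\t9\t0\t1.0\t0\tNo\tHigh quality\n"
)

# Assumption: use the 1.25 figure from the field description as a cutoff.
KMER_FREQ_CUTOFF = 1.25


def unusual_kmer_freq(value, cutoff=KMER_FREQ_CUTOFF):
    """Return True when a Kmer_freq value is at or above the cutoff."""
    return float(value) >= cutoff


flagged = [
    row["Accession"]
    for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
    if unusual_kmer_freq(row["Kmer_freq"])
]
print(flagged)
```

Neither example row is flagged, since both have a Kmer_freq of 1.0.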
final_prediction
├── ANI_based_vOTU.tsv (ANI-based)
└── AAI_based_vOTU.tsv (AAI-based)
The main output xxx_based_vOTU.tsv is a tab-separated (TSV) file with four fields:
Sequence vOTU Representative Length
contig_33 group_19 contig_33 49448
contig_34 group_19 contig_33 4484
- Sequence: the accession or the name of the input contigs.
- vOTU: the cluster ID.
- Representative: the representative genome.
- Length: the length of input contigs.
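Since each row assigns one sequence to one vOTU, the table can be regrouped into one entry per cluster. A minimal sketch with the standard library follows, using the two example rows above; the variable names are illustrative only.

```python
import csv
import io
from collections import defaultdict

# Inline sample with the same columns as xxx_based_vOTU.tsv (tab-separated).
SAMPLE = (
    "Sequence\tvOTU\tRepresentative\tLength\n"
    "contig_33\tgroup_19\tcontig_33\t49448\n"
    "contig_34\tgroup_19\tcontig_33\t4484\n"
)

members = defaultdict(list)   # vOTU -> list of member sequences
representative = {}           # vOTU -> representative sequence

for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"):
    members[row["vOTU"]].append(row["Sequence"])
    representative[row["vOTU"]] = row["Representative"]

for votu, seqs in members.items():
    print(votu, representative[votu], seqs)
```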
final_prediction
├── combined_marker.msa (if msa == 'Y') || concatenated MSA of the different markers
├── combined.tree (if tree == 'Y') || phylogenetic tree built with FastTree
└── tree_supplementary
    └── finded_marker_xxx_combined_db.fa || the found markers and the database markers
The metadata of the reference proteins can be found in the PhaBOX database at phabox_db_v2/marker/marker_stats.tsv.
Some PhaBOX outputs can be used directly to draw figures for your research. A few examples are shown below.
The protein annotation file gene_annotation.tsv can be used to plot the gene organization of each contig with PyGenomeViz.
# Make sure PyGenomeViz and pandas are installed:
# pip install pygenomeviz pandas

import pandas as pd
from pygenomeviz import GenomeViz

# Load the protein annotation table
data = pd.read_csv('gene_annotation.tsv', sep='\t')

# Initialize GenomeViz
gv = GenomeViz()

# Add one feature track per contig, then one feature per ORF.
# add_feature_track / add_feature follow the PyGenomeViz documentation;
# check them against the version you have installed.
for genome, orfs in data.groupby('Genome', sort=False):
    # The contig length is not stored in this file, so the largest End
    # coordinate is used as the track size (an approximation)
    track = gv.add_feature_track(str(genome), int(orfs['End'].max()))
    for row in orfs.itertuples(index=False):
        track.add_feature(
            int(row.Start),
            int(row.End),
            int(row.Strand),
            label=str(row.Annotation),
        )

# Save the figure to a file
gv.savefig('genome_viz.png')

# In a notebook, gv.plotfig() returns the Matplotlib figure for display
The network files xxx_network_edges.tsv and xxx_network_nodes.tsv can be imported into Cytoscape.
Step 1 Import the network (edges):
- Go to File > Import > Network from File
- Select xxx_network_edges.tsv
Step 2 Import the node table:
- Go to File > Import > Table from File
- Select xxx_network_nodes.tsv
Step 3 Adjust the visualization:
- Choose a layout
- Color the nodes
The tree file combined.tree can be used as input to iTOL.
The metadata of the reference genes can be found in the PhaBOX2 database at marker/marker_stats.tsv.