Outputs

SHANG Jiayu edited this page Nov 15, 2024 · 24 revisions

All outputs can be found in the PATH_TO_OUT/final_prediction/ folder. The number of files depends on the --task you run.

📕 PhaMer (Virus identification)

final_prediction
├── phamer_prediction.tsv
└── phamer_supplementary
    ├── all_predicted_contigs.fa    || DNA sequences > --length
    ├── all_predicted_protein.fa    || Proteins predicted by prodigal-gv
    ├── gene_annotation.tsv         || protein annotation based on blastp
    ├── predicted_virus.fa          || Viral DNA sequences
    ├── predicted_virus_protein.fa  || Viral proteins
    ├── alignment_results.tab       || blastp results against db
    └── uncertain_sequences_for_contamination_task.fa      || please run contamination task

The main output phamer_prediction.tsv is generated in tab-separated values (TSV) format composed of six fields:

Accession       Length  Pred    Proportion      PhaMerScore     PhaMerConfidence
example_0       29445   virus   0.1             1.0             lower than reject threshold
example_2       5971    virus   0.86            1.0             high-confidence
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Pred: virus or non-virus.
  4. Proportion: the proportion of the proteins that can be aligned to the virus database (from 0 to 1).
  5. PhaMerScore: the prediction score given by the deep learning model.
  6. PhaMerConfidence: the confidence of prediction, determined by both Proportion and PhaMerScore.
    • high-confidence
    • medium-confidence
    • low-confidence
    • lower than reject threshold (according to the --reject parameter, default: 0.1).

For viruses labeled low-confidence or lower than reject threshold, we recommend running --task contamination to check their sequence quality.
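As a minimal sketch of how this recommendation could be applied, the snippet below reads the table and lists the predicted viruses whose confidence label suggests a contamination check. The helper name `contigs_to_recheck` and the constant `NEEDS_CHECK` are hypothetical, and the inline sample simply repeats the two example rows above.

```python
import csv
import io

# Sample rows copied from the example table above; in practice, open
# PATH_TO_OUT/final_prediction/phamer_prediction.tsv instead.
SAMPLE = (
    "Accession\tLength\tPred\tProportion\tPhaMerScore\tPhaMerConfidence\n"
    "example_0\t29445\tvirus\t0.1\t1.0\tlower than reject threshold\n"
    "example_2\t5971\tvirus\t0.86\t1.0\thigh-confidence\n"
)

# Confidence labels for which the contamination task is recommended.
NEEDS_CHECK = {"low-confidence", "lower than reject threshold"}


def contigs_to_recheck(handle):
    """Return accessions of predicted viruses with a weak confidence label."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Accession"]
        for row in reader
        if row["Pred"] == "virus" and row["PhaMerConfidence"] in NEEDS_CHECK
    ]


recheck = contigs_to_recheck(io.StringIO(SAMPLE))
print(recheck)  # ['example_0']
```

To run it on real output, pass an open file handle such as `open("phamer_prediction.tsv", newline="")` instead of the in-memory sample.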

📕 PhaGCN (Taxonomy)

final_prediction
├── phagcn_prediction.tsv
└── phagcn_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    ├── gene_annotation.tsv        || protein annotation based on blastp
    ├── phagcn_network_edges.tsv   || network file for cytoscape
    └── phagcn_network_nodes.tsv   || network file for cytoscape

The main output phagcn_prediction.tsv is generated in tab-separated values (TSV) format composed of six fields:

Accession       Length  Lineage PhaGCNScore     Genus   GenusCluster
example_0       29445   superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes 1.00;1.00;1.00;1.00;1.00;1.00;0.58;0.58 -       singleton
example_103     11376   superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes;genus:Jasminevirus      1.00;1.00;1.00;1.00;1.00;1.00;1.00      Jasminevirus      known_genus
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Lineage: the predicted taxonomy lineage (NCBI version) of the contigs. Ranks are separated by ';'.
  4. PhaGCNScore: the predicted score for each rank in the lineage. Scores are separated by ';'.
  5. Genus: the genus-level name of the contig ('-' means unknown).
  6. GenusCluster: if the Genus is '-', the program assigns a genus-level grouping result: group_idx (idx = 1, 2, 3, ...) or singleton. These groups can be viewed as genus-level OTUs based on the average shared protein identity between sequences.
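The ';'-separated Lineage and PhaGCNScore fields can be split with a few lines of Python. The sketch below is only an illustration: `parse_lineage` and `parse_scores` are hypothetical helper names, and the input strings are copied from the example_103 row above.

```python
def parse_lineage(lineage):
    """Turn 'rank:name;rank:name' into a dict that keeps the rank order."""
    ranks = {}
    for item in lineage.split(";"):
        rank, _, name = item.partition(":")
        ranks[rank] = name
    return ranks


def parse_scores(scores):
    """Turn '1.00;0.58' into a list of floats."""
    return [float(value) for value in scores.split(";")]


# Lineage and PhaGCNScore values copied from the example_103 row above.
lineage = (
    "superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;"
    "phylum:Uroviricota;class:Caudoviricetes;genus:Jasminevirus"
)
ranks = parse_lineage(lineage)
scores = parse_scores("1.00;1.00;1.00;1.00;1.00;1.00;1.00")

print(ranks["class"])           # Caudoviricetes
print(ranks.get("genus", "-"))  # Jasminevirus
print(len(scores))              # 7
```

In the example rows the number of scores is larger than the number of named ranks, so this sketch deliberately does not pair scores with ranks.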

📕 CHERRY (Host)

final_prediction
├── cherry_prediction.tsv
└── cherry_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    ├── gene_annotation.tsv        || protein annotation based on blastp
    ├── cherry_network_edges.tsv   || network file for cytoscape
    └── cherry_network_nodes.tsv   || network file for cytoscape

The main output cherry_prediction.tsv is generated in tab-separated values (TSV) format composed of seven fields:

Accession       Length  Host                                            CHERRYScore     Method    Host_NCBI_lineage	Host_GTDB_lineage
example_95	16220	species:Streptomyces sp. PanSC9	0.91	CRISPR-based (DB)	d__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp. PanSC9	d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp900105245
example_98	13996	species:Salinispora arenicola	0.91	CRISPR-based (DB)	d__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Micromonosporales;f__Micromonosporaceae;g__Salinispora;s__Salinispora arenicola	Not found
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Host: the predicted host (NCBI taxonomy) of the contigs. '-' means unknown host.
  4. CHERRYScore: the predicted score from the model.
  5. Method: how the host was predicted.
    • CRISPR-based (MAG): CRISPR alignment results from the provided MAGs (if any).
    • CRISPR-based (DB): CRISPR alignment results from the database.
    • AAI-based: host predicted from virus similarity.
  6. Host_NCBI_lineage: full taxonomy lineage based on NCBI Taxonomy.
  7. Host_GTDB_lineage: full taxonomy lineage based on GTDB Taxonomy.
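The lineage columns use the prefixed `d__...;p__...` style shown in the example rows, and a missing GTDB lineage appears as `Not found`. A minimal sketch for turning such a string into a rank lookup follows; `parse_gtdb_lineage` is a hypothetical helper name, and the input is copied from the example_95 row.

```python
def parse_gtdb_lineage(lineage):
    """Map rank prefixes (d, p, c, o, f, g, s) to their names.

    Returns an empty dict when the lineage is reported as 'Not found'.
    """
    if lineage == "Not found":
        return {}
    return dict(item.split("__", 1) for item in lineage.split(";"))


# Host_GTDB_lineage value copied from the example_95 row above.
gtdb = parse_gtdb_lineage(
    "d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Streptomycetales;"
    "f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp900105245"
)
missing = parse_gtdb_lineage("Not found")

print(gtdb["g"])  # Streptomyces
print(missing)    # {}
```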

📕 PhaTYP (Lifestyle)

final_prediction
├── phatyp_prediction.tsv
└── phatyp_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    └── gene_annotation.tsv        || protein annotation based on blastp

The main output phatyp_prediction.tsv is generated in tab-separated values (TSV) format composed of four fields:

Accession       Length  TYPE            PhaTYPScore
example_0       29445   virulent        1.0
example_2       5971    temperate       1.0
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. TYPE: virulent or temperate (virus).
  4. PhaTYPScore: the prediction score given by the deep learning model.

📕 PhaVIP (Annotation)

Please note that running the end_to_end, phamer, phagcn, phatyp, or cherry task automatically runs phavip. The output files are the same, but the supplementary files are written to the supplementary folder of the corresponding task.

final_prediction
├── phavip_prediction.tsv
└── phavip_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    └── gene_annotation.tsv        || protein annotation based on blastp

The main outputs are phavip_prediction.tsv and gene_annotation.tsv.

phavip_prediction.tsv is generated in tab-separated values (TSV) format composed of five fields:

Accession  Length  Protein_num     Annotated_num   Annotation_rate
example_0       29445   210     20      0.10
example_1       10965   100     0       0.00
example_2       5971    35      30      0.86
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Protein_num: total number of predicted proteins.
  4. Annotated_num: number of proteins that have significant alignments.
  5. Annotation_rate: fraction of proteins that have annotations (Annotated_num divided by Protein_num, from 0 to 1).
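The three example rows are consistent with Annotation_rate being Annotated_num divided by Protein_num, rounded to two decimals. The sketch below reproduces that arithmetic under this assumption; `annotation_rate` is a hypothetical helper, not a PhaBOX function.

```python
def annotation_rate(annotated_num, protein_num):
    """Fraction of annotated proteins, rounded to two decimals."""
    if protein_num == 0:
        return 0.0  # avoid dividing by zero for contigs without proteins
    return round(annotated_num / protein_num, 2)


# (Accession, Protein_num, Annotated_num) copied from the example table above.
rows = [("example_0", 210, 20), ("example_1", 100, 0), ("example_2", 35, 30)]
rates = {name: annotation_rate(annotated, total) for name, total, annotated in rows}

print(rates)  # {'example_0': 0.1, 'example_1': 0.0, 'example_2': 0.86}
```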

gene_annotation.tsv is generated in tab-separated values (TSV) format composed of nine fields:

Genome  ORF     Start   End     Strand  GC      Annotation      pident  coverage
example_0       example_0_1     1       72      -1      0.375   hypothetical protein (no hit)   0.00    0.00
example_0       example_0_2     74      1048    -1      0.55    DNA methyltransferase   40.70   1.00
example_0       example_0_3     1045    3228    -1      0.477   DNA methylase   45.30   1.00
  1. Genome: the accession or the name of the input contigs.
  2. ORF: the ID of the translated protein.
  3. Start: start position on the genome.
  4. End: end position on the genome.
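  5. Strand: forward (1) or reverse (-1).
  6. GC: GC content (from 0 to 1).
  7. Annotation: the annotation of the proteins.
  8. pident: percent identity of the blastp alignment (0.00 when there is no hit).
  9. coverage: coverage of the blastp alignment (0.00 when there is no hit).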

Please note that there are two kinds of hypothetical protein:

  • hypothetical protein (no hit): the protein has no alignment to the reference database.
  • hypothetical protein: the protein has an alignment, but the annotation of the hit is "hypothetical protein".
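When summarizing annotations, it can help to keep these two cases apart from proteins with a functional annotation. A minimal sketch, assuming the Annotation strings are exactly as listed above; `annotation_status` and its category names are hypothetical.

```python
from collections import Counter


def annotation_status(annotation):
    """Classify an Annotation value into one of three categories."""
    if annotation == "hypothetical protein (no hit)":
        return "no_hit"
    if annotation == "hypothetical protein":
        return "hit_without_function"
    return "annotated"


# Annotation values copied from the three example rows above.
annotations = [
    "hypothetical protein (no hit)",
    "DNA methyltransferase",
    "DNA methylase",
]
counts = Counter(annotation_status(a) for a in annotations)

print(counts["no_hit"], counts["annotated"])  # 1 2
```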

📚 End to end task

final_prediction
├── final_prediction_summary.tsv
├── phamer_supplementary
│   ├── all_predicted_contigs.fa
│   ├── all_predicted_protein.fa
│   ├── gene_annotation.tsv         || outputs of phavip
│   ├── predicted_virus.fa
│   ├── predicted_virus_protein.fa
│   ├── alignment_results.tab
│   └── uncertain_sequences_for_contamination_task.fa      || please run contamination task
├── phagcn_supplementary
│   ├── phagcn_network_edges.tsv
│   └── phagcn_network_nodes.tsv
├── cherry_supplementary
│   ├── cherry_network_edges.tsv
│   └── cherry_network_nodes.tsv
└── phatyp_supplementary

In the end-to-end mode, in addition to the aforementioned xxx_prediction.tsv files, a final_prediction_summary.tsv is generated by merging the outputs of all subprograms.

In addition, contigs predicted as non-virus are excluded from the subsequent taxonomy, host, and lifestyle prediction tasks.

📗 Contamination

final_prediction
├── contamination_prediction.tsv
└── contamination_supplementary
    ├── proviruses.fa                               || potential proviruses
    ├── low_quality_virus.fa                        || low quality viruses
    ├── medium_quality_virus.fa                     || medium quality viruses
    ├── high_quality_virus.fa                       || high quality viruses
    ├── candidate_provirus.tsv                      || information of the provirus
    └── marker_gene_from_contamination_search.tsv   || marker gene annotation

The main output contamination_prediction.tsv is generated in tab-separated values (TSV) format composed of nine fields:

Accession      Length   Total_genes  Viral_genes  Prokaryotic_genes  Kmer_freq  Contamination  Provirus   Pure_viral
example_270    6617            6            2            0           1.0        0              No         High quality
example_271    17630           28           9            0           1.0        0              No         High quality

  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Total_genes: number of genes in the contigs (predicted by prodigal-gv)
  4. Viral_genes: number of viral marker genes
  5. Prokaryotic_genes: number of prokaryotic marker genes
  6. Kmer_freq: average frequency of 20-mers.
    • This value estimates the copy number of the genes; for 99.9% of viruses, Kmer_freq is less than 1.25.
  7. Contamination:
  8. Provirus: Whether the sequence is a provirus
  9. Pure_viral: High quality or Medium quality or Low quality
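The Kmer_freq guideline above can be turned into a simple screen. The sketch below lists contigs at or above a cutoff; using 1.25 as a flagging cutoff is an assumption derived from the note above rather than a documented PhaBOX rule, and `high_kmer_freq` is a hypothetical helper. The sample repeats the two example rows, written with tab separators.

```python
import csv
import io

# Example rows from the table above, written with tab separators.
SAMPLE = (
    "Accession\tLength\tTotal_genes\tViral_genes\tProkaryotic_genes\t"
    "Kmer_freq\tContamination\tProvirus\tPure_viral\n"
    "example_270\t6617\t6\t2\t0\t1.0\t0\tNo\tHigh quality\n"
    "example_271\t17630\t28\t9\t0\t1.0\t0\tNo\tHigh quality\n"
)

# Assumed cutoff, taken from the 1.25 guideline in the field description.
KMER_FREQ_CUTOFF = 1.25


def high_kmer_freq(handle, cutoff=KMER_FREQ_CUTOFF):
    """Return accessions whose Kmer_freq is at or above the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Accession"]
        for row in reader
        if float(row["Kmer_freq"]) >= cutoff
    ]


flagged = high_kmer_freq(io.StringIO(SAMPLE))
print(flagged)  # []
```

Both example contigs have a Kmer_freq of 1.0, so nothing is flagged at the assumed cutoff.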

📘 vOTU grouping

final_prediction
├── ANI_based_vOTU.tsv (ANI-based)
└── AAI_based_vOTU.tsv (AAI-based)

The main output xxx_based_vOTU.tsv is generated in tab-separated values (TSV) format composed of four fields:

Sequence        vOTU            Representative  Length
contig_33       group_19        contig_33       49448
contig_34       group_19        contig_33       4484
  1. Sequence: the accession or the name of the input contigs.
  2. vOTU: the cluster ID.
  3. Representative: the representative genome.
  4. Length: the length of input contigs.
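A common next step is to collect the members of each vOTU together with its representative. The following is a minimal sketch using the column names from the example header; `group_votus` is a hypothetical helper, and the sample repeats the two example rows.

```python
import csv
import io
from collections import defaultdict

# Example rows from the table above, written with tab separators.
SAMPLE = (
    "Sequence\tvOTU\tRepresentative\tLength\n"
    "contig_33\tgroup_19\tcontig_33\t49448\n"
    "contig_34\tgroup_19\tcontig_33\t4484\n"
)


def group_votus(handle):
    """Return (members per vOTU, representative per vOTU)."""
    members = defaultdict(list)
    representatives = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        members[row["vOTU"]].append(row["Sequence"])
        representatives[row["vOTU"]] = row["Representative"]
    return dict(members), representatives


members, representatives = group_votus(io.StringIO(SAMPLE))
print(members)          # {'group_19': ['contig_33', 'contig_34']}
print(representatives)  # {'group_19': 'contig_33'}
```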

📙 Phylogenetic tree

final_prediction
├── combined_marker.msa (if msa =='Y')    || concatenated MSA of the different markers
├── combined.tree (if tree == 'Y')        || phylogenetic tree based on FastTree
└── tree_supplementary
    └── finded_marker_xxx_combined_db.fa  || the found markers and the database markers

The metadata for the reference proteins can be found in the PhaBOX database at phabox_db_v2/marker/marker_stats.tsv.

📓 Special usage of PhaBOX outputs

Some PhaBOX outputs can be used directly to draw figures for your research. A few examples are shown below.

Protein visualization

The protein annotation file gene_annotation.tsv can be used to plot the protein organization of each contig with PyGenomeViz.

# Make sure PyGenomeViz is installed:
# pip install pygenomeviz

import pandas as pd
from pygenomeviz import GenomeViz

# Load the data
data = pd.read_csv('gene_annotation.tsv', sep='\t')

# Initialize GenomeViz
gv = GenomeViz()

# Add one feature track per genome, then add each ORF to its track
for genome, orfs in data.groupby('Genome', sort=False):
    # gene_annotation.tsv has no genome length column, so the largest End
    # coordinate is used as the track size (an approximation)
    track = gv.add_feature_track(genome, int(orfs['End'].max()))
    for _, row in orfs.iterrows():
        track.add_feature(
            int(row['Start']),
            int(row['End']),
            int(row['Strand']),
            label=row['Annotation'],
        )

# Save the figure to a file
gv.savefig('genome_viz.png')

Network visualization

The network files xxx_network_edges.tsv and xxx_network_nodes.tsv can be imported into Cytoscape.

Step 1: Import the network (edges):
    - Go to File > Import > Network from File
    - Select xxx_network_edges.tsv
Step 2: Import the node table:
    - Go to File > Import > Table from File
    - Select xxx_network_nodes.tsv
Step 3: Adjust the visualization:
    - Choose a layout
    - Color the nodes

Tree visualization

The tree file combined.tree can be used as input to iTOL. The metadata of the reference genes can be found in the PhaBOX2 database at marker/marker_stats.tsv.
