Phage-Host Interaction Search Tool
A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of k-mers shared between their sequences.
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
cd PHIST
make
./phist.py ./example/virus ./example/host ./out/
PHIST uses Kmer-db as a submodule, so the repository must be cloned recursively:
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
Under Linux/OS X, the package can be built by running make in the project directory (tested with G++ 5.3):
cd PHIST
make
Under Windows, one has to build the Visual Studio 2015 solutions in the kmer-db and utils subdirectories (use the Release 64-bit configuration, as the Python script depends on the default VS output directory structure).
PHIST takes as input genomic sequences of viruses and candidate hosts in FASTA files (gzipped or not). Virus genomes may be provided in a single FASTA file or in a directory containing multiple FASTA files (one genome per file). Candidate host genomes should be stored individually in a directory (one genome per FASTA file) (see example).
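If candidate host genomes arrive as one multi-FASTA file, they need to be split into one file per genome before being passed as the host directory. The following is a minimal sketch, not part of PHIST: `open_fasta` and `split_multifasta` are hypothetical helper names, each FASTA record is assumed to be one complete genome, and output files are named after the first token of the header line (records with identical names would overwrite each other).

```python
import gzip
import re
from pathlib import Path


def open_fasta(path):
    """Open a plain or gzipped FASTA file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def split_multifasta(fasta_path, out_dir):
    """Write every record of a multi-FASTA file to its own file in out_dir.

    Hypothetical helper; assumes one record equals one genome.
    Returns the list of files written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open_fasta(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                tokens = line[1:].split()
                name = tokens[0] if tokens else "unnamed"
                # Replace characters that are unsafe in file names.
                name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
                target = out_dir / f"{name}.fna"
                out = open(target, "w")
                written.append(target)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Genomes consisting of several contigs should not be split this way, since every contig would end up being treated as a separate host.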
./phist.py [options] <virus_path> <host_dir> <out_dir>
Positional arguments:
virus_path
  Input FASTA file or directory with FASTA files (plain or gzipped)
host_dir
  Input directory with host FASTA files (plain or gzipped)
out_dir
  Output directory (will be created if it does not exist)
Options:
-k <kmer-length>
  k-mer length (default: 25, max: 30)
-t <num-threads>
  Number of threads (default: number of cores)
-h, --help
  Show this help message and exit
--keep_temp
  Keep temporary kmer-db files (default: False)
--version
  Show tool's version number and exit
./phist.py example/virus/ example/host/ out/
./phist.py example/virus_multifasta.fna example/host/ out/
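When PHIST is driven from another script, the command line can be assembled programmatically. The sketch below is only an illustration: `build_phist_command` is a hypothetical helper, it uses only the options listed above, and the check that k is positive is an assumption (the help text states only the maximum of 30).

```python
def build_phist_command(virus_path, host_dir, out_dir,
                        k=None, threads=None, keep_temp=False,
                        script="./phist.py"):
    """Return the argument list for one PHIST run (hypothetical helper)."""
    cmd = [script]
    if k is not None:
        # The documented maximum is 30; the lower bound is an assumption.
        if not 0 < k <= 30:
            raise ValueError("k must be between 1 and 30")
        cmd += ["-k", str(k)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if keep_temp:
        cmd.append("--keep_temp")
    cmd += [virus_path, host_dir, out_dir]
    return cmd


# Typical use (requires a built PHIST checkout in the working directory):
#   import subprocess
#   subprocess.run(build_phist_command("example/virus/", "example/host/", "out/"),
#                  check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.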
PHIST outputs two CSV files: one containing a table of k-mers common to phages and hosts, and a second with virus-host predictions.
The common_kmers.csv file stores the numbers of k-mers shared between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero k-mer counts are represented as pairs (column_number : value) with 1-based column indexing. Rows may therefore have different numbers of elements, e.g.:
kmer-length: k fraction: f | phages | φ1 | φ2 | ... | φn |
hosts | total-kmers | |φ1| | |φ2| | ... | |φn| |
h1 | |h1| | i11 : |h1 ∩ φi11| | i12 : |h1 ∩ φi12| |
h2 | |h2| | i21 : |h2 ∩ φi21| | i22 : |h2 ∩ φi22| | i23 : |h2 ∩ φi23| |
h3 | |h3| |
... | ... | ... |
hm | |hm| | im1 : |hm ∩ φim1| |
where:
- k - k-mer length,
- φ1, φ2, ..., φn - phage names,
- h1, h2, ..., hm - host names,
- |a| - number of k-mers in sample a,
- |a ∩ b| - number of k-mers common for samples a and b.
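The sparse layout described above can be expanded into a per-host dictionary with a few lines of Python. This is a sketch under assumptions that the text does not confirm: cells are taken to be comma-separated, the first two rows are taken to be the header rows shown in the table, and column number 1 is taken to refer to the first phage. `parse_common_kmers` is a hypothetical name, not a PHIST function.

```python
import csv
import io


def parse_common_kmers(handle):
    """Parse a sparse common-k-mers table from an open text handle.

    Returns (phages, host_sizes, shared), where shared maps
    host -> {phage name: number of common k-mers}.
    """
    reader = csv.reader(handle)
    header = [cell.strip() for cell in next(reader)]
    next(reader)  # second header row: total k-mer counts of the phages
    # Phage names start in the third cell of the first row.
    phages = [cell for cell in header[2:] if cell]

    host_sizes = {}
    shared = {}
    for row in reader:
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue
        host = cells[0]
        host_sizes[host] = int(cells[1])
        counts = {}
        for cell in cells[2:]:
            if not cell:
                continue  # omitted zeros leave no cell or an empty one
            column, value = cell.split(":")
            # Column numbers are 1-based.
            counts[phages[int(column) - 1]] = int(value)
        shared[host] = counts
    return phages, host_sizes, shared


# Small in-memory example with made-up names and counts.
example = io.StringIO(
    "kmer-length: 25 fraction: 1,phages,p1,p2,p3\n"
    "hosts,total-kmers,100,200,300\n"
    "h1,5000,1:7,3:2\n"
    "h2,6000\n"
)
phages, host_sizes, shared = parse_common_kmers(example)
```

A host with no shared k-mers, such as `h2` in the example, simply maps to an empty dictionary.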
The predictions.csv file assigns each phage to its most likely host (i.e., the one sharing the most k-mers with it). If multiple potential hosts have the same number of common k-mers, all of them are reported. Each virus-host interaction is followed by its p-value and its p-value adjusted for multiple comparisons.
phage | host | common k-mers | p-value | adj. p-value |
---|---|---|---|---|
φ1 | host(φ1) | |φ1 ∩ host(φ1)| | ... | ... |
φ2 | host(φ2) | |φ2 ∩ host(φ2)| | ... | ... |
φ3 | host1(φ3) | |φ3 ∩ host1(φ3)| | ... | ... |
φ3 | host2(φ3) | |φ3 ∩ host2(φ3)| | ... | ... |
... | ... | ... | ... | ... |
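The selection rule described above (pick the host with the most common k-mers and report every host on a tie) can be illustrated as follows. This is a sketch of the rule only, not PHIST's implementation: `best_hosts` is a hypothetical function, its input is assumed to be a mapping of host to per-phage common k-mer counts, and no p-values are computed.

```python
def best_hosts(shared):
    """Pick the top-scoring host(s) for every phage.

    shared maps host -> {phage: number of common k-mers}.
    Returns phage -> (best count, sorted list of hosts reaching it).
    """
    best = {}
    for host, counts in shared.items():
        for phage, n in counts.items():
            top, hosts = best.get(phage, (0, []))
            if n > top:
                best[phage] = (n, [host])   # new best host
            elif n == top and n > 0:
                hosts.append(host)          # tie: keep every host
    return {phage: (n, sorted(hosts)) for phage, (n, hosts) in best.items()}


# Made-up counts: p1 is tied between h1 and h2.
shared = {
    "h1": {"p1": 7, "p3": 2},
    "h2": {"p1": 7},
    "h3": {"p2": 9, "p3": 1},
}
predictions = best_hosts(shared)
```

Here `p1` is assigned to both `h1` and `h2`, mirroring the two rows for φ3 in the table above.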
The utils/matcher tool retrieves the list of all exact matches of length >= k for a given pair of phage and host FASTA sequences. The matches are provided with their coordinates in the viral and the corresponding bacterial genome (a reversed interval in the latter indicates a reverse complement match).
./utils/matcher [options] <virus> <host> <output>
Positional arguments:
virus
  Virus FASTA file (gzipped or not)
host
  Host FASTA file (gzipped or not)
output
  Output CSV file
Options:
-k --k <kmer-length>
  k-mer length (default: 25, max: 30; may differ from the one used in the PHIST run)
./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv
example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:52970-53009,NC_017548.1:1456845-1456806
NC_024123.1:53011-53102,NC_017548.1:1456804-1456713
NC_024123.1:53107-53147,NC_017548.1:1456708-1456668
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
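Each match line in the output above can be parsed into coordinates, from which the match length and orientation follow. The sketch below assumes that coordinates are inclusive (an inference from the example, where the shortest match spans exactly 25 positions, the default k) and that the first line of the file is a header holding the two input paths. `Match`, `parse_interval`, and `parse_match` are hypothetical names, not part of PHIST.

```python
from typing import NamedTuple


class Match(NamedTuple):
    virus_id: str
    virus_start: int
    virus_end: int
    host_id: str
    host_start: int
    host_end: int

    @property
    def length(self):
        # Assumes inclusive coordinates.
        return abs(self.virus_end - self.virus_start) + 1

    @property
    def reverse_complement(self):
        # A reversed host interval marks a reverse complement match.
        return self.host_start > self.host_end


def parse_interval(text):
    """Split 'ID:start-end' into (ID, start, end)."""
    seq_id, _, span = text.rpartition(":")
    start, end = span.split("-")
    return seq_id, int(start), int(end)


def parse_match(line):
    """Parse one 'virus_interval,host_interval' line of matcher output."""
    virus, host = line.strip().split(",")
    return Match(*parse_interval(virus), *parse_interval(host))


match = parse_match("NC_024123.1:52942-52968,NC_017548.1:1456873-1456847")
```

When reading a whole file, skip the first line before calling `parse_match`, since the header contains no coordinates.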
Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics. 2022, 38(5):1447-9. doi:10.1093/bioinformatics/btab837.