PlotMI is a program for visualization of pairwise interactions in a set of input sequences by computing the pairwise mutual information between positional k-mer distributions. Description of the method is in the following preprint:
Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: visualization of pairwise interactions and positional preferences learned by a deep learning model from sequence data", bioRxiv 2021.03.14.435285; doi: https://doi.org/10.1101/2021.03.14.435285
All scripts are pure Python, so no compiling is needed. Easiest way is to clone the repository to your own computer. In a desired location, type:
git clone https://github.com/hartonen/plotMI.git
The scripts in this repository need specific Python packages to function properly. The easiest way to make sure everything works is to create a virtual environment (https://docs.python.org/3/library/venv.html#module-venv) containing the tested versions of each package and then run the scripts in this environment. This is done by first creating a new virtual environment:
python3 -m venv /path/to/new/virtual/environment
Then one needs to install all the required packages. These are listed in data/plotMI_requirements.txt
. So activate the new virtual environment:
source /path/to/new/virtual/environment/bin/activate
and install the packages with pip:
pip install -r data/plotMI_requirements.txt
Help message can be evoked by typing:
plotMI.py -h
usage: plotMI.py [-h] [--outdir OUTDIR] [--seqs SEQS]
[--distance {MI,JS,JS_inv,BC,BC_inv,HE}] [--nproc NPROC]
[--figtype {pdf,png}] [--k K] [--v {0,1}] [--p P]
[--alphabet ALPHABET] [--colorscale {lin,log}]
[--minmi MINMI] [--step STEP] [--save_distributions {yes,no}]
[--randomized_pairs RANDOMIZED_PAIRS]
optional arguments:
-h, --help show this help message and exit
--outdir OUTDIR Full path to the output directory.
--seqs SEQS Full path to the plain text input sequence file. Each
sequence must be of same length and on its separate
line.
--distance {MI,JS,JS_inv,BC,BC_inv,HE}
Distance used to compare positional k-mer
distributions. MI=mutual information (default),
JS=Jensen-Shannon divergence, JS_inv=inverted Jensen-
Shannon distance, BC=Bhattacharyya distance,
BC_inv=inverted Bhattacharyya distance, HE=Hellinger
distance.
--nproc NPROC Number of parallel processes used when computing MI
(default=1).
--figtype {pdf,png} png or pdf (default=png).
--k K length of k-mer distributions used to calculate MI
(default=3).
--v {0,1} Verbosity level, 0=none, 1=print info on screen
(default=1).
--p P Multiplier for pseudocount mass added to k-mer count.
Total pseudocount mass added is p*(number of
sequences) (default=5).
--alphabet ALPHABET A string containing each individual letter in the
alphabet used (default=ACGT). NOTE! This is case-
sensitive.
--colorscale {lin,log}
If set to log, colormap is scaled logarithmically
(default=lin, meaning linear scaling).
--minmi MINMI Set minimum value for colormap, helpful if you want to
be sure that the minimum value is 0 (default=minimum
value in MI matrix).
--step STEP Step size for axis ticks in MI-plot (default=20).
--save_distributions {yes,no}
If yes, save the positional and pairwise k-mer
distributions and the MI contributions from each k-mer
and position pair into a separate file (default=no).
Note that this is a large file, possibly many GBs.
--randomized_pairs RANDOMIZED_PAIRS
EXPERIMENTAL FEATURE: File containing list of position
pairs whose k-mer distributions will be randomized to
flat uniform distribution according to given alphabet
before computing MI. Position indices should start
from 0 and be saved in tab-separated format with one
pair on a single row.
PlotMI accepts sequences in plain text format where each row is separated from each other by line change. Note that all rows must be of equal length and each letter that appears in the input sequences must be present in the string given with the --alphabet
flag. For example, if we set --alphabet ACGT
, plotMI will not be able to handle sequences that have a letter N
in them.
The most basic use case for plotMI only requires input sequences that have been filtered using a machine learning model. Here as an example, we reproduce Figure 1d from the plotMI manuscript using random uniform DNA sequences that have been filtered using a convolutional neural network model that has been trained to recognize human promoter sequences (see the manuscript for details). The file data/model-492-0.882.h5-random-uniform-1M-prob-more-09.seq
contains 49,261 sequences that have a probability >0.9 of being active human promoters based on the CNN model. Using these sequences, we can visualize the interactions learned by the human promoter CNN model with:
plotMI.py --outdir ./test- --seqs model-492-0.882.h5-random-uniform-1M-prob-more-09.seq --nproc 4 --figtype png
Alphabet: ACTG
Read in 49261 sequences of length 100 in 0.04489850997924805 seconds.
Calculated the positional 3-mer frequencies in 2.0755910873413086 seconds.
Computed mutual information in 83.16843390464783 seconds.
Plotting done in 0.4934566020965576 seconds.
This should produce two output files identical to files data/test-MI.png
and data/test-MI.txt.gz
.
In the following we show how to replicate the figures 2a-c from the plotMI-manuscript. For this we will need two other Python scripts from the authors from, randomReads and promoterAnalysis repositories. We will also use the Seqkit tool for manipulating fasta-files.
First we generate 10 million random DNA sequences (10 separate files for faster scoring with the CNN model) with uniform nucleotide background using a script made for this purpose:
for i in {1..10}; do randomReads.py random_uniform_L200_N1M_sample"$i".fasta --L 200 --N 1000000; done;
Next we use the pre-trained CNN model to score all these sequences. The pre-trained model can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.5508698
for i in {1..10}; do scorePromoters.py --outfile model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_preds.txt --model model-36-0.990.h5 --sequences random_uniform_L200_N1M_sample"$i".fasta --nproc 4 & done;
Then select the fasta-IDs of sequences that score higher than 0.9 according to the model:
for i in {1..10}; do awk '$2>0.9' model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_preds.txt | cut -f1,1 > model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_prob_more_09_ids.txt; done;
Then we use the wonderful seqkit-package to extract the sequences matching to these IDs and convert them to sequence-only input for mutual information plotting:
for i in {1..10}; do seqkit grep -n -f model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_prob_more_09_ids.txt random_uniform_L200_N1M_sample"$i".fasta | seqkit seq --seq > model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_prob_more_09.seq; done;
Combine the sequences to a single file for MI plotting:
cat model-36-0.990.h5-random_uniform_L200_N1M_sample*_prob_more_09.seq > model-36-0.990.h5-random_uniform_L200_N10M_prob_more_09.seq
These sequences can then be visualized using plotMI (Figure 1b):
plotMI.py --outdir model-36-0.990.h5-random_uniform_L200_N1M0_prob_more_09- --seqs model-36-0.990.h5-random_uniform_L200_N10M_prob_more_09.seq --nproc 16 --figtype png --k 3 --v 1 --p 5
To recreate Figures 2 b and c, we use the helper script plotDiagonals.py
from the plotMI repository, which reads in the MI-matrix (flag --MI) and plots the main diagonal as well as the average and maximum MI at each diagonal of the MI matrix:
plotDiagonals.py --outdir model-36-0.990.h5-random_uniform_L200_N1M0_prob_more_09- --MI model-36-0.990.h5-random_uniform_L200_N1M0_prob_more_09-MI.txt.gz --k 3
By default plotMI will output three files: an image that contains the MI plot, a gzipped tab-delimited text-file that contains the corresponding MI matrix and a log file that contains only the command used to run plotMI. Optionally, one can set the flag --save_distribution yes
, which will then output the estimated probabilities and MI contributions for each k-mer and position pair. Note that depending on the length of the model, this is a large file (>1GB) An example of ten first lines of this file:
#i j a b MI_ij(a,b) P_ij(a,b) P_i(a) P_j(b)
0 3 GGG GGG -3.475747040270734e-05 0.00020345052083333334 0.01510622461859291 0.015161103336626056
0 3 GGG GGC 4.352340047630113e-05 0.0002583292388664801 0.01510622461859291 0.015215982054659204
0 3 GGG GGT -4.621769658297288e-05 0.00020345052083333334 0.01510622461859291 0.01576476923499067
0 3 GGG GGA -4.926708687151659e-05 0.00020345052083333334 0.01510622461859291 0.01592940538909011
0 3 GGG GCG 0.00012695982387922662 0.00031320795689962684 0.01510622461859291 0.015655011798924378
0 3 GGG GCC 2.516479878216178e-05 0.0002583292388664801 0.01510622461859291 0.015984284107123256
0 3 GGG GCT -3.155270299332963e-05 0.00020345052083333334 0.01510622461859291 0.014996467182526617
0 3 GGG GCA 2.516479878216178e-05 0.0002583292388664801 0.01510622461859291 0.015984284107123256
0 3 GGG GTG 4.893921467700538e-05 0.0002583292388664801 0.01510622461859291 0.014996467182526617