Skip to content

Visualize interactions learned by a machine learning model by plotting mutual information between pairwise k-mer distributions

Notifications You must be signed in to change notification settings

hartonen/plotMI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

50 Commits
 
 
 
 
 
 
 
 

Repository files navigation

PlotMI

PlotMI is a program for visualization of pairwise interactions in a set of input sequences by computing the pairwise mutual information between positional k-mer distributions. Description of the method is in the following preprint:

Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: visualization of pairwise interactions and positional preferences learned by a deep learning model from sequence data", bioRxiv 2021.03.14.435285; doi: https://doi.org/10.1101/2021.03.14.435285

1. Installation

All scripts are pure Python, so no compiling is needed. Easiest way is to clone the repository to your own computer. In a desired location, type:

git clone https://github.com/hartonen/plotMI.git

The scripts in this repository need specific Python packages to function properly. The easiest way to make sure everything works is to create a virtual environment (https://docs.python.org/3/library/venv.html#module-venv) containing the tested versions of each package and then run the scripts in this environment. This is done by first creating a new virtual environment:

python3 -m venv /path/to/new/virtual/environment

Then one needs to install all the required packages. These are listed in data/plotMI_requirements.txt. So activate the new virtual environment:

source /path/to/new/virtual/environment/bin/activate

and install the packages with pip:

pip install -r data/plotMI_requirements.txt

2. Usage

Help message can be evoked by typing:

plotMI.py -h
usage: plotMI.py [-h] [--outdir OUTDIR] [--seqs SEQS]
                 [--distance {MI,JS,JS_inv,BC,BC_inv,HE}] [--nproc NPROC]
                 [--figtype {pdf,png}] [--k K] [--v {0,1}] [--p P]
                 [--alphabet ALPHABET] [--colorscale {lin,log}]
                 [--minmi MINMI] [--step STEP] [--save_distributions {yes,no}]
                 [--randomized_pairs RANDOMIZED_PAIRS]

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR       Full path to the output directory.
  --seqs SEQS           Full path to the plain text input sequence file. Each
                        sequence must be of same length and on its separate
                        line.
  --distance {MI,JS,JS_inv,BC,BC_inv,HE}
                        Distance used to compare positional k-mer
                        distributions. MI=mutual information (default),
                        JS=Jensen-Shannon divergence, JS_inv=inverted Jensen-
                        Shannon distance, BC=Bhattacharyya distance,
                        BC_inv=inverted Bhattacharyya distance, HE=Hellinger
                        distance.
  --nproc NPROC         Number of parallel processes used when computing MI
                        (default=1).
  --figtype {pdf,png}   png or pdf (default=png).
  --k K                 length of k-mer distributions used to calculate MI
                        (default=3).
  --v {0,1}             Verbosity level, 0=none, 1=print info on screen
                        (default=1).
  --p P                 Multiplier for pseudocount mass added to k-mer count.
                        Total pseudocount mass added is p*(number of
                        sequences) (default=5).
  --alphabet ALPHABET   A string containing each individual letter in the
                        alphabet used (default=ACGT). NOTE! This is case-
                        sensitive.
  --colorscale {lin,log}
                        If set to log, colormap is scaled logarithmically
                        (default=lin, meaning linear scaling).
  --minmi MINMI         Set minimum value for colormap, helpful if you want to
                        be sure that the minimum value is 0 (default=minimum
                        value in MI matrix).
  --step STEP           Step size for axis ticks in MI-plot (default=20).
  --save_distributions {yes,no}
                        If yes, save the positional and pairwise k-mer
                        distributions and the MI contributions from each k-mer
                        and position pair into a separate file (default=no).
                        Note that this is a large file, possibly many GBs.
  --randomized_pairs RANDOMIZED_PAIRS
                        EXPERIMENTAL FEATURE: File containing list of position
                        pairs whose k-mer distributions will be randomized to
                        flat uniform distribution according to given alphabet
                        before computing MI. Position indices should start
                        from 0 and be saved in tab-separated format with one
                        pair on a single row.

2.1 Input sequence format

PlotMI accepts sequences in plain text format where each row is separated from each other by line change. Note that all rows must be of equal length and each letter that appears in the input sequences must be present in the string given with the --alphabet flag. For example, if we set --alphabet ACGT, plotMI will not be able to handle sequences that have a letter N in them.

3. Usage examples

3.1 Simple example with pre-filtered input data

The most basic use case for plotMI only requires input sequences that have been filtered using a machine learning model. Here as an example, we reproduce Figure 1d from the plotMI manuscript using random uniform DNA sequences that have been filtered using a convolutional neural network model that has been trained to recognize human promoter sequences (see the manuscript for details). The file data/model-492-0.882.h5-random-uniform-1M-prob-more-09.seq contains 49,261 sequences that have a probability >0.9 of being active human promoters based on the CNN model. Using these sequences, we can visualize the interactions learned by the human promoter CNN model with:

plotMI.py --outdir ./test- --seqs model-492-0.882.h5-random-uniform-1M-prob-more-09.seq --nproc 4 --figtype png
Alphabet: ACTG
Read in 49261 sequences of length 100 in 0.04489850997924805 seconds.
Calculated the positional 3-mer frequencies in 2.0755910873413086 seconds.
Computed mutual information in 83.16843390464783 seconds.
Plotting done in 0.4934566020965576 seconds.

This should produce two output files identical to files data/test-MI.png and data/test-MI.txt.gz.

3.2 Reproducing figure 2a from plotMI-manuscript

In the following we show how to replicate the figures 2a-c from the plotMI-manuscript. For this we will need two other Python scripts from the authors from, randomReads and promoterAnalysis repositories. We will also use the Seqkit tool for manipulating fasta-files.

First we generate 10 million random DNA sequences (10 separate files for faster scoring with the CNN model) with uniform nucleotide background using a script made for this purpose:

for i in {1..10}; do randomReads.py random_uniform_L200_N1M_sample"$i".fasta --L 200 --N 1000000; done;

Next we use the pre-trained CNN model to score all these sequences. The pre-trained model can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.5508698

for i in {1..10}; do scorePromoters.py --outfile model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_preds.txt --model model-36-0.990.h5 --sequences random_uniform_L200_N1M_sample"$i".fasta --nproc 4 & done;

Then select the fasta-IDs of sequences that score higher than 0.9 according to the model:

for i in {1..10}; do awk '$2>0.9' model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_preds.txt | cut -f1,1 > model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_prob_more_09_ids.txt; done;

Then we use the wonderful seqkit-package to extract the sequences matching to these IDs and convert them to sequence-only input for mutual information plotting:

for i in {1..10}; do seqkit grep -n -f model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_prob_more_09_ids.txt random_uniform_L200_N1M_sample"$i".fasta | seqkit seq --seq > model-36-0.990.h5-random_uniform_L200_N1M_sample"$i"_prob_more_09.seq; done;

Combine the sequences to a single file for MI plotting:

cat model-36-0.990.h5-random_uniform_L200_N1M_sample*_prob_more_09.seq > model-36-0.990.h5-random_uniform_L200_N10M_prob_more_09.seq

These sequences can then be visualized using plotMI (Figure 1b):

plotMI.py --outdir model-36-0.990.h5-random_uniform_L200_N1M0_prob_more_09- --seqs model-36-0.990.h5-random_uniform_L200_N10M_prob_more_09.seq --nproc 16 --figtype png --k 3 --v 1 --p 5

To recreate Figures 2 b and c, we use the helper script plotDiagonals.py from the plotMI repository, which reads in the MI-matrix (flag --MI) and plots the main diagonal as well as the average and maximum MI at each diagonal of the MI matrix:

plotDiagonals.py --outdir model-36-0.990.h5-random_uniform_L200_N1M0_prob_more_09- --MI model-36-0.990.h5-random_uniform_L200_N1M0_prob_more_09-MI.txt.gz --k 3

4. Output description

By default plotMI will output three files: an image that contains the MI plot, a gzipped tab-delimited text-file that contains the corresponding MI matrix and a log file that contains only the command used to run plotMI. Optionally, one can set the flag --save_distribution yes, which will then output the estimated probabilities and MI contributions for each k-mer and position pair. Note that depending on the length of the model, this is a large file (>1GB) An example of ten first lines of this file:

#i      j       a       b       MI_ij(a,b)      P_ij(a,b)       P_i(a)  P_j(b)
0       3       GGG     GGG     -3.475747040270734e-05  0.00020345052083333334  0.01510622461859291     0.015161103336626056
0       3       GGG     GGC     4.352340047630113e-05   0.0002583292388664801   0.01510622461859291     0.015215982054659204
0       3       GGG     GGT     -4.621769658297288e-05  0.00020345052083333334  0.01510622461859291     0.01576476923499067
0       3       GGG     GGA     -4.926708687151659e-05  0.00020345052083333334  0.01510622461859291     0.01592940538909011
0       3       GGG     GCG     0.00012695982387922662  0.00031320795689962684  0.01510622461859291     0.015655011798924378
0       3       GGG     GCC     2.516479878216178e-05   0.0002583292388664801   0.01510622461859291     0.015984284107123256
0       3       GGG     GCT     -3.155270299332963e-05  0.00020345052083333334  0.01510622461859291     0.014996467182526617
0       3       GGG     GCA     2.516479878216178e-05   0.0002583292388664801   0.01510622461859291     0.015984284107123256
0       3       GGG     GTG     4.893921467700538e-05   0.0002583292388664801   0.01510622461859291     0.014996467182526617

About

Visualize interactions learned by a machine learning model by plotting mutual information between pairwise k-mer distributions

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages