BMF User Guide

Summary

Knowing the basis of protein-RNA recognition is essential to understanding regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow binding to the RNA molecule in a multivalent manner. Through cooperative binding these proteins can reach higher specificity and affinity than those of single RNA-binding domains. However, current approaches to de novo RNA motif discovery do not take the modularity of binding events into account. Here we present Bipartite Motif Finder (BMF), an RNA motif finder that is based on a thermodynamic model of an RBP with two binding sites acting cooperatively in targeting an RNA molecule. We show that bipartite binding is a common strategy among RBPs to achieve higher levels of sequence specificity. We furthermore illustrate that the spatial geometry between the two binding sites can be learnt from bound RNA sequences and that this information enhances the model's accuracy in predicting new binding sites. These bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions.

BMF is also available as a webserver:

Link: bmf.soedinglab.org
Web server repository: soedinglab/bmf-webserver

System requirements

BMF requires a processor that supports the AVX2 instruction set extension. You can check whether AVX2 is supported by executing cat /proc/cpuinfo | grep avx2 on Linux, or sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2 on macOS.
python>=3.6
numpy
cython
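
The Linux AVX2 check above can also be scripted. The following is a minimal sketch, not part of BMF: the function name has_avx2 is hypothetical, and it assumes the /proc/cpuinfo layout in which CPU features are listed on lines starting with "flags".

```python
def has_avx2(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only inspect the CPU feature list, not e.g. the model name.
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print("AVX2 supported:", has_avx2(handle.read()))
    except FileNotFoundError:
        # /proc/cpuinfo exists on Linux only; use the sysctl command on macOS.
        print("/proc/cpuinfo not found; this check only works on Linux")
```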

Installation

To get the requirements, you can create a new conda environment with python, numpy, and cython:

conda create -n bmf python=3.6 numpy cython
conda activate bmf

Install BMF with pip:

pip install https://github.com/soedinglab/bipartite_motif_finder/releases/download/v1.0.0a/bmf_tool-1.0.0.tar.gz

See BMF help page:

bmf --help

BMF guide

BMF has three main functionalities: (1) learning de novo bipartite motifs from enriched and background sequence sets, (2) plotting the motif and predicting if the RNA-binding protein has a bipartite motif or not, and (3) using the trained BMF model to predict binding to new sequences.

In the following sections, we describe how to use BMF to perform each of these functionalities.

Motif discovery

You can call the command-line tool bmf to perform de novo motif discovery. Here is a list of parameters that you can pass to bmf for training:

positional arguments:

  sequences             path to positive sequences enriched with the
                        motif.

compulsory arguments:

  --BGsequences BGSEQUENCES
                        path to background sequences.

optional arguments:

  --input_type {fasta,fastq,seq}
                        format of input sequences. Can be "fasta", "fastq",
                        or "seq". Default value is "fasta".

  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif.
                        Default value is 3.

  --no_tries NO_TRIES   the number of times the program is run with random
                        initializations.
                        Default value is 1000.

  --output_prefix OUTPUT_PREFIX
                        output file prefix. You can specify a directory e.g. 
                        "--output_prefix output_dir/my_prefix"
                        Default value is "bipartite".

  --var_thr VAR_THR     variability threshold condition to stop ADAM

  --batch_size BATCH_SIZE
                        the number of sequences processed in each batch.
                        Default value is 512.

  --max_iterations MAX_ITERATIONS
                        max number of iterations before stopping ADAM.
                        Default value is 1000.

  --no_cores NO_CORES   the numbers of CPU cores used
                        Default value is 4.
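
As an illustration of how these arguments combine, the sketch below assembles a training command from the flags listed above. The helper build_training_command and the file names enriched.fasta and background.fasta are hypothetical placeholders; only the flag names come from the parameter list, and the command is printed rather than executed.

```python
import shlex


def build_training_command(positives, background, input_type="fasta",
                           motif_length=3, no_tries=5,
                           output_prefix="bipartite"):
    """Assemble the argument list for a bmf training run."""
    return [
        "bmf", positives,
        "--BGsequences", background,
        "--input_type", input_type,
        "--motif_length", str(motif_length),
        "--no_tries", str(no_tries),
        "--output_prefix", output_prefix,
    ]


# Hypothetical input files; replace them with your own paths.
cmd = build_training_command("enriched.fasta", "background.fasta",
                             output_prefix="output_dir/my_prefix")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To launch the run from Python instead of printing it, the same list can be passed to subprocess.run(cmd, check=True).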

Run BMF with multiple random initializations

You can run BMF with n random parameter initializations by specifying --no_tries n. Even though BMF is robust to parameter initialization in most cases, running several initializations makes it more likely that the model with the best likelihood is found. In our manuscript we ran BMF with --no_tries 5. The BMF workflow is designed so that, when multiple initializations are performed, the solution with the best likelihood is used to generate the sequence logo and to predict binding.

Output file name

You can specify the output file name with --output_prefix path-to-file/file-name. BMF will generate the following outputs for each round of parameter initialization i:

Plots of parameter changes over iterations and the training set ROC curve: path-to-file/file-name_cs{motif_length}_{i}.pdf & .png
Model parameters: path-to-file/file-name_cs{motif_length}_{i}.txt
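
To collect the parameter files of all initializations after a run, one option is to match the naming pattern above with a glob. This is a minimal sketch: the helper find_parameter_files is hypothetical, and the demonstration creates empty placeholder files with indices 0 to 2, which is an assumption (the guide does not state where i starts).

```python
import glob
import os
import tempfile


def find_parameter_files(output_prefix, motif_length=3):
    """Return sorted files matching {prefix}_cs{motif_length}_{i}.txt."""
    # glob.escape protects prefixes that contain characters such as [ or *.
    pattern = f"{glob.escape(output_prefix)}_cs{motif_length}_*.txt"
    return sorted(glob.glob(pattern))


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "my_prefix")
    for i in range(3):  # assumed indices, for illustration only
        open(f"{prefix}_cs3_{i}.txt", "w").close()
    open(f"{prefix}_cs3_0.pdf", "w").close()  # plot file, must be ignored
    found = [os.path.basename(p) for p in find_parameter_files(prefix)]
    print(found)
```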

Input file formats

You can run BMF with traditional "fasta" and "fastq" file formats. Additionally you can provide just the sequences in the following format which we refer to as "seq":

AGGCTCGGTTACGTGCAGGGCCTGATGTTCTTGATCTGTT
CTTCCAAGGAAGCTTTGACTCACAGAAATGGTAAAGTCCA
TCCCTTCGCTAAGTAGGGACGCCTCGGGCGAGACAATAGC
GAGGTGGGCTCGCGTACCTCACTTACACCATGCGCCTCAT
...

Note: The input sequences should be of equal lengths, and can only consist of the characters: A, C, G, T, U, and N.
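
Before training, it can be useful to check a "seq" input against the two constraints in the note above. The sketch below is not part of BMF; validate_sequences is a hypothetical helper, and it assumes that only upper-case characters are accepted.

```python
ALLOWED = set("ACGTUN")


def validate_sequences(sequences):
    """Raise ValueError on unequal lengths or characters outside ACGTUN."""
    # Drop surrounding whitespace and skip empty lines.
    sequences = [s.strip() for s in sequences if s.strip()]
    if not sequences:
        raise ValueError("no sequences found")
    expected = len(sequences[0])
    for number, seq in enumerate(sequences, start=1):
        if len(seq) != expected:
            raise ValueError(
                f"sequence {number} has length {len(seq)}, expected {expected}")
        bad = set(seq) - ALLOWED
        if bad:
            raise ValueError(
                f"sequence {number} contains invalid characters: {sorted(bad)}")
    return sequences


good = validate_sequences([
    "AGGCTCGGTTACGTGCAGGGCCTGATGTTCTTGATCTGTT",
    "CTTCCAAGGAAGCTTTGACTCACAGAAATGGTAAAGTCCA",
])
print(len(good), "sequences of length", len(good[0]))
```

For a file, the same check can be applied to its lines, for example validate_sequences(open(path)).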

Generating motif logo

You can generate BMF logo plots from the parameter files generated by bmf in the previous step. To do so, call bmf_logo with the following parameters:

positional arguments:
  parameter_prefix      path-to-bmf-param-file that specifies model parameters or
                        when multiple parameters exist, their common root.

optional arguments:
  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif
                        Default value is 3.

Please note that parameter_prefix corresponds to output_prefix in the previous step. When multiple initializations were used, bmf_logo reads all of the parameter files and selects the solution with the best likelihood to generate the motif logo.

The BMF logo plot is stored at {parameter_prefix}_seqLogo.pdf & .png.

Predict binding to new sequences

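You can use the trained BMF model parameters to predict binding scores for new sequences. To do so, run bmf in prediction mode. This guide names the prediction flag --predict here and --test in the argument list that follows; run bmf --help to confirm which form your installed version accepts. Here is a list of parameters that you can pass to bmf for predicting: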

positional arguments:

  sequences             path to test sequences.

compulsory arguments:

  --test

  --model_parameters MODEL_PARAMETERS
                        path to .txt file that specifies model parameters, 
                        or the output_prefix used when training bmf.

optional arguments:

  --input_type {fasta,fastq,seq}
                        format of input sequences. Can be "fasta", "fastq",
                        or "seq". Default value is "fasta".

  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif.
                        Default value is 3.

The binding score for each sequence is saved in the file {model_parameters}.predictions. Note: these values correspond to the sum of statistical weights over all possible configurations, denoted $Z(\mathbf{x})$ for a sequence $\mathbf{x}$. Higher values correspond to a higher binding probability. Based on our thermodynamic model, these values can be converted to binding probabilities with the following formula:

$p(\text{bound} \mid \mathbf{x}) = 1 - \frac{1}{Z(\mathbf{x})}$
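
The conversion can be sketched in a few lines. This example assumes that each reported score equals Z(x) as used in the formula, so that a score of 1 corresponds to a binding probability of 0; the function name binding_probability and the example scores are illustrative, and the layout of the .predictions file is not assumed here.

```python
def binding_probability(z):
    """Convert a score Z(x) into p(bound | x) = 1 - 1/Z(x)."""
    if z < 1.0:
        # A value below 1 would give a negative probability.
        raise ValueError("Z(x) must be at least 1 for a valid probability")
    return 1.0 - 1.0 / z


scores = [1.0, 2.0, 4.0]  # made-up scores for illustration
probabilities = [binding_probability(z) for z in scores]
print(probabilities)  # [0.0, 0.5, 0.75]
```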

License terms

The software is made available under the terms of the GNU General Public License v3.
