Knowing the basis of protein-RNA recognition is essential to understanding regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow binding to the RNA molecule in a multivalent manner. Through cooperative binding these proteins can reach higher specificity and affinity than those of single RNA-binding domains. However, current approaches to de novo RNA motif discovery do not take the modularity of binding events into account. Here we present Bipartite Motif Finder (BMF), an RNA motif finder that is based on a thermodynamic model of an RBP with two binding sites acting cooperatively in targeting an RNA molecule. We show that bipartite binding is a common strategy among RBPs to achieve higher levels of sequence specificity. We furthermore illustrate that the spatial geometry between the two binding sites can be learnt from bound RNA sequences and that this information enhances the model's accuracy in predicting new binding sites. These bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions.
BMF is also available as a webserver:
- Link: bmf.soedinglab.org
- Web server repository: soedinglab/bmf-webserver
- An AVX2-capable processor is needed for the faster build of BMF. You can check whether AVX2 is supported by executing
cat /proc/cpuinfo | grep avx2
on Linux and
sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2
on macOS.
- python>=3.6
- numpy
- cython
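The AVX2 check described above can also be scripted. The following is a minimal sketch, not part of BMF: the function names `cpuinfo_has_avx2` and `has_avx2` are hypothetical, and the sketch assumes the Linux `/proc/cpuinfo` layout (a `flags` line per core) and the same macOS `sysctl` key used in the shell command.

```python
import platform
import subprocess


def cpuinfo_has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line of /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def has_avx2() -> bool:
    """Best-effort AVX2 check mirroring the shell commands (hypothetical helper)."""
    system = platform.system()
    if system == "Linux":
        with open("/proc/cpuinfo") as handle:
            return cpuinfo_has_avx2(handle.read())
    if system == "Darwin":
        # "sysctl -n" prints only the value of the requested key.
        result = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.leaf7_features"],
            capture_output=True,
            text=True,
            check=False,
        )
        return "AVX2" in result.stdout.split()
    return False  # other platforms are not covered by this sketch


if __name__ == "__main__":
    print("AVX2 supported:", has_avx2())
```

If the check reports no AVX2 support, the default (non-AVX2) installation described below still applies.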
Create a new conda environment with python, numpy, and cython:
conda create -n bmf python=3.6 numpy cython
conda activate bmf
Alternatively, on Ubuntu or Debian, install the dependencies with apt-get and pip:
sudo apt-get update
sudo apt-get install python3.6 python3-pip
pip3 install numpy cython
On macOS, install the dependencies with Homebrew and pip:
brew install python3
pip install numpy cython
- Optional: BMF can be compiled as a faster version for AVX2-capable processors. You can check whether AVX2 is supported by executing
cat /proc/cpuinfo | grep avx2
on Linux and
sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2
on macOS. If your processor supports AVX2, run the following command before installing to compile the faster version of BMF:
export USE_AVX=1
- Install BMF with pip:
pip install https://github.com/soedinglab/bipartite_motif_finder/releases/download/v1.0.0a/bmf_tool-1.0.0.tar.gz
See BMF help page:
bmf --help
BMF has three main functionalities: (1) learning de novo bipartite motifs from enriched and background sequence sets, (2) plotting the motif and predicting if the RNA-binding protein has a bipartite motif or not, and (3) using the trained BMF model to predict binding to new sequences.
In the following sections, we describe how to use BMF to perform each of these functionalities.
You can call the command-line tool bmf to perform de novo motif discovery. Here is a list of parameters that you can pass to bmf for training:
positional arguments:
  sequences             path to positive sequences enriched with the
                        motif.
compulsory arguments:
  --BGsequences BGSEQUENCES
                        path to background sequences.
optional arguments:
  --input_type {fasta,fastq,seq}
                        format of input sequences. Can be "fasta", "fastq",
                        or "seq". Default value is "fasta".
  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif.
                        Default value is 3.
  --no_tries NO_TRIES   the number of times the program is run with random
                        initializations.
                        Default value is 1000.
  --output_prefix OUTPUT_PREFIX
                        output file prefix. You can specify a directory e.g.
                        "--output_prefix output_dir/my_prefix"
                        Default value is "bipartite".
  --var_thr VAR_THR     variability threshold condition to stop ADAM
  --batch_size BATCH_SIZE
                        the number of sequences processed in each batch.
                        Default value is 512.
  --max_iterations MAX_ITERATIONS
                        max number of iterations before stopping ADAM.
                        Default value is 1000.
  --no_cores NO_CORES   the numbers of CPU cores used
                        Default value is 4.
You can run BMF with n random parameter initializations by specifying --no_tries n. Although BMF is robust to parameter initialization in most cases, running several initializations makes it more likely that the best likelihood model is found. In our manuscript we ran BMF with --no_tries 5. When multiple initializations are performed, the BMF workflow uses the best likelihood solution to generate the sequence logo and to predict binding.
You can specify the output file name with --output_prefix path-to-file/file-name. BMF will generate the following outputs for each round of parameter initialization i:
- Plots of parameter changes over iterations and training set ROC curve:
path-to-file/file-name_cs{motif_length}_{i}.pdf & .png
- Model parameters
path-to-file/file-name_cs{motif_length}_{i}.txt
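The naming pattern above can be expressed as a small helper for locating the files of a given run. This is an illustrative sketch only: `expected_outputs` is a hypothetical function, not part of BMF, and whether the initialization index `i` starts at 0 or 1 is not stated here, so the caller has to supply it.

```python
from pathlib import Path


def expected_outputs(output_prefix: str, motif_length: int, i: int) -> dict:
    """Build the output paths for initialization i from the documented pattern.

    Pattern: {output_prefix}_cs{motif_length}_{i} with .pdf, .png and .txt suffixes.
    """
    stem = f"{output_prefix}_cs{motif_length}_{i}"
    return {
        "plot_pdf": Path(stem + ".pdf"),    # parameter traces and ROC curve
        "plot_png": Path(stem + ".png"),    # same plot as PNG
        "parameters": Path(stem + ".txt"),  # model parameters
    }


if __name__ == "__main__":
    # The index value 0 is an assumption made for the example.
    for label, path in expected_outputs("output_dir/my_prefix", 3, 0).items():
        print(label, path, "exists" if path.exists() else "missing")
```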
You can run BMF with traditional "fasta" and "fastq" file formats. Additionally you can provide just the sequences in the following format which we refer to as "seq":
AGGCTCGGTTACGTGCAGGGCCTGATGTTCTTGATCTGTT
CTTCCAAGGAAGCTTTGACTCACAGAAATGGTAAAGTCCA
TCCCTTCGCTAAGTAGGGACGCCTCGGGCGAGACAATAGC
GAGGTGGGCTCGCGTACCTCACTTACACCATGCGCCTCAT
...
Note: The input sequences should be of equal lengths, and can only consist of the characters: A, C, G, T, U, and N.
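The two constraints in the note above (equal lengths, restricted alphabet) are easy to check before training. The sketch below is not part of BMF; `validate_sequences` is a hypothetical helper, and treating lowercase letters as invalid is an assumption of this sketch.

```python
ALLOWED = set("ACGTUN")


def validate_sequences(sequences):
    """Return a list of human-readable problems; an empty list means the input passes."""
    seqs = [s.strip() for s in sequences if s.strip()]  # ignore blank lines
    if not seqs:
        return ["no sequences found"]
    problems = []
    expected_length = len(seqs[0])  # the first sequence defines the reference length
    for number, seq in enumerate(seqs, start=1):
        if len(seq) != expected_length:
            problems.append(
                f"sequence {number}: length {len(seq)}, expected {expected_length}"
            )
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            problems.append(f"sequence {number}: invalid characters {''.join(bad)}")
    return problems


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python validate_seq.py my_sequences.seq
    with open(sys.argv[1]) as handle:
        for problem in validate_sequences(handle):
            print(problem)
```

For "fasta" or "fastq" input, the same function can be applied to the sequence lines after the headers and quality lines have been stripped.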
As the next step, you can generate BMF logo plots from the parameter files produced by bmf. To do so, call bmf_logo with the following parameters:
positional arguments:
  parameter_prefix      path-to-bmf-param-file that specifies model parameters or
                        when multiple parameters exist, their common root.
optional arguments:
  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif
                        Default value is 3.
Please note that parameter_prefix corresponds to output_prefix in the previous step. When multiple initializations were used, bmf_logo reads all of them and selects the best likelihood solution to generate the motif logo.
The BMF logo plot is stored at {parameter_prefix}_seqLogo.pdf & .png.
You can use the trained BMF model parameters to predict binding scores for new sequences. To do so, run bmf with --predict. Here is a list of parameters that you can pass to bmf for predicting:
positional arguments:
  sequences             path to test sequences.
compulsory arguments:
  --predict
  --model_parameters MODEL_PARAMETERS
                        path to .txt file that specifies model parameters,
                        or the output_prefix used when training bmf.
optional arguments:
  --input_type {fasta,fastq,seq}
                        format of input sequences. Can be "fasta", "fastq",
                        or "seq". Default value is "fasta".
  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif.
                        Default value is 3.
The binding score for each sequence is saved in the file {model_parameters}.predictions.
Note: these values correspond to the summation of statistical weights over all possible configurations. Higher values correspond to a higher binding probability. Based on our thermodynamic model, these values can be converted to binding probabilities with the following formula:
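As a rough illustration only: in a generic two-state thermodynamic picture, a summed statistical weight Z, measured relative to an unbound state of weight 1, maps to a bound probability Z / (1 + Z). Whether BMF uses exactly this normalization (for example, how protein concentration enters) is an assumption of this sketch, and `binding_probability` is a hypothetical helper rather than a BMF function.

```python
def binding_probability(z: float) -> float:
    """Map a non-negative summed statistical weight to a probability in [0, 1).

    Assumes the unbound state has weight 1 and that any concentration factor
    is already absorbed into z (an assumption, not confirmed by BMF).
    """
    if z < 0:
        raise ValueError("statistical weights must be non-negative")
    return z / (1.0 + z)


if __name__ == "__main__":
    for score in (0.0, 0.5, 1.0, 3.0, 100.0):
        print(score, binding_probability(score))
```

The mapping is monotonic, so it preserves the ranking of sequences given by the raw scores.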
You can find the fasta files needed to run this example in the data directory. Here we run BMF with one random parameter initialization. You can change --no_tries to increase the number of BMF runs with new initial parameter values. In that case, the best likelihood solution is used to plot the BMF logo and to predict binding to new sequences.
You can use bmf in training mode for de novo motif discovery. By default, BMF runs over a maximum of 1000 iterations.
bmf positives_AAA_CCC.fasta --BGsequences negatives_AAA_CCC.fasta --input_type fasta --output_prefix AAA_CCC --motif_length 3 --no_tries 1
You can use bmf_logo to plot the best likelihood motif model generated by BMF. Specify the output_prefix from the previous step to allow bmf_logo to find all associated parameter files. Here we use AAA_CCC to specify the outputs from the previous run:
bmf_logo AAA_CCC --motif_length 3
You can use the trained BMF model parameters to predict binding scores for new sequences. To specify --model_parameters, use the output_prefix from the first step (here AAA_CCC).
bmf test_sequences.fasta --predict --input_type fasta --model_parameters AAA_CCC --output_prefix predict_test_sequences
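The three example steps can be chained from a script. The sketch below only assembles the command lines shown above, using the flags documented in this README; the helper names (`train_command`, `logo_command`, `predict_command`) are hypothetical and not part of BMF. To actually execute a step, each list could be passed to `subprocess.run(cmd, check=True)`, assuming bmf and bmf_logo are installed and on the PATH.

```python
import shlex


def train_command(positives, negatives, output_prefix,
                  motif_length=3, no_tries=1, input_type="fasta"):
    """Argument list for de novo motif discovery with bmf."""
    return [
        "bmf", positives,
        "--BGsequences", negatives,
        "--input_type", input_type,
        "--output_prefix", output_prefix,
        "--motif_length", str(motif_length),
        "--no_tries", str(no_tries),
    ]


def logo_command(parameter_prefix, motif_length=3):
    """Argument list for plotting the best likelihood motif with bmf_logo."""
    return ["bmf_logo", parameter_prefix, "--motif_length", str(motif_length)]


def predict_command(sequences, model_parameters, output_prefix, input_type="fasta"):
    """Argument list for scoring new sequences with a trained model."""
    return [
        "bmf", sequences, "--predict",
        "--input_type", input_type,
        "--model_parameters", model_parameters,
        "--output_prefix", output_prefix,
    ]


if __name__ == "__main__":
    steps = [
        train_command("positives_AAA_CCC.fasta", "negatives_AAA_CCC.fasta", "AAA_CCC"),
        logo_command("AAA_CCC"),
        predict_command("test_sequences.fasta", "AAA_CCC", "predict_test_sequences"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))  # print only; nothing is executed here
```

Building the commands as lists rather than strings avoids shell quoting issues when prefixes or paths contain spaces.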
The software is made available under the terms of the GNU General Public License v3.