rnascan is a (mostly) Python suite to scan RNA sequences and secondary structures with sequence and secondary structure PFMs (position frequency matrices). Secondary structure is represented as weights over different secondary structure contexts, much as a PFM represents weights over different nucleotides or amino acids. Secondary structures can therefore be scanned in the same way that PFMs are used to scan nucleotide sequences, and the representation allows for some flexibility in the structure, as you might find in the Boltzmann distribution of secondary structures.
The secondary structure alphabet is as follows:
- B - bulge loop
- E - external (unpaired) RNA
- H - hairpin loop
- L - left paired RNA (i.e., a '(' in dot-bracket format)
- M - multiloop
- R - right paired RNA (i.e., a ')' in dot-bracket format)
- T - internal loop
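As an illustration of how weights over this alphabet can express structural flexibility, the following is a minimal sketch, not rnascan's own code. It mixes two hypothetical alternative structures of the same RNA, each written as a string over the alphabet above, into one per-position weight profile. The function name `mix`, the example strings, and the 3:1 weights are all assumptions made for the example.

```python
ALPHABET = "BEHLMRT"  # the structure contexts listed above


def mix(structures, weights):
    """Weighted per-position average of equal-length context strings.

    Each structure is a string over ALPHABET; weights are relative
    (they are normalised to sum to 1 here).
    """
    total = float(sum(weights))
    length = len(structures[0])
    profile = [{c: 0.0 for c in ALPHABET} for _ in range(length)]
    for contexts, weight in zip(structures, weights):
        if len(contexts) != length:
            raise ValueError("all structures must have the same length")
        for pos, ctx in enumerate(contexts):
            profile[pos][ctx] += weight / total
    return profile


# Two hypothetical alternative structures of one 9-nt RNA, weighted 3:1.
# In dot-bracket terms these are .((...)). and ..(...)..
profile = mix(["ELLHHHRRE", "EELHHHREE"], [3, 1])
print(profile[1]["L"], profile[1]["E"])  # 0.75 0.25
```

Position 1 is paired in one structure and unpaired in the other, so its weight is split between L and E, while positions on which both structures agree keep a weight of 1 for a single context.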
The rnascan suite consists of two tools:
- run_folding: Calculate an average structural context profile of an RNA sequence by folding overlapping 100 nt subsequences and averaging across them.
- rnascan: Scan RNA sequences and secondary structures with sequence and secondary structure PFMs.
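The averaging step of run_folding can be pictured with a small sketch. This is only an illustration under stated assumptions, not the tool's implementation: the per-window profiles are taken as given (in practice they would come from folding each window), and the names `window_starts` and `average_profile` as well as the `step` parameter are hypothetical.

```python
ALPHABET = "BEHLMRT"


def window_starts(seq_len, window=100, step=1):
    """Start offsets of overlapping windows (step is an assumed parameter)."""
    if seq_len <= window:
        return [0]
    starts = list(range(0, seq_len - window + 1, step))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # make sure the 3' end is covered
    return starts


def average_profile(seq_len, window_profiles):
    """Average per-position context weights over overlapping windows.

    window_profiles: iterable of (start, profile) pairs, where profile is a
    list with one dict (context letter -> weight) per window position.
    """
    sums = [{c: 0.0 for c in ALPHABET} for _ in range(seq_len)]
    coverage = [0] * seq_len
    for start, profile in window_profiles:
        for offset, column in enumerate(profile):
            pos = start + offset
            coverage[pos] += 1
            for c in ALPHABET:
                sums[pos][c] += column.get(c, 0.0)
    return [
        {c: (sums[i][c] / coverage[i] if coverage[i] else 0.0) for c in ALPHABET}
        for i in range(seq_len)
    ]


def column(ctx):
    """One-hot column for a single context letter."""
    return {c: (1.0 if c == ctx else 0.0) for c in ALPHABET}


# Toy case: a 4-nt sequence covered by two 3-nt windows starting at 0 and 1.
toy = [(0, [column("E")] * 3), (1, [column("H")] * 3)]
avg = average_profile(4, toy)
print(avg[1]["E"], avg[1]["H"])  # 0.5 0.5
```

Positions covered by several windows receive the mean of the weights from each window, so the ends of a sequence are averaged over fewer windows than the middle.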
Read the following steps to install rnascan. If you do not plan on using the run_folding tool to fold sequences, you may skip the steps with an asterisk (*).
To predict secondary structures, the program RNAfold from the ViennaRNA package is used. Please follow the installation instructions on their website.
git clone git@github.com:morrislab/rnascan.git
cd rnascan
The compiled binary must be saved in a location where it can be executed (i.e., in a directory listed in your PATH environment variable). Here, we use the user's local bin:
g++ -o ~/bin/parse_secondary_structure scripts/parse_secondary_structure.cpp
This package requires Python 2.7+ or Python 3.5+. To install the package, run the following:
python setup.py install
# alternatively, for user-specific installation:
python setup.py install --user
Dependencies (pandas, numpy, and biopython) will be automatically downloaded and installed, if necessary.
For full documentation of options, refer to the help messages using the -h option for each command.
run_folding sequences.fasta /path/to/output_dir
The second argument, /path/to/output_dir, is the directory where the average structure profiles will be saved. One file is written per FASTA record.
Scanning can be performed in four modes:
- Sequence only (using -p to specify the sequence PFM)
- Structure only (using -q to specify the structure PFM)
- Sequence and structure (-p and -q)
- Sequence and averaged structure (-p and -q)
Here are some example commands using minimal options:
# To run a test sequence
rnascan -p pfm_seq.txt -t AGTTCCGGTCCGGCAGAGATCGCG > hits.tab
# Sequence-only (use -p)
rnascan -p pfm_seq.txt sequences.fasta > hits.tab
# Structure-only (use -q)
rnascan -q pfm_struct.txt structures.fasta > hits.tab
# Sequence and structure
rnascan -p pfm_seq.txt -q pfm_struct.txt sequences.fasta structures.fasta > hits.tab
# Sequence and averaged structure
rnascan -p pfm_seq.txt -q pfm_struct.txt sequences.fasta averaged_structures/ > hits.tab
Note that in the last example, the second positional argument is the path to a directory containing the average structure profiles generated by run_folding. rnascan will look inside the directory and automatically search for files that look like structure.<sequence_id>.txt.
To print the score at every position, change the default threshold using the -m option to -inf. To change the number of processing cores, use -c:
rnascan -p pfm_seq.txt -q pfm_struct.txt -m ' -inf' -c 8 sequences.fasta averaged_structures/ > hits.tab
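To make the role of the threshold concrete, here is a minimal sketch of threshold-based motif scanning (Python 3). It assumes a log2-odds score against a background distribution, which is a common way to score PFM matches. The exact scoring used by rnascan is not described here, so the function `scan`, its parameters, and the toy PFM are illustrative assumptions only.

```python
import math


def scan(seq, pfm, background, threshold=float("-inf")):
    """Return (start, window, score) for windows scoring at least `threshold`.

    pfm: list of dicts, one per motif position, mapping letter -> probability.
    background: dict mapping letter -> background probability.
    The score is an assumed log2-odds sum over motif positions.
    """
    width = len(pfm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = 0.0
        for col, letter in zip(pfm, window):
            p = col.get(letter, 0.0)
            if p == 0.0:
                score = float("-inf")  # impossible under the motif
                break
            score += math.log2(p / background[letter])
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical 2-column PFM that prefers "AG", with a uniform background.
pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "U": 0.1},
]
uniform = {letter: 0.25 for letter in "ACGU"}

every_position = scan("CAGA", pfm, uniform)            # threshold of -inf
strong_hits = scan("CAGA", pfm, uniform, threshold=0)  # only positive scores
print(len(every_position), [h[:2] for h in strong_hits])  # 3 [(1, 'AG')]
```

With a threshold of negative infinity every window is reported, whereas a finite threshold keeps only the windows that match the motif better than the threshold requires.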
By default, rnascan computes the background probabilities from the input sequences at the beginning of the run. To apply a uniform background, use the option -u:
rnascan -p pfm_seq.txt -u sequences.fasta > hits.tab
To compute the background probabilities of a set of input sequences and save them for future use, use the option --bgonly:
rnascan -p pfm_seq.txt --bgonly sequences.fasta > background.txt
rnascan -q pfm_struct.txt --bgonly structures.fasta > background.txt
In this mode, rnascan computes the background probabilities, writes them to standard output (in the form of a Python dictionary), and exits (no scanning is performed). To re-use this background later, use the option --bg_seq or --bg_struct with the background file:
rnascan -p pfm_seq.txt --bg_seq background.txt sequences.fasta > hits.tab
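For intuition, background probabilities of this kind can be sketched as simple letter frequencies (Python 3). The snippet below is an assumption-laden illustration rather than rnascan's code: the function name `background` is hypothetical, and the exact layout of the file written by --bgonly is not specified beyond being a Python dictionary, so the round trip through `repr` and `ast.literal_eval` only shows one plausible way to save and restore such a dictionary.

```python
import ast
from collections import Counter


def background(seqs, alphabet="ACGU"):
    """Letter frequencies across all sequences, restricted to `alphabet`."""
    counts = Counter()
    for seq in seqs:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no letters from the alphabet were found")
    return {ch: counts[ch] / total for ch in alphabet}


bg = background(["ACGU", "AAGG"])
print(bg)  # {'A': 0.375, 'C': 0.125, 'G': 0.375, 'U': 0.125}

# Save as a dictionary literal and read it back without using eval().
text = repr(bg)
restored = ast.literal_eval(text)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for reading a dictionary from a text file.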
Cook, K.B., Vembu, S., Ha, K.C.H., Zheng, H., Laverty, K.U., Hughes, T.R., Ray, D., Morris, Q.D., 2017. RNAcompete-S: Combined RNA sequence/structure preferences for RNA binding proteins derived from a single-step in vitro selection. Methods 126, 18–28. http://www.sciencedirect.com/science/article/pii/S1046202317300312