RDI (Repeat Detection Index) is a Codon library used to build an index for finding repeated substrings. RDI leverages this index to perform de novo genome assembly.
bash -c "$(curl -fsSL https://exaloop.io/install.sh)"
export platform=$(uname -s | awk '{print tolower($0)}')-$(uname -m)
curl -L https://github.com/exaloop/seq/releases/download/v0.11.3/seq-${platform}.tar.gz \
| tar zxvf - -C ~/.codon/lib/codon/plugins
Example for Python interoperability:
export CODON_PYTHON=/usr/lib/python3.8/config-3.8-x86_64-linux-gnu/libpython3.8.so
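The path above is specific to one machine, so it may help to ask the Python interpreter where its shared library lives. The following is a minimal sketch using only the standard `sysconfig` module; the helper name `guess_libpython` is hypothetical, and the approach assumes a Unix-like Python build that ships a shared `libpython`. If it finds nothing, set `CODON_PYTHON` by hand.

```python
import os
import sysconfig


def guess_libpython():
    """Return a likely path to the shared libpython, or None if none is found.

    Hypothetical helper: it only inspects the build configuration of the
    interpreter that runs it and checks that the file exists on disk.
    """
    libdir = sysconfig.get_config_var("LIBDIR")
    # INSTSONAME usually names the shared library; LDLIBRARY is a fallback.
    for key in ("INSTSONAME", "LDLIBRARY"):
        name = sysconfig.get_config_var(key)
        if not libdir or not name or name.endswith(".a"):
            continue  # skip missing values and static archives
        candidate = os.path.join(libdir, name)
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = guess_libpython()
    if path:
        print(f"export CODON_PYTHON={path}")
    else:
        print("Shared libpython not found; set CODON_PYTHON manually.")
```

Run it with the same Python interpreter you want Codon to interoperate with, then paste the printed `export` line into your shell.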
- CMake:
sudo apt install cmake
- Clone the repo:
git clone git@github.com:curtisupshall/rdi
- Fetch submodules:
make submodules
- Install libdivsufsort:
make libdivsufsort
- Compile RDI:
make rdi
RDI runs in two modes. In index mode, RDI builds a repeat detection index and writes it to disk next to your input file. In query mode, you can make queries against the index.
./rdi index path/to/your/file.fa
./rdi query -l 10 -r 6
Name | Type | Description |
---|---|---|
-h, --help | - | Help |
-r, --repeats | int | Repeats |
-l, --length | int | Kmer length |
-i, --input | string | Path to batch query file |
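To make the length and repeats parameters concrete, here is a brute-force Python sketch of what such a query could compute. It is not RDI's indexing algorithm, only a naive reference that scans every k-mer. It assumes a k-mer matches when it occurs at least `repeats` times; this README does not say whether RDI matches an exact or a minimum count, so treat that as an assumption. The function name `find_repeats` is hypothetical.

```python
from collections import defaultdict


def find_repeats(sequence, length, repeats):
    """Map each k-mer of `length` to its start offsets, keeping only k-mers
    that occur at least `repeats` times (assumed semantics, see above)."""
    positions = defaultdict(list)
    # Slide a window of `length` characters over the whole sequence.
    for start in range(len(sequence) - length + 1):
        positions[sequence[start:start + length]].append(start)
    return {
        kmer: offsets
        for kmer, offsets in positions.items()
        if len(offsets) >= repeats
    }


hits = find_repeats("ACGTACGTACGA", length=4, repeats=2)
for kmer, offsets in sorted(hits.items()):
    print(len(kmer), kmer, offsets)
```

This takes time and memory proportional to the sequence length for each query, which is the cost a prebuilt index is meant to avoid.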
Write each query as an ordered pair of length, then repeat count. Example:
$ cat my-file.txt
30 20
10 50
./rdi query data/test.fa -i my-file.txt
10 GGCCAAGGCG @712432
10 TAATCCTAGC @562644
10 TACAGGTGCC @1220261
10 TAGCTGGGTG @1831828
10 TGCGGTGGCT @1821871
10 TTTGCCATGT @266207
10 TTTTGCCATG @790679
> Ran 2 queries in 0.240325927734375 ms
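If you drive batch queries from a script, the sketch below builds the batch file contents and parses the results. Both helpers are hypothetical and rest on assumptions read off the example above: each query is a whitespace-separated `length repeats` pair written on its own line, and each hit is printed as a length, a k-mer over the alphabet `ACGTN`, and an `@`-prefixed number (presumably an offset into the input).

```python
import re

# One hit looks like "10 GGCCAAGGCG @712432" (format inferred from the sample).
HIT_PATTERN = re.compile(r"(\d+)\s+([ACGTN]+)\s+@(\d+)")


def format_batch(queries):
    """Render (length, repeats) pairs as batch file text, one pair per line."""
    return "".join(f"{length} {repeats}\n" for length, repeats in queries)


def parse_hits(output):
    """Extract (length, kmer, offset) tuples from the query output text."""
    return [
        (int(length), kmer, int(offset))
        for length, kmer, offset in HIT_PATTERN.findall(output)
    ]


batch_text = format_batch([(30, 20), (10, 50)])
sample_output = (
    "10 GGCCAAGGCG @712432\n"
    "10 TAATCCTAGC @562644\n"
    "> Ran 2 queries in 0.240325927734375 ms\n"
)
for hit in parse_hits(sample_output):
    print(hit)
```

Write `batch_text` to a file and pass that path with `-i`. The parser ignores the trailing timing line because it does not match the hit pattern.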
- Indexing strategy, particularly around minimal perfect hashing
- Parallelization
Many thanks to M. Oguzhan Kulekci for providing the indexing algorithm used in this project, as well as pseudocode, examples, and general guidance.