TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
- Update to the latest abPOA
Download the latest release:
wget https://github.com/yangao07/TideHunter/releases/download/v1.4.3/TideHunter-v1.4.3.tar.gz
tar -zxvf TideHunter-v1.4.3.tar.gz && cd TideHunter-v1.4.3
Make from source and run with test data:
make; ./bin/TideHunter ./test_data/test_50x4.fa > cons.fa
Or, install via conda and run with test data:
conda install -c bioconda tidehunter
TideHunter ./test_data/test_50x4.fa > cons.fa
- Introduction
- Installation
- Getting started with toy example in test_data
- Usage
- Commands and options
- Input
- Output
- Contact
TideHunter is an efficient and sensitive tool for tandem repeat detection and consensus calling, designed for long reads that contain tandemly repeated sequence (INC-seq, R2C2, NanoAmpli-Seq).
It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and places no limit on the maximum repeat pattern size.
On Linux/Unix and Mac OS, TideHunter can be installed via
conda install -c bioconda tidehunter
You can also build TideHunter from source files. Make sure you have gcc (>=6.4.0) and zlib installed before compiling. It is recommended to download the latest release of TideHunter from the release page.
wget https://github.com/yangao07/TideHunter/releases/download/v1.4.3/TideHunter-v1.4.3.tar.gz
tar -zxvf TideHunter-v1.4.3.tar.gz
cd TideHunter-v1.4.3; make
Or, you can use the git clone command to download the source code.
Don't forget to include --recursive to download the code of abPOA.
This gives you the latest version of TideHunter, which might still be under development.
git clone --recursive https://github.com/yangao07/TideHunter.git
cd TideHunter; make
If you run into any compilation issues, please try the pre-built binary file:
wget https://github.com/yangao07/TideHunter/releases/download/v1.4.3/TideHunter-v1.4.3_x64-linux.tar.gz
tar -zxvf TideHunter-v1.4.3_x64-linux.tar.gz
TideHunter ./test_data/test_1000x10.fa > cons.fa
TideHunter ./test_data/test_1000x10.fa > cons.fa
TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out
TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa
TideHunter -u ./test_data/test_1000x10.fa > unit.fa
TideHunter -u -f 2 ./test_data/test_1000x10.fa > unit.out
Options:
Seeding:
-k --kmer-length INT k-mer length (no larger than 16). [8]
-w --window-size INT window size. [1]
-s --step-size INT step size. [1]
-H --HPC-kmer use homopolymer-compressed k-mer. [False]
Tandem repeat criteria:
-c --min-copy INT minimum copy number of tandem repeat. [2]
-e --max-diverg FLT maximum allowed divergence rate between two consecutive repeats. [0.25]
-p --min-period INT minimum period size of tandem repeat. (>=2) [30]
-P --max-period INT maximum period size of tandem repeat. (<=4294967295) [10K]
Scoring parameters for partial order alignment:
-M --match INT match score [2]
-X --mismatch INT mismatch penalty [4]
-O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]
-E --gap-ext INT(,INT) gap extension penalty (E1,E2) [2,1]
TideHunter provides three gap penalty modes, cost of a g-long gap:
- convex (default): min{O1+g*E1, O2+g*E2}
- affine (set O2 as 0): O1+g*E1
- linear (set O1 as 0): g*E1
Adapter sequence:
-5 --five-prime STR 5' adapter sequence (sense strand). [NULL]
-3 --three-prime STR 3' adapter sequence (anti-sense strand). [NULL]
-a --ada-mat-rat FLT minimum match ratio of adapter sequence. [0.80]
Output:
-o --output STR output file. [stdout]
-u --unit-seq only output the unit sequences of each tandem repeat, no consensus sequence. [False]
-l --longest only output the consensus sequence of the tandem repeat that covers the longest read sequence. [False]
-F --full-len only output full-length consensus sequence. [False]
-f --out-fmt INT output format. [1]
- 1: FASTA
- 2: Tabular
Computing resource:
-t --thread INT number of threads to use. [4]
General options:
-h --help print this help usage information.
-v --version show version number.
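As a worked illustration of the convex gap penalty listed under the scoring parameters, the sketch below evaluates min{O1+g*E1, O2+g*E2} for a gap of length g. The function name gap_cost is hypothetical and not part of TideHunter; the defaults are the values shown for -O and -E.

```shell
# Hypothetical helper (not part of TideHunter): convex gap cost
# min{O1 + g*E1, O2 + g*E2} with the default penalties -O 4,24 and -E 2,1.
gap_cost() {
    # $1: gap length g; optional $2..$5 override O1 E1 O2 E2
    g=$1; o1=${2:-4}; e1=${3:-2}; o2=${4:-24}; e2=${5:-1}
    c1=$(( o1 + g * e1 ))   # first (steeper) piece
    c2=$(( o2 + g * e2 ))   # second (flatter) piece
    if [ "$c1" -le "$c2" ]; then echo "$c1"; else echo "$c2"; fi
}

gap_cost 1    # prints 6  (4 + 1*2 is cheaper than 24 + 1*1)
gap_cost 20   # prints 44 (both pieces cost 44)
gap_cost 30   # prints 54 (24 + 30*1 is cheaper than 4 + 30*2)
```

With the defaults, the two pieces cross at g = 20, so short gaps are charged by (O1,E1) and longer gaps by (O2,E2).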
TideHunter works with FASTA, FASTQ, gzip'd FASTA(.fa.gz) and gzip'd FASTQ(.fq.gz) formats.
Additional adapter sequence files can be provided to TideHunter with the -5 and -3 options.
TideHunter uses adapter information to search for the full-length sequence from the generated consensus.
Once two adapters are found, TideHunter trims and reorients the consensus sequence.
TideHunter outputs consensus sequences in FASTA format by default; it can also provide output in tabular format.
For tabular format, 11 columns will be generated for each consensus sequence:
No. | Column name | Explanation |
---|---|---|
1 | readName | the original read name |
2 | repN | N is the ID number of the tandem repeat, within each read, starts from 0 |
3 | readLen | length of the original long read |
4 | start | start coordinate of the tandem repeat, 1-based |
5 | end | end coordinate of the tandem repeat, 1-based |
6 | consLen | length of the consensus sequence |
7 | copyNum | copy number of the tandem repeat |
8 | aveMatch | average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length) |
9 | fullLen | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |
10 | subPos | start coordinates of all the tandem repeat unit sequences, followed by the end coordinate of the last tandem repeat unit sequence, separated by commas (,); all coordinates are 1-based, see examples below |
11 | consSeq | consensus sequence |
For example, here is the output for a non-full-length consensus sequence generated from test_data/test_50x4.fa, together with an explanation of the coordinates in the output:
test_50x4 rep0 300 51 250 50 4.0 100.0 0 59,109,159,208 CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT
In this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], [159,208], from the raw read, which is 300 bp long. A 50 bp consensus sequence is generated from the three repeat units. TideHunter then extends the tandem repeat boundary to [51,250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units. The extended region spans 200 bp, which matches the reported copy number of 4.0 for a 50 bp unit.
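The unit intervals quoted above can be recovered from the subPos column alone. The following is a minimal sketch, assuming subPos follows the documented layout (unit start coordinates followed by the end coordinate of the last unit); the function name subpos_to_units is hypothetical and not part of TideHunter.

```shell
# Hypothetical helper (not part of TideHunter): expand a subPos value into
# 1-based [start,end] intervals, one per repeat unit.
subpos_to_units() {
    # $1: subPos field, e.g. "59,109,159,208"
    printf '%s\n' "$1" | awk -F',' '{
        for (i = 1; i < NF; i++) {
            # The last value is the end of the final unit; every earlier
            # unit ends one base before the next unit starts.
            end = (i == NF - 1) ? $NF : $(i + 1) - 1
            printf "%s[%d,%d]", (i > 1 ? " " : ""), $i, end
        }
        printf "\n"
    }'
}

subpos_to_units "59,109,159,208"   # prints: [59,108] [109,158] [159,208]
```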
Another example of the output for a full-length consensus sequence generated from test_data/full_length.fa:
8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19 rep0 5231 31 5215 203 8.8 95.7 1 207,798,1386,1976,2563,3155,3746,4333,4930 ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA
In this example, the consLen (i.e., 203) is the length of the full-length consensus sequence excluding the 5' and 3' adapter sequences, and the subPos (i.e., 207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinate information of the identified tandem repeat units.
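The fullLen column makes it straightforward to keep only full-length consensus sequences from a tabular result. Below is a rough sketch, assuming the 11 columns are separated by whitespace (tabs or spaces) and that read names contain no whitespace; full_length_summary is a hypothetical name, not a TideHunter command.

```shell
# Hypothetical helper (not part of TideHunter): from tabular output, keep rows
# whose fullLen (column 9) is non-zero and print readName, consLen, copyNum
# and the strand implied by fullLen (1: sense, 2: anti-sense).
full_length_summary() {
    awk '$9 != 0 {
        strand = ($9 == 1) ? "sense" : "antisense"
        printf "%s\t%s\t%s\t%s\n", $1, $6, $7, strand
    }' "$@"
}

# The two example rows shown above; only the second one is full-length.
printf '%s\n' \
  'test_50x4 rep0 300 51 250 50 4.0 100.0 0 59,109,159,208 CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT' \
  '8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19 rep0 5231 31 5215 203 8.8 95.7 1 207,798,1386,1976,2563,3155,3746,4333,4930 ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA' \
  | full_length_summary
```

The function also accepts file names, so it can be applied to a file written with -f 2, such as cons.out.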
For FASTA output format, the read name contains detailed information about the detected tandem repeat, i.e., columns 1 to 10 above. The sequence is the consensus sequence.
The read name of each consensus sequence has the following format:
>readName_repN_readLen_start_end_consLen_copyNum_aveMatch_fullLen_subPos
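Because the original read name may itself contain underscores (for example test_50x4), the header is easiest to split from the right. The sketch below assumes the header carries exactly the ten underscore-joined fields shown above; parse_header is a hypothetical helper, and the sample header is assembled from the tabular example for illustration rather than copied from real output.

```shell
# Hypothetical helper (not part of TideHunter): split a consensus FASTA header
# into tab-separated fields, taking the last 9 underscore-separated fields as
# repN..subPos and everything before them as readName.
parse_header() {
    # $1: header line, with or without the leading ">"
    printf '%s\n' "$1" | awk -F'_' '{
        sub(/^>/, "", $1)
        name = $1
        for (i = 2; i <= NF - 9; i++) name = name "_" $i   # rebuild readName
        printf "%s", name
        for (i = NF - 8; i <= NF; i++) printf "\t%s", $i   # repN .. subPos
        printf "\n"
    }'
}

# Illustrative header built from the test_50x4 tabular row
parse_header '>test_50x4_rep0_300_51_250_50_4.0_100.0_0_59,109,159,208'
```

The result has the same column order as columns 1 to 10 of the tabular format.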
TideHunter can output the unit sequences without performing the consensus calling step when option -u/--unit-seq is enabled. Then, only the following information will be output for the tabular format:
No. | Column name | Explanation |
---|---|---|
1 | readName | the original read name |
2 | repN | N is the ID number of the tandem repeat, within each read, starts from 0 |
3 | unitX | X is the ID number of the unit sequence, starts from 0 |
4 | unitSeq | unit sequence |
And for the FASTA format:
>readName_repN_unitX
unitSeq X
>readName_repN_unitY
unitSeq Y
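As a quick check on unit output, the number of unit sequences per tandem repeat can be tallied from the FASTA headers. This is a small sketch, assuming headers follow the readName_repN_unitX pattern above; count_units is a hypothetical name, and the sample records use made-up read names and placeholder sequences purely for illustration.

```shell
# Hypothetical helper (not part of TideHunter): count unit sequences per
# tandem repeat in a unit FASTA file, keyed by readName_repN.
count_units() {
    awk '/^>/ {
        name = substr($1, 2)              # drop the leading ">"
        sub(/_unit[0-9]+$/, "", name)     # drop the trailing _unitX
        if (!(name in n)) order[++k] = name
        n[name]++
    }
    END { for (i = 1; i <= k; i++) printf "%s\t%d\n", order[i], n[order[i]] }' "$@"
}

# Placeholder records (not real TideHunter output)
printf '%s\n' '>read1_rep0_unit0' 'ACGT' '>read1_rep0_unit1' 'ACGT' \
              '>read1_rep1_unit0' 'GGCC' | count_units
# prints two tab-separated lines: read1_rep0 2, then read1_rep1 1
```

The function also accepts file names, so it can be applied to a file written with -u, such as unit.fa.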
Yan Gao [email protected]
Yadong Wang [email protected]
Yi Xing [email protected]