-
Notifications
You must be signed in to change notification settings - Fork 6
/
README.SHRiMP132
1121 lines (821 loc) · 46.4 KB
/
README.SHRiMP132
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
$Id: README 509 2010-01-30 16:40:39Z matei $
README for SHRiMP: SHort Read Mapping Package version 1.3.2
http://compbio.cs.toronto.edu/shrimp
This document describes the programs of which SHRiMP is comprised, including
their use, output formats, and various parameters. The algorithms employed are
also briefly described. For more details, see the following publication:
Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, et al.
(2009) SHRiMP: Accurate Mapping of Short Color-space Reads.
PLoS Comput Biol 5(5): e1000386. doi:10.1371/journal.pcbi.1000386
SHRiMP is brought to you by:
Stephen M. Rumble
Michael Brudno
Phil Lacroute
Vladimir Yanovsky
Marc Fiume
Adrian Dalca
Matei David
The authors may be contacted at shrimp at cs dot toronto dot edu.
We would like to thank Dr. Lucian Ilie of the University of Western Ontario for
providing us with sets of spaced seeds especially designed to achieve good
sensitivity.
--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------
1. Overview
2. Quick Start
3. Usage
4. Program Parameters
5. Output Format
6. SHRiMP Algorithms
7. SHRiMP Statistics
8. Mate Pairs
9. Contact
10. Acknowledgements
--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------
SHRiMP is a software package for aligning genomic reads against a target genome.
It was primarily developed with the multitudinous short reads of Next Generation
Sequencing (NGS) machines in mind. It allows for rapid mapping, accurate
alignment, and p-value computation for Illumina/Solexa as well as Applied
Biosystems' SOLiD colourspace genomic representation. SHRiMP is a suite of
several programs, which, employed in succession, search for appropriate
alignments, analyze alignment statistics, and print out visual alignments for
further study.
SHRiMP uses two techniques to rapidly match reads to a genome: we use an
effective implementation of the q-gram filtering technique (utilizing spaced
seeds) to rapidly identify potentially homologous regions, and a vectored
Smith-Waterman implementation to speed up the accurate alignment of these
regions to reads. We also include in the SHRiMP package tools to analyze the
resultant alignments, including programs that compute for every alignment the
probability that the match would occur by chance (in a genome with iid
nucleotides), a prettyprint tool that can help visualize the alignments for each
read, and a variation output tool (shrimp_var) what will output the variations
detected for a specific hit in detail. For AB SOLiD colour space reads we use a
novel color-space to letter-space Smith Waterman algorithm to identify
sequencing errors as well as SNPs and micro-indels. The details of the algorithm
and the methods we use to compute p-values are laid out in Sections 6 and 7,
respectively.
--------------------------------------------------------------------------------
2. Quick Start
--------------------------------------------------------------------------------
This section provides a very brief and straightforward introduction to using
SHRiMP. For more details, see Sections 3 and 4.
2.1 Aligning reads
---
Use bin/rmapper-foo to align reads, where 'foo' is either 'ls', 'cs', or
'hs' for letter-space (454, Illumina/Solexa), colour-space (AB SOLiD), and
Helicos-space (Helicos HeliScope SMS 2-pass reads), respectively.
E.g.:
bin/rmapper-cs reads.csfasta /path/to/genomedir/chr*.fa > output.rmapper
If you want full, pretty-printed alignments, use the -P flag. For more
details about parameters, see Sections 3 and 4.
2.2 Calculating Probabilities
---
Use bin/probcalc on the output created by rmapper to generate the statistics
described in Section 7. The first parameter is the total genome length.
E.g.:
bin/probcalc 7000000 output.rmapper > output.probcalc
2.3 Pretty Printing Reads
---
If the -P flag is not specified to rmapper, full alignments are omitted. To
print alignments after the fact, just run bin/prettyprint-foo (where 'foo'
is one of 'ls', 'cs', or 'hs').
E.g.:
bin/prettyprint-cs output.probcalc /path/to/genomedir/ reads.csfasta
2.4 Printing hit variations
---
To print the variations in hit alignments in detail, run shrimp_var on the
shrimp/probcalc output.
E.g.:
bin/shrimp_var -r output.probcalc
2.5 Combine Reads into Mate Pairs
---
Use bin/probcalc_mp to combine the read mappings into mate pairs. First, one
should sort the probcalc output in one large file (or a binary file with the
datastructure specified by dbtypes.h/mapping_t). To have one large sorted file
pipe the probcalc output through '> sort' and finally combine all files by
using '> sort -m'. Finally, run probcalc_mp:
./probcalc_mp -m sortedfile -f _F3 -b _R3 -g 3080419480 -M 2000
--------------------------------------------------------------------------------
3. Usage
--------------------------------------------------------------------------------
The distribution makes use of several programs. The first and most important is
'rmapper'. 'rmapper' performs Smith-Waterman alignments of multiple reads within
one fasta file against one or more reftigs in other fasta files. rmapper was
designed to map a set of reads against the entire genome (all contigs and their
reverse-complements) in one invocation. Parallelism can be achieved by splitting
the set of reads into N chunks, where N is the desired level of parallelism.
It is important to note that 'rmapper' is always given a reference genome in
letter-space, regardless of whether the reads are in letter-space or AB SOLiD's
colour-space representation. For letter-space reads, use 'rmapper-ls', for
colour-space reads, 'rmapper-cs', and for Helicos SMS reads, 'rmapper-hs'. Other
SHRiMP utilities that make assumptions about the input read format come in
pairs, i.e. 'foo-ls' and 'foo-cs', for letters and colours, respectively. Other
utilities lacking any suffix are format-agnostic.
Once 'rmapper' has been run, the standard output format of all alignments may be
parsed via the 'probcalc' program. This code analyzes all alignment output,
saving the top 'n' matches per read, and calculates the probability of the match
randomly occurring in the genome and matching the read, and the normalized odds
(P(read match)/P(random)/normalization_factor). Thresholds can be set for each
to remove undesirably high or low values. Sorting can also be done on any one of
the three aforementioned parameters.
The output produced by the 'probcalc' utility is essentially a subset of the
input, with added 'pgenome', 'pchance' and 'normodds' fields. These matches may
then have their full alignments printed using the 'prettyprint' utility
(although one could also feed rmapper output to prettyprint directly). Also, the
shrimp_var utility will print the variations for each hit in detail.
What follows is a complete example of scanning a set of reads against an entire
genome (and its reverse-complement) from the Ciona Savignyi organism. We'll
then calculate the associated probabilities and print out pretty alignments.
We shall assume a large set of colourspace reads exist in a single file
'reads.csfasta' and the entire Ciona genome exists in 'ciona.fasta' (as a single
contig), both in the present working directory. $S represents the path to
SHRiMP.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
PLEASE NOTE:
This example uses a lot of parameters to exhibit the configurability of
SHRiMP. Settings shown here are not necessarily appropriate for your reads.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
1) mkdir reads
2) mkdir results
3) cd reads
4) python $S/utils/splitreads.py 1000 ../reads.csfasta
5) cd ..
6) $S/bin/rmapper-cs -m 8 -i -30 -g -35 -e -7 -x -25 -h 70% -v 65% \
-M 70bp,fast \
-o 100 -B \
reads/0_to_999.csfasta ciona.fasta > results/ciona.0_to_999.out
or
$S/bin/rmapper-cs -m 100 -i -70 -g -100 -e -70 -x -200 -h 70% -v 65% \
-s 111100111 -n 2 -t 4 -w 140% \
-o 10 -B \
reads/0_to_999.csfasta ciona.fasta > results/ciona.0_to_999.out
7) $S/bin/probcalc -p 0.4 -t 10 173673243 \
results/ > ciona.0_to_999.probcalc.out
8) $S/bin/prettyprint-cs -m 100 -i -70 -g -100 -e -70 -x -200 \
ciona.0_to_999.probcalc.out genome/ reads/
9) $S/bin/shrimp_var -p \
ciona.0_to_999.probcalc.out > ciona.0_to_999.probcalc.var
Lines 1&2 set up the necessary directory structure. reads/ contains one or more
fasta files containing letterspace or colourspace reads. results/ contains the
results of the 'rmapper' pass, which are fed to 'probcalc' to generate the
probabilities of them being poor matches.
Lines 3&4 split the reads.csfasta file, which contains a large number of
colourspace reads into smaller chunks. This both saves memory, and permits us to
parallelize the computation.
Line 6 maps all reads split into the file 0_to_999.csfasta against ciona.fasta.
The various parameters are verbosely documented later in this file. By default,
rmapper will employ settings geared toward human reads of about 50 bases.
Biological Relevance
--
The options shown on the first line (-m ... -v) are used to set the scores and
the score thresholds for matches. Even though 'rmapper' includes some
sensible default settings for these scores, we suggest that you provide your
own scores and score thresholds, because 'rmapper' has no way of knowing what
is biologically relevant for the sequences being scanned.
Speed vs Sensitivity
--
The options shown on the second line (-M or -s ... -w) affect the tradeoff
between speed and sensitivity. By default, 'rmapper' uses 4 spaced seeds of
weight 12 and requires 4 matches between a read and a genome window in order
to further investigate that particular match. These settings are targeted for
dealing with a large (>100,000) number of reads, of about 50bp each.
********** NEW in SHRiMP 1.3.0
* 'rmapper' comes with several predefined sets of parameters, that can be
* loaded using the option '-M' (explained below). We _strongly_ recommend
* using one of these sets until becomes familiar with the individual effects
* of the other options.
**********
Some further guidelines:
Seeds (-s):
- Using as many (4) seeds of smaller weight (e.g., 11 or 10) mildly
increases sensitivity but greatly increases running time.
- Using a single seed, even of smaller weight (e.g., 9), is sometimes fast
when dealing with few (<10,000) reads.
- We recommend using the default 4 seeds of weight 12 in most cases.
Matches per Window (-n):
- Increasing this parameter decreases the running time, but also decreases
sensitivity.
- The default is 4, which is targeted towards dealing with 50bp reads.
- For smaller reads (e.g., 35bp or 25bp), a setting of 3 or even 2 might be
more desirable in order to achieve good sensitivity.
- For longer reads, (e.g. 70bp or 100bp), settings above 6 no longer provide
substantial speed benefits.
Window Length (-w):
- By default, reads are aligned against genome windows of 135% the length of
the read.
- For smaller reads (e.g., 35bp or 25bp), we recommend increasing this to
150% or even 170%.
- For longer reads, we recommend a slightly lower setting, say 125%.
Note: The options given to 'rmapper' need not be ordered in the way they are
in the example.
The output of 'rmapper' is piped into a file in the results/ directory for later
evaluation. This step could be parallelized across all *_to_*.csfasta files
created by splitreads. Note that the genomic file may contain one or several
contigs, that multiple genome files may be specified, and that reverse
complements are automatically scanned as well.
Line 7 calculates the probability of each hit generated by 'rmapper' being bad.
It takes two mandatory parameters: the total concatenated genome length, and at
least one shrimp results file or directory of results file(s) (multiple files
and/or directories may be specified). The output is piped into a further results
file for later evaluation by the 'prettyprint-cs' program. Note that 'probcalc'
should be run once enough results have been gathered to generate reasonable
statistics. Since this phase takes a relatively short amount of time, it is
probably best done once sequentially for all output generated. However, one may
also use the -G and -g flags to probcalc to run statistics generation in
parallel, sequentially merge the statistics (i.e. just concatenate outputs), and
calculate probabilities in parallel as well. See Section 4 for more details.
Also note that probcalc mandates that results for a single read do not appear in
multiple files. If rmapper has been run against the entire genome for some set
of reads this assumption will hold. See section 4 for details regarding
probcalc's parameters.
Line 8 prints pretty visual alignments of our resultant mappings. It requires
knowing all genomic and reads files in order to locate each read referenced in
the input file (generated by either 'rmapper' or 'probcalc') and aligns them
against the appropriate contig in the genome. Alternatively, one could have run
rmapper with the -P flag to generate these in the initial output (although that
could consume a considerable amount of disk space). Note that rmapper and
prettyprint share the same default parameters. Deviating from the defaults in
rmapper means having to provide the appropriate S-W penalties to prettyprint as
well.
Line 9 outputs the variations of the hits computed by rmapper in detail,
specifically SNPs, insertions and deletions. The needed flag is the input file
type, that is -r for rmapper and -p for probcalc. If the -R flag is used in
rmapper, it should also be supplied to shrimp_var.
--------------------------------------------------------------------------------
4. Program Parameters
--------------------------------------------------------------------------------
'rmapper' takes a variety of parameters, which differ sightly depending on
whether colour-space or letter-space reads are being employed. What follows is a
run-down of these options (in some strange, non-alphabetical order).
rmapper-cs and rmapper-ls parameters (common parameters):
[ -M mode ]
Selects one of the default sets of parameters. The possible values are
{'fast'/'sensitive'}, {'35bp'/'50bp'/'70bp'} or simply 'mirna'. Several
modes can be given as '-M 35bp -M sensitive' or comma-separated as '-M
fast,70bp', but you should specify at most one value from each set
above. By default, 'rmapper' uses '50bp' and 'fast'. Do not mix the
'mirna' mode with any of the others.
The values 'fast'/'sensitive' provide basic tradeoffs between speed and
sensitivity. The values '35bp'/'50bp'/'70bp' load settings suggested for
dealing with reads of (around) that length.
The speed and readlength mode specifiers currently set: the seeds,
'seed_matches_per_window', 'seed_taboo_length' and 'seed_window_length'
(see their individual descriptions below). If any of these is
explicitly set with its own flag, that value overrides the value that
would have been selected by the mode.
The 'mirna' mode loads some settings that have been suggested to us for
the analysis of AB Solid (colour space) micro RNA data. (Thank you
Alessandro Guffanti.) Specifically, '-M mirna' is equivalent to:
'-s 00111111001111111100,00111111110011111100,00111111111100111100,\
00111111111111001100,00111111111111110000 -H -n 1 -w 100% -ungapped'
Notes about -M mirna: to disable hashing, add -H as a separate
parameter; if specifying either -n or -w explicitly on the command line,
those values override the defaults in the mirna mode. E.g., '-M mirna -H
-n 2 -w 30' is equivalent to '-s [as above] -n 2 -w 30 -ungapped'.
Modes are not available in Helicos space.
[ -s spaced_seed ]
The spaced seed is a single contiguous string of 0's and 1's. 0's
represent wildcards, or positions which will always be considered as
matching, whereas 1's dictate positions that must match. A string of all
1's will result in a simple kmer scan.
Multiple -s arguments may be provided, in which case rmapper will use
all of the spaced seeds. Multiple seeds can also be specified as a
comma-separated list, e.g., '-s 1101,111011'. 'rmapper' comes with
several predefined sets of spaced seeds for specific weights. These can
be selected by prefixing the weight with a 'w', e.g., '-s w10' loads the
default (4) seeds of weight 10.
Note that, by default, our implementation creates a lookup table based
on the kmer size (spaced seed 'weight', or number of 1's). Hence memory
usage increases by a factor of four for each addition 1. At 16, we're
looking at a 32GB hash table allocation for 32-bit architectures. See
the -H option for an alternative method, which creates a hash table,
permitting much larger seeds.
[ -n seed_matches_per_window ]
The number of seed matches per window dictates how many seeds must match
within some window length of the genome before that region is considered
for Smith-Waterman alignment. A lower value will increase sensitivity
while drastically increasing running time. Higher values will have the
opposite effect.
[ -t seed_taboo_length ]
The seed taboo length specifies how many target genome bases or colours
must exist prior to a previous seed match in order to count another seed
match as a hit.
[ -w seed_window_length ]
This parameter specifies the length in bases (or colours) of a genomic
window against which a read is aligned. It is either an absolute value
(e.g., '-w 67'), or a percentage of the read length (e.g., '-w 150%').
In order for a read to be given consideration by the Smith-Waterman
alignment machinery, 'seed_matches_per_window' must be detected within
such a genomic window.
[ -W window_overlap_length ]
After a genomic window A is found to contain a significant match to a
given read, no genomic window B is investigated for matches against that
same read as long as B overlaps A by more than 'window_overlap_length'.
[ -o maximum_hits_per_read ]
This parameter specifies how many hits to remember for each read. If
more hits are encountered, ones with lower scores are dropped to make
room.
[ -d kmer_std_dev_limit ]
This option permits pruning read kmers, which occur with frequencies
greater than 'kmer_std_dev_limit' standard deviations above the
average. This can shorten running time at the cost of some sensitivity.
NB: A negative value disables this option.
[ -m sw_match_value ]
The value applied to matches during the Smith-Waterman score
calculation.
[ -i sw_mismatch_value ]
The value applied to mismatches during the Smith-Waterman score
calculation.
[ -g sw_gap_open_penalty_reference ]
The value applied to gap opens along the reference sequence during the
Smith-Waterman score calculation.
Note that for backward compatibility, if -g is set and -q is not set,
the gap open penalty for the query will be set to the same value as
specified for the reference.
[ -q sw_gap_open_penalty_query ]
The value applied to gap opens along the query sequence during the
Smith-Waterman score calculation.
[ -e sw_gap_extend_penalty_reference ]
The value applied to gap extends during the Smith-Waterman score
calculation.
Note that for backward compatibility, if -e is set and -f is not set,
the gap extend penalty for the query will be set to the same value as
specified for the reference.
[ -f sw_gap_extend_penalty_query ]
The value applied to gap extends during the Smith-Waterman score
calculation.
[ -h sw_threshold ]
NB: This option differs slightly in meaning between letter-space and
colour-space.
In letter-space, this parameter determines the threshold score for both
vectored and full Smith-Waterman alignments. Any values less than this
quantity will be thrown away.
In colour-space, this parameter affects only the full Smith- Waterman
alignment, which is performed in letter-space. The threshold of the
colour-space fast vectored alignment can be specified by the -v
option. Generally, the -h parameter should be stricter (higher) than the
-v option, since naive colour-space alignments using regular
Smith-Waterman suffer additionally due to artifacts such as single SNPs
resulting in two colour mismatches.
[ -Z ]
Disable hash-and-cache mechanism during genomic scan (see Algorithms
section for more details).
[ -Y min_cache_size,max_cache_size ]
Controls the per-read cache size used by the hash-and-cache mechanism
(see Algorithms section for more details). The first time a potential
match is found for a read, a cache is initialized for that read, with
size 'min_cache_size'. Subsequently, on cache misses with a full cache,
the size of the cache is doubled until it reaches 'max_cache_size'. The
absolute maximum is 128.
[ -A anchor_width ]
For every match of a read against a genomic window, 'rmapper' saves the
positions of the 'seed_matches_per_window' kmers that triggered that
match. Subsequently, during the final Smith-Waterman alignment,
'rmapper' restricts the computed region of the 2D matrix by adding
"anchors" or "necks" of width 'anchor_width' around the matched
kmers. The default value is 8. Use of anchors is disabled by providing
-1 as width.
[ -ungapped ]
Perform gapless alignment. When this flag is specified, rmapper looks
for gapless alignment between reads and the database. In letter space,
this means only mismatches are tolerated. In colour space, there can be
both mismatches and crossovers. The scoring scheme still applies. -G
also implies -A 0 -g -255 -q -255.
[ -B ]
This option simply prints a progress bar to stderr during the spaced
seed scan and vectored Smith-Waterman phases. It exists to give a
general feel for run-time when testing parameters. Since it will slow
down execution speed noticeably (25% or so), it is not enabled by default
and should only be used during manual, interactive execution.
[ -C ]
Only process the negative strand of the genome (the reverse
`C'omplement). By default, both strands are scanned.
[ -F ]
Only process the positive strand of the genome (going `F'orwards). By
default, both strands are scanned.
[ -H ]
Use a hash table, rather than a direct lookup table for seeds. A direct
lookup will be more efficient for small seeds, but quickly becomes
prohibitively large for longer ones. For seeds of weight greater than
about 14, this option should be used.
Note that rmapper will not automatically switch to using a hash table if
the seed is too large for a direct lookup, the reason being that this
option has simply not been as well-tested.
Note also that this cannot be used in conjunction with the -d parameter.
[ -P ]
'rmapper' has two output formats. The first, and default, prints a list
of appropriately scoring reads and various parameters, such as where
they occurred in the genome (index), how many matches, mismatches, and
gaps there were, and so forth. The '-P' flag enables a 'pretty print'
output, which displays similar parameters, but also a full alignment.
These alignments can also be obtained after the fact by running the
default output file through the 'probcalc' program.
[ -R ]
Include the entire read sequence in the output generated. This will
consume huge gobs of disk space for large reads and is disabled by
default. Saved reads are placed in the 'r_seq' key.
[ -T ]
Reverse the order in which tie-breaks are resolved in the full aligner
when doing the negative strand. This should help line up gaps when
negative matches are reverse-complemented and compared with positive
matches.
[ -U ]
After performing all alignments and outputting the results, generate a
list of unmapped reads (i.e. reads for which no sufficiently
high-scoring alignment was found in a contig). The reads are listed one
per line and follow a special comment header to separate them from the
alignment results. For example, the output file will eventually look
something like the following:
...
>1035_1688_1316_F3 reftig_9 - 3721634 ...
>1035_1688_1316_F3 reftig_9 - 3721162 ...
>1035_1688_1316_F3 reftig_9 - 3722237 ...
#
#UNMAPPED READS:
#
1035_1577_840_F3
1035_1587_820_F3
1035_1594_1563_F3
...
To extract the list of unmapped reads from the output file, simply list
every line following "#UNMAPPED READS:". This can be accomplished with
the utils/extractunmapped.py script:
python utils/extractunmapped.py /path/to/results.out
Alternatively, one can use the following grep command:
grep -A 2147483647 "^#UNMAPPED READS:" /path/to/results.out
`2147483647' is the number of lines after the matching context to
output. In the case where we want them all, 2147483647 is the largest
acceptable value for GNU grep.
rmapper-cs-specific parameters:
[ -x crossover_penalty ]
This specifies the penalty applied when transitioning between
Smith-Waterman matrices during the full scan phase. While the vectored
scan applies to colour-space, the final full alignment is done in
letter-space. Since each next letter in letter-space depends on the
previous letter and colour, any error on the colour space read will
affect all following letters when converting to text space. For this
reason, we must perform our alignment of all four possible letter space
translations of the read and permit jumping between matrices (at the
crossover_penalty) cost, when errors occur.
[ -v sw_vector_hit_threshold ]
Unlike in letter-space, where the vectored Smith-Waterman and full
Smith-Waterman alignments are done both in letter-space and present
identical scores, in colour-space the vectored score represents the
original colour-space read aligned to the colour-space translation of
the genome. This will differ from the final alignment, which is done
purely in letter-space. Since the function of the vectored pass exists
merely to prune insufficiently good alignments and the vectored pass is
not a true textual alignment, scores for both passes are likely to
differ. Generally, since a single SNP in letter-space will result in two
colour changes, the threshold for the colour-space alignment should be
less than that of letter-space.
rmapper-ls-specific parameters:
None for now.
'prettyprint' currently takes only one optional parameter.
prettyprint-cs and prettyprint-ls parameters (common parameters):
[ -m sw_match_value ]
See the 'rmapper' -m description.
[ -i sw_mismatch_value ]
See the 'rmapper' -i description.
[ -g sw_gap_open_penalty_reference ]
See the 'rmapper' -g description.
[ -q sw_gap_open_penalty_query ]
See the 'rmapper' -q description.
[ -e sw_gap_extend_penalty_reference ]
See the 'rmapper' -e description.
[ -f sw_gap_extend_penalty_query ]
See the 'rmapper' -f description.
[ -R ]
Include the entire read sequence in the output generated. This will only
have an effect for input lines containing the 'r_seq' key.
[ -T ]
Reverse the order in which tie-breaks are resolved in the full aligner
when doing the negative strand. This should help line up gaps when
negative matches are reverse-complemented and compared with positive
matches.
prettyprint-cs-specific parameters:
[ -x crossover_penalty ]
See the 'rmapper-cs' -x description.
prettyprint-ls-specific parameters:
None for now.
'probcalc' takes a few optional parameters as well:
[-g rates_file]
Rather than calculating rates of alignment events using the provided
output file, assume the ones provided in 'rates_file'. This may be used
to run probcalc in parallel by first running all instances with the -G
flag, concatenating those outputs into a single 'rates_file', and then
running once more with the -g parameter to generate probabilities.- .
[-n normodds_cutoff]
Set a threshold for normodds. Any values lower than this threshold will
be suppressed.
[-o pgenome_cutoff]
Set a threshold for pgenome. Any values lower than this threshold will
be suppressed.
[-p pchance_cutoff]
Set a threshold for pchance. Any values higher than this threshold will
be suppressed.
[-r erate,srate,irate,mrate]
Rather than calculate rates of various events, provide them explicitly
and use those values in the calculations. Here, 'erate', 'srate',
'irate' and 'mrate' mean errors (miscalls), substitutions (SNPs), indels
and matches, respectively.
Values should be in the range 0.0 to 1.0.
[-s normodds|pgenome|pchance]
Sort based on normodds, pgenome, or pchance given the appropriate
string.
[-t top_matches]
Save only top_matches best matches.
[ -B ]
Print a progress bar to stderr during various phases of computation.
[ -G ]
Only generate rates of various alignment events and send them to stdout.
This can be used to run probcalc in parallel. See also '-g'.
[ -R ]
Include the entire read sequence in the output generated. This will only
have an effect for input lines containing the 'r_seq' key.
[ -S ]
Do everything in a single pass. Typically probcalc will run over all
rmapper results files twice in order to save memory. This option will
use one pass only at the expense of using far more memory than usual,
however, a significant speed advantage is gained. Additionally, while
probcalc normally mandates that results from rmapper for a specific read
appear in at most one file, this option does not have the same
restriction.
'probcalc_mp' takes quite a few parameters.
Mandatory parameters:
[-m mapping_filename]
the mapping file (normal text or binary)
[-f forward_suffix]
the suffix for forward reads. For e.g.: "_F3"
[-b reverse_suffix]
the suffix for backwards (reverse) reads. For e.g.: "_R3"
[-g genome_length]
the genome length, for purposes of pchance computation
[-M hard_distance_limit]
for computing the mean statistics, the hard M limit
Optional parameters:
[-L nr_mate_pairs]
the number of *good* mate pairs to include in the statistics
calculations. do -L 0 to include *all* good mate pairs. default: 50,000
[-i file_type]
the type of file, file_type can be 'ascii' or 'binary'
[-R]
indicator that the read sequence is included in the input, as used by
probcalc
[-x max_reads_to_expect]
the maximum number of mappings one should expect for one read
[-d]
discordant analysis.
[-u]
in computing the mean, use unique mappings only
[-T max_reads_to_output]
the maximum(ish) number of mappings to output per mate pair
[-C PCHANCE_CUTOFF]
pchance cutoff
[-G PGENOME_CUTOFF]
pgenome cutoff
[-s nr_stdevs]
after computing stats, M (hard_distance_limit) is set to the mean +
nr_stdev * stdev. Default: nr_stdevs = 2;
[-c ]
do not print mate-pairs with hits on different chromosomes
'shrimp_var' takes only few parameters:
Mandatory
[-r | -p]
tell shrimp_var weather the input files are rmapper or probcalc input.
Optional
[-R ]
indicator that the read sequence is included in the input, as used by
probcalc
--------------------------------------------------------------------------------
5.0 Output Format
--------------------------------------------------------------------------------
rmapper probcalc, and prettyprint all adhere to a common output format: lines
corresponding to individual hits with tab-delimited fields. Such lines always
begin with a '>' character in the first position. All utilities ignore any lines
that do not begin with '>', such as alignments, comments, etc.
Here's an example ('\' indicates continuation of the same logical line on the
next line of this README file and does not appear in the actual output):
>947_1567_1384_F3 reftig_991 + 22901 22923 3 \
25 25 2020 18x2x3
Additionally, the beginning of each output file begins with a specification of
the tab-delimited fields. For example, the following applies to the above read
hit:
#FORMAT: readname contigname strand contigstart contigend readstart readend\
readlength score editstring
The #FORMAT: line allows new fields to be unambiguously added, or others removed
or reordered without requiring alteration to the parsing routines.
Descriptions of the columns are as follows:
'readname' Read tag name
'contigname' Genome (Contig/Chromosome) name
'strand' Genome strand ('+' or '-')
'contigstart' Start of alignment in genome (beginning with 1, not 0).
'contigend' End of alignment in genome (inclusive).
'readstart' Start of alignment in read (beginning with 1, not 0).
'readend' End of alignment in read (inclusive).
'readlength' Length of the read in bases/colours.
'score' Alignment score
'editstring' Edit string: read to reference summary; see below.
Additionally, probcalc output adds the following columns:
'pchance' Probability that a read will align with a genome with
as good a score or better by chance.
'pgenome' Probability that a hit was generated via common
evolutionary events characteristic of the genome.
'normodds' Normalized pgenome/pchance.
The 'editstring' column contains a short description of the alignment with
reference to the database sequence. This allows at-a-glance analysis of
alignments, including identification of miscalls, SNPs, indels, etc.
The edit string consists of numbers, characters and the following additional
symbols: '-', '(' and ')'. It is constructed as follows:
<number> = size of a matching substring
<letter> = mismatch, value is the tag letter
(<letters>) = gap in the reference, value shows the letters in the tag
- = one-base gap in the tag (i.e. insertion in the reference)
x = crossover (inserted between the appropriate two bases)
For example:
A perfect match for 25-bp tags is: "25"
A SNP at the 16th base of the tag is: "15A9"
A four-base insertion in the reference: "3(TGCT)20"
A four-base deletion in the reference: "5----20"
Two sequencing errors: "4x15x6" (i.e. 25 matches with 2 crossovers)
The 'probcalc_mp' program has the following output:
#FORMAT: fwd_name fwd_chr fwd_editstring fwd_strand fwd_start\
fwd_end fwd_pg rev_name rev_chr rev_editstring rev_strand rev_start\
rev_end rev_pg distance normodds pgenome pchance
for example:
0 1000_1003_1062_F3 chr10 25 - 71183122\
71183146 1.000 1000_1003_1062_R3 chr10 25\
- 71183699 71183723 1.000 601 1.000 0.967 0.0000000010
Descriptions of these columns are as follows:
'fwd_name' Forward Read tag name
'fwd_chr' Forward Genome (Contig/Chromosome) name
'fwd_editstring' Forward Edit string: read to reference summary;
see above.
'fwd_strand' Forward Genome strand ('+' or '-')
'fwd_start' Forward Start of alignment in genome (beginning
with 1, not 0).
'fwd_end' Forward End of alignment in genome (inclusive).
'fwd_pg' Forward pgenome.
'rev_name' Reverse Read tag name
'rev_chr' Reverse Genome (Contig/Chromosome) name
'rev_editstring' Reverse Edit string: read to reference summary;
see above.
'rev_strand' Reverse Genome strand ('+' or '-')
'rev_start' Reverse Start of alignment in genome (beginning
with 1, not 0).
'rev_end' Reverse End of alignment in genome (inclusive).
'rev_pg' Reverse pgenome.
'distance' Distance between the mate pairs
'pgenome' Probability that the mate-pair hits were
generated via common evolutionary events
characteristic of the genome.
'pchance' Probability that a mate pair will align with a
genome with as good a score or better by chance.
'normodds' Normalized pgenome/pchance.
The shrimp_var tool has a significantly different output format:
>read_name edit_string contig_start #SNPs #ins #dels var_id1 var_id2 ...
for example:
>example_1 2G1---12(AC)2 101 1 1 1 s-G-103 d-3-105 i-AC-119
Description of these columns are as follows:
'read_name' The name of the read
'edit_string' Edit string: read to reference summary; see above
'contig_start' Start of alignment in genome (beginning with 1, not 0).
'#SNPs' The number of SNPs present in the hit alignment
'#ins' The number of insertions present in the hit alignment
'#dels' The number of deletions present in the hit alignment
'var_idX' The variation id, description, and position. Details:
if variation is a SNP, the var_idX is s-L-pos, where s means
SNP, L is a letter that the residue is changed to, and pos is
the contig position of the SNP; if variation is a insertion, the
var_idX is i-INS-pos where i signals an insertion, INS includes
the inserted residues and pos is the contig position of the
insertion; if variation is a deletion, the var_idX is of form
d-len-pos, with d signaling a deletion, len representing the
length of the deletion, and pos is the contig position of the
deletion. More examples and a discussion on negative strand
handling is available in pdf via email.
--------------------------------------------------------------------------------
6. SHRiMP Algorithms
--------------------------------------------------------------------------------
SHRiMP uses a fast k-mer scan in order to locate the regions of potential
homology. One important feature is that we index the reads, rather than the
reference genome. This makes our memory usage independent of the size of the
genome (but proportional to the size of the largest contig). All of the k-mers
in all of the source reads are generated and a lookup table mapping the k-mers
to the reads where they are present is built. We use spaced k-mers (also used in
the PatternHunter as well as many other tools) and allow the user to specify the
particular pattern of ones (positions that must match) and zeros (positions that
may differ).
For each read containing the kmer, a notation is added that a kmer hit was just
made. If 'n' kmer hits can be fit in a genomic region of length 'm', the hit and
the region are fed into a vectored Smith-Waterman algorithm to generate an
alignment score. If the score is sufficient, the score and genomic offset
(index) is remembered. Once this process has completed for the entire genome,
top scoring saved read/index pairs are fed into a full Smith-Waterman algorithm,
and a detailed output is generated.
In order to deal with reads that match the genome many times, 'rmapper' uses a
hash-and-cache mechanism. Every time a genomic region is investigated for a
match against a given read (by the vectored SW algorithm), we compute a hash
value for that region, and we save it along with its resulting score in a
per-read cache. Subsequently, before feeding another genomic region into the