README.SHRiMP132

$Id: README 509 2010-01-30 16:40:39Z matei $

README for SHRiMP: SHort Read Mapping Package version 1.3.2
http://compbio.cs.toronto.edu/shrimp

This  document describes the  programs of  which SHRiMP  is comprised, including
their use,  output formats, and various  parameters. The algorithms employed are
also briefly described. For more details, see the following publication:

    Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, et al.
    (2009) SHRiMP: Accurate Mapping of Short Color-space Reads.
    PLoS Comput Biol 5(5): e1000386. doi:10.1371/journal.pcbi.1000386

SHRiMP is brought to you by:
	Stephen M. Rumble
	Michael Brudno
	Phil Lacroute
	Vladimir Yanovsky
	Marc Fiume
	Adrian Dalca
	Matei David

The authors may be contacted at shrimp at cs dot toronto dot edu.

We would like to thank Dr. Lucian Ilie of the  University of Western Ontario for
providing  us  with sets  of spaced  seeds  especially designed  to achieve good
sensitivity.


--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------
  1.   Overview
  2.   Quick Start
  3.   Usage
  4.   Program Parameters
  5.   Output Format
  6.   SHRiMP Algorithms
  7.   SHRiMP Statistics
  8.   Mate Pairs
  9.   Contact
  10.  Acknowledgements


--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------

SHRiMP is a software package for aligning genomic reads against a target genome.
It was primarily developed with the multitudinous short reads of Next Generation
Sequencing (NGS)   machines  in mind. It   allows  for  rapid mapping,  accurate
alignment,  and p-value   computation  for Illumina/Solexa   as well as  Applied
Biosystems'  SOLiD colourspace   genomic representation. SHRiMP  is   a suite of
several    programs, which,  employed   in   succession, search for  appropriate
alignments, analyze alignment statistics, and  print  out visual alignments  for
further study.

SHRiMP uses two techniques to    rapidly match reads    to a genome: we use   an
effective implementation  of  the q-gram filtering  technique  (utilizing spaced
seeds)  to rapidly  identify potentially  homologous   regions, and  a  vectored
Smith-Waterman  implementation  to speed up  the    accurate alignment of  these
regions to  reads. We also  include in the SHRiMP  package  tools to analyze the
resultant alignments, including programs that  compute  for every alignment  the
probability   that the match  would   occur by chance (in   a   genome with  iid
nucleotides), a prettyprint tool that can help visualize the alignments for each
read, and a  variation output tool  (shrimp_var) what will output the variations
detected for a specific hit in detail. For AB SOLiD colour  space reads we use a
novel  color-space to   letter-space   Smith  Waterman  algorithm   to  identify
sequencing errors as well as SNPs and micro-indels. The details of the algorithm
and the  methods we use to  compute p-values are laid out  in Sections  6 and 7,
respectively.


--------------------------------------------------------------------------------
2. Quick Start
--------------------------------------------------------------------------------

  This section provides a very  brief and straightforward introduction to  using
  SHRiMP. For more details, see Sections 3 and 4.

  2.1 Aligning reads
  ---

    Use bin/rmapper-foo to  align reads, where  'foo' is  either 'ls',  'cs', or
    'hs'  for letter-space (454,  Illumina/Solexa), colour-space (AB SOLiD), and
    Helicos-space (Helicos HeliScope SMS 2-pass reads), respectively.

    E.g.:
        bin/rmapper-cs reads.csfasta /path/to/genomedir/chr*.fa > output.rmapper

    If  you want  full,  pretty-printed alignments,  use  the -P flag.  For more
    details about parameters, see Sections 3 and 4.

  2.2 Calculating Probabilities
  ---

    Use bin/probcalc on the output created by rmapper to generate the statistics
    described in Section 7. The first parameter is the total genome length.

    E.g.:
	bin/probcalc 7000000 output.rmapper > output.probcalc

  2.3 Pretty Printing Reads
  ---

    If the -P flag is not specified to rmapper,  full alignments are omitted. To
    print alignments  after the fact,  just run bin/prettyprint-foo (where 'foo'
    is one of 'ls', 'cs', or 'hs').

    E.g.:
        bin/prettyprint-cs output.probcalc /path/to/genomedir/ reads.csfasta

  2.4 Printing hit variations
  ---

    To print the variations  in hit alignments  in detail, run shrimp_var on the
    shrimp/probcalc output.
  	
    E.g.:
  	bin/shrimp_var -r output.probcalc
        
  2.5 Combine Reads into Mate Pairs
  ---

  Use bin/probcalc_mp to combine  the read mappings into  mate pairs. First, one
  should sort the probcalc  output in one large file  (or a binary file with the
  datastructure specified by dbtypes.h/mapping_t). To have one large sorted file
  pipe the  probcalc output through  '> sort'  and finally combine  all files by
  using '> sort -m'. Finally, run probcalc_mp:

  	./probcalc_mp -m  sortedfile -f _F3 -b _R3 -g 3080419480 -M 2000 


--------------------------------------------------------------------------------
3. Usage
--------------------------------------------------------------------------------

The distribution makes use of several programs. The  first and most important is
'rmapper'. 'rmapper' performs Smith-Waterman alignments of multiple reads within
one fasta file  against one or more  reftigs in other  fasta files. rmapper  was
designed to map a set of reads against the entire genome  (all contigs and their
reverse-complements) in one invocation. Parallelism can be achieved by splitting
the set of reads into N chunks, where N is the desired level of parallelism.

It is important to note  that 'rmapper' is  always given  a reference genome  in
letter-space, regardless of whether the reads  are in letter-space or AB SOLiD's
colour-space representation. For   letter-space  reads, use   'rmapper-ls',  for
colour-space reads, 'rmapper-cs', and for Helicos SMS reads, 'rmapper-hs'. Other
SHRiMP  utilities that  make assumptions  about  the input  read format  come in
pairs, i.e. 'foo-ls' and 'foo-cs', for letters  and colours, respectively. Other
utilities lacking any suffix are format-agnostic.

Once 'rmapper' has been run, the standard output format of all alignments may be
parsed via the  'probcalc'  program. This code   analyzes all alignment  output,
saving the top 'n' matches per read, and calculates the probability of the match
randomly occurring in the genome and matching the  read, and the normalized odds
(P(read match)/P(random)/normalization_factor).  Thresholds can be  set for each
to remove undesirably high or low values. Sorting can also be done on any one of
the three aforementioned parameters.

The  output produced by  the 'probcalc' utility  is  essentially a subset of the
input, with added 'pgenome', 'pchance' and 'normodds'  fields. These matches may
then  have  their  full   alignments  printed  using the 'prettyprint'   utility
(although one could also feed rmapper output to prettyprint directly). Also, the
shrimp_var utility will print the variations for each hit in detail.

What follows is a complete example of scanning a set of  reads against an entire
genome (and  its  reverse-complement) from the Ciona   Savignyi organism.  We'll
then calculate the associated probabilities and print out pretty alignments.

We  shall  assume  a large   set  of colourspace reads  exist  in  a single file
'reads.csfasta' and the entire Ciona genome exists in 'ciona.fasta' (as a single
contig), both   in  the present working  directory.  $S  represents the  path to
SHRiMP.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
PLEASE NOTE:
    This example uses a lot of parameters to exhibit the configurability of
    SHRiMP. Settings shown here are not necessarily appropriate for your reads.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  1)  mkdir reads
  2)  mkdir results
  3)  cd reads
  4)  python $S/utils/splitreads.py 1000 ../reads.csfasta
  5)  cd ..
  6)  $S/bin/rmapper-cs -m 8 -i -30 -g -35 -e -7 -x -25 -h 70% -v 65% \
          -M 70bp,fast \
          -o 100 -B \
          reads/0_to_999.csfasta ciona.fasta > results/ciona.0_to_999.out
  or
      $S/bin/rmapper-cs -m 100 -i -70 -g -100 -e -70 -x -200 -h 70% -v 65% \
          -s 111100111 -n 2 -t 4 -w 140% \
          -o 10 -B \
          reads/0_to_999.csfasta ciona.fasta > results/ciona.0_to_999.out
  7)  $S/bin/probcalc -p 0.4 -t 10 173673243 \
          results/ > ciona.0_to_999.probcalc.out
  8)  $S/bin/prettyprint-cs -m 100 -i -70 -g -100 -e -70 -x -200 \
         ciona.0_to_999.probcalc.out genome/ reads/
  9)  $S/bin/shrimp_var -p \
         ciona.0_to_999.probcalc.out > ciona.0_to_999.probcalc.var 

Lines 1&2 set up the necessary directory  structure. reads/ contains one or more
fasta files containing letterspace  or colourspace reads. results/  contains the
results of  the 'rmapper'  pass, which  are  fed to  'probcalc' to  generate the
probabilities of them being poor matches.

Lines 3&4   split the  reads.csfasta  file,  which  contains a  large  number of
colourspace reads into smaller chunks. This both saves memory, and permits us to
parallelize the computation.

Line 6 maps all reads split into the  file 0_to_999.csfasta against ciona.fasta.
The various parameters are verbosely documented  later in this file. By default,
rmapper will  employ  settings geared  toward  human reads  of about   50 bases.

  Biological Relevance
  --

  The options shown on the first line (-m ... -v) are used to set the scores and
  the  score  thresholds  for  matches.   Even though  'rmapper'  includes  some
  sensible default settings for these  scores, we suggest  that you provide your
  own scores and score thresholds, because 'rmapper' has  no way of knowing what
  is biologically relevant for the sequences being scanned.

  Speed vs Sensitivity
  --

  The  options shown on  the second line  (-M or -s ... -w)  affect the tradeoff
  between speed and sensitivity.  By  default, 'rmapper' uses  4 spaced seeds of
  weight 12 and requires 4 matches between  a read and  a genome window in order
  to further investigate that particular match.  These settings are targeted for
  dealing with a  large (>100,000) number  of reads,  of  about 50bp  each.

    ********** NEW in SHRiMP 1.3.0
    * 'rmapper' comes with  several predefined sets of  parameters,  that can be
    * loaded using the  option '-M' (explained  below).  We _strongly_ recommend
    * using one of these sets until becomes familiar with the individual effects
    * of the other options.
    **********

 Some further guidelines:

    Seeds (-s):

    - Using as many  (4)   seeds of smaller  weight   (e.g.,  11 or  10)  mildly
      increases sensitivity but greatly increases running time.

    - Using  a single seed,  even of smaller weight (e.g., 9), is sometimes fast
      when dealing with few (<10,000) reads.

    - We recommend using the default 4 seeds of weight 12 in most cases.

    Matches per Window (-n):

    - Increasing this parameter  decreases the running time,  but also decreases
      sensitivity.

    - The default is 4, which is targeted towards dealing with 50bp reads.

    - For smaller reads (e.g.,  35bp or 25bp), a setting of 3 or even 2 might be
      more desirable in order to achieve good sensitivity.

    - For longer reads, (e.g. 70bp or 100bp), settings above 6 no longer provide
      substantial speed benefits.

    Window Length (-w):

    - By default, reads are aligned against genome windows of 135% the length of
      the read.

    - For  smaller reads (e.g., 35bp  or 25bp), we  recommend increasing this to
      150% or even 170%.

    - For longer reads, we recommend a slightly lower setting, say 125%.

  Note: The options given to 'rmapper' need not  be ordered in  the way they are
  in the example.

The output of 'rmapper' is piped into a file in the results/ directory for later
evaluation.   This step could be   parallelized  across all *_to_*.csfasta files
created by splitreads.  Note that the genomic  file  may contain one or  several
contigs,  that  multiple  genome  files  may  be  specified,  and  that  reverse
complements are automatically scanned as well.

Line 7 calculates the probability of each hit generated  by 'rmapper' being bad.
It takes two mandatory parameters: the total  concatenated genome length, and at
least  one shrimp results file or  directory of  results file(s) (multiple files
and/or directories may be specified). The output is piped into a further results
file for later evaluation by the  'prettyprint-cs' program. Note that 'probcalc'
should be run  once  enough results  have been  gathered to generate  reasonable
statistics. Since  this phase  takes a  relatively  short amount of time,  it is
probably best done once sequentially for all output  generated. However, one may
also use the   -G and -g   flags to probcalc   to run  statistics generation  in
parallel, sequentially merge the statistics (i.e. just concatenate outputs), and
calculate probabilities in parallel as well. See Section 4 for more details.

Also note that probcalc mandates that results for a single read do not appear in
multiple files. If rmapper has  been run against the  entire genome for some set
of  reads this assumption    will hold. See    section 4 for   details regarding
probcalc's parameters.

Line  8 prints pretty visual  alignments of our  resultant mappings. It requires
knowing all genomic  and reads files in order  to locate each read referenced in
the  input file (generated  by either  'rmapper' or 'probcalc')  and aligns them
against the appropriate contig in the genome.  Alternatively, one could have run
rmapper with the -P flag to generate these  in the initial output (although that
could  consume a  considerable  amount of   disk  space). Note that  rmapper and
prettyprint share the same default   parameters. Deviating from the defaults  in
rmapper means having to provide the appropriate  S-W penalties to prettyprint as
well.

Line  9  outputs the  variations   of the hits   computed  by rmapper in detail,
specifically SNPs, insertions and deletions.  The needed flag  is the input file
type, that is  -r for rmapper  and -p for probcalc. If   the -R flag is  used in
rmapper, it should also be supplied to shrimp_var.

--------------------------------------------------------------------------------
4. Program Parameters
--------------------------------------------------------------------------------

'rmapper' takes a variety  of  parameters, which   differ sightly depending   on
whether colour-space or letter-space reads are being employed. What follows is a
run-down of these options (in some strange, non-alphabetical order).

    rmapper-cs and rmapper-ls parameters (common parameters):

	[ -M mode ]

	Selects one of the default sets of parameters.   The possible values are
	{'fast'/'sensitive'}, {'35bp'/'50bp'/'70bp'} or simply 'mirna'.  Several
	modes can be  given as '-M 35bp -M  sensitive' or comma-separated as '-M
	fast,70bp',  but  you should specify at  most   one value from  each set
	above. By default,  'rmapper'  uses '50bp' and  'fast'.  Do not  mix the
	'mirna' mode with any of the others.

	The values 'fast'/'sensitive' provide basic  tradeoffs between speed and
	sensitivity. The values '35bp'/'50bp'/'70bp' load settings suggested for
	dealing with reads of (around) that length.

	The speed  and readlength  mode   specifiers currently set:  the  seeds,
	'seed_matches_per_window', 'seed_taboo_length' and  'seed_window_length'
	(see  their    individual descriptions below).   If   any   of these  is
	explicitly set with its  own flag, that  value overrides the value  that
	would have been selected by the mode.

	The 'mirna' mode loads some settings that have  been suggested to us for
	the   analysis of AB Solid (colour   space) micro  RNA  data. (Thank you
	Alessandro Guffanti.) Specifically, '-M mirna' is equivalent to:

	'-s 00111111001111111100,00111111110011111100,00111111111100111100,\
	00111111111111001100,00111111111111110000 -H -n 1 -w 100% -ungapped'

	Notes about   -M  mirna: to  disable   hashing, add  -H as   a  separate
	parameter; if specifying either -n or -w explicitly on the command line,
	those values override the defaults in the mirna mode. E.g., '-M mirna -H
	-n 2 -w 30' is equivalent to '-s [as above] -n 2 -w 30 -ungapped'.

	Modes are not available in Helicos space.

	[ -s spaced_seed ]

	The  spaced seed is  a  single contiguous  string of  0's  and 1's.  0's
	represent wildcards,  or positions  which will  always  be considered as
	matching, whereas 1's dictate positions that must match. A string of all
	1's will result in a simple kmer scan.

	Multiple -s arguments  may be provided,  in which case rmapper  will use
	all of the spaced   seeds.  Multiple seeds  can  also be specified as  a
	comma-separated  list, e.g.,   '-s 1101,111011'.  'rmapper'   comes with
	several predefined sets of spaced  seeds for specific weights. These can
	be selected by prefixing the weight with a 'w', e.g., '-s w10' loads the
	default (4) seeds of weight 10.

	Note that, by  default, our implementation  creates a lookup table based
	on the kmer size (spaced seed 'weight', or number of 1's).  Hence memory
	usage increases by a  factor of four for each  addition 1. At 16,  we're
	looking at  a 32GB hash table allocation  for  32-bit architectures. See
	the -H option for  an alternative method,   which creates a hash  table,
	permitting much larger seeds.

	[ -n seed_matches_per_window ]

	The number of seed matches per window dictates how many seeds must match
	within some window length of the genome before that region is considered
	for  Smith-Waterman alignment.  A lower  value will increase sensitivity
	while drastically increasing  running time. Higher  values will have the
	opposite effect.

	[ -t seed_taboo_length ]

	The seed taboo length specifies how  many target genome bases or colours
	must exist prior to a previous seed match in order to count another seed
	match as a hit.

	[ -w seed_window_length ]

	This parameter specifies  the length in bases (or  colours) of a genomic
	window against which a read  is aligned. It is  either an absolute value
	(e.g., '-w 67'), or a percentage  of the read  length (e.g., '-w 150%').
	In  order for  a read  to  be given consideration by  the Smith-Waterman
	alignment machinery, 'seed_matches_per_window'  must be  detected within
	such a genomic window.

	[ -W window_overlap_length ]

	After a genomic  window A is found  to contain a  significant match to a
	given read, no genomic window B is investigated for matches against that
	same read as long as B overlaps A by more than 'window_overlap_length'.

	[ -o maximum_hits_per_read ]

	This parameter specifies how  many  hits to remember  for each  read. If
	more hits are encountered,  ones with lower scores  are dropped  to make
	room.

	[ -d kmer_std_dev_limit ]

	This option permits pruning  read  kmers, which occur with   frequencies
	greater    than 'kmer_std_dev_limit'   standard  deviations    above the
	average. This can shorten running time at the cost of some sensitivity.

	NB: A negative value disables this option.

	[ -m sw_match_value ]

	The   value    applied  to matches     during the   Smith-Waterman score
	calculation.

	[ -i sw_mismatch_value ]

	The  value applied  to   mismatches   during the  Smith-Waterman   score
	calculation.
		
	[ -g sw_gap_open_penalty_reference ]

	The value applied to gap  opens along the  reference sequence during the
	Smith-Waterman score calculation.

	Note  that for backward compatibility, if  -g is set  and -q is not set,
	the  gap open penalty  for the query will  be  set to  the same value as
	specified for the reference.

	[ -q sw_gap_open_penalty_query ]

	The value applied to   gap opens along   the query sequence during   the
	Smith-Waterman score calculation.

	[ -e sw_gap_extend_penalty_reference ]

	The  value  applied to  gap  extends   during  the  Smith-Waterman score
	calculation.

	Note that for  backward compatibility, if -e is  set and -f  is not set,
	the gap  extend penalty for the  query will be  set to the same value as
	specified for the reference.

	[ -f sw_gap_extend_penalty_query ]

	The value   applied  to  gap extends   during  the  Smith-Waterman score
	calculation.

	[ -h sw_threshold ]

	NB: This option  differs  slightly in meaning between  letter-space  and
	colour-space.

	In letter-space, this parameter determines the  threshold score for both
	vectored  and full Smith-Waterman alignments. Any  values less than this
	quantity will be thrown away.

	In colour-space,  this parameter affects only   the full Smith- Waterman
	alignment, which   is performed in   letter-space. The threshold  of the
	colour-space  fast  vectored  alignment can  be    specified  by the  -v
	option. Generally, the -h parameter should be stricter (higher) than the
	-v   option, since    naive   colour-space   alignments  using   regular
	Smith-Waterman suffer additionally due  to artifacts such as single SNPs
	resulting in two colour mismatches.

	[ -Z ]

	Disable hash-and-cache mechanism  during  genomic scan  (see  Algorithms
	section for more details).

	[ -Y min_cache_size,max_cache_size ]

	Controls the  per-read cache size used   by the hash-and-cache mechanism
	(see Algorithms section for more  details). The  first time a  potential
	match is found for  a read, a cache  is initialized for that read,  with
	size 'min_cache_size'. Subsequently, on  cache misses with a full cache,
	the size of the cache is doubled until  it reaches 'max_cache_size'. The
	absolute maximum is 128.

	[ -A anchor_width ]

	For every match of a read against  a genomic window, 'rmapper' saves the
	positions  of the 'seed_matches_per_window'   kmers that triggered  that
	match.    Subsequently,  during  the  final    Smith-Waterman alignment,
	'rmapper'  restricts the computed region   of  the 2D matrix by   adding
	"anchors"  or  "necks" of    width 'anchor_width'  around    the matched
	kmers. The default value  is 8. Use of anchors  is disabled by providing
	-1 as width.

	[ -ungapped ]

	Perform gapless  alignment. When this flag  is specified,  rmapper looks
	for gapless alignment between  reads and the  database. In letter space,
	this means only mismatches are tolerated.  In colour space, there can be
	both mismatches  and crossovers. The  scoring  scheme still  applies. -G
	also implies -A 0 -g -255 -q -255.

	[ -B ]

	This option simply  prints a  progress bar to  stderr during  the spaced
	seed scan   and  vectored Smith-Waterman phases.  It   exists to  give a
	general feel for run-time when  testing parameters.  Since it will  slow
	down execution speed noticeably (25% or so), it is not enabled by default
	and should only be used during manual, interactive execution.

	[ -C ]

	Only   process   the  negative  strand  of    the  genome  (the  reverse
	`C'omplement). By default, both strands are scanned.

	[ -F ]

	Only  process the positive strand  of the  genome (going `F'orwards). By
	default, both strands are scanned.

	[ -H ]

	Use a hash table, rather than a direct lookup table for seeds.  A direct
	lookup  will  be more  efficient for  small  seeds,  but quickly becomes
	prohibitively large  for longer ones. For seeds  of  weight greater than
	about 14, this option should be used.

	Note that rmapper will not automatically switch to using a hash table if
	the  seed is too  large for a direct  lookup, the reason being that this
	option has simply not been as well-tested.

	Note also that this cannot be used in conjunction with the -d parameter.

	[ -P ]

	'rmapper' has two output formats. The first,  and default, prints a list
	of appropriately   scoring reads and  various  parameters, such as where
	they occurred  in the genome (index),  how many matches, mismatches, and
	gaps there were, and so  forth.  The '-P' flag  enables a 'pretty print'
	output, which displays similar parameters, but also a full alignment.

	These  alignments can  also  be obtained after  the  fact by running the
	default output file through the 'probcalc' program.

	[ -R ]

	Include  the entire  read sequence  in the  output  generated. This will
	consume huge  gobs of  disk space  for  large reads  and  is disabled by
	default. Saved reads are placed in the 'r_seq' key.

	[ -T ]

	Reverse the  order in which tie-breaks are  resolved in the full aligner
	when doing  the negative  strand.  This  should help line  up  gaps when
	negative matches  are  reverse-complemented  and compared with  positive
	matches.

	[ -U ]

	After performing  all alignments and outputting  the results, generate a
	list   of unmapped reads    (i.e.    reads for which  no    sufficiently
	high-scoring alignment was found in a contig). The  reads are listed one
	per line and follow  a special comment header  to separate them from the
	alignment  results. For  example, the  output file  will eventually look
	something like the following:

	    ...
	    >1035_1688_1316_F3      reftig_9        -       3721634 ... 
	    >1035_1688_1316_F3      reftig_9        -       3721162 ... 
	    >1035_1688_1316_F3      reftig_9        -       3722237 ... 
	    #
	    #UNMAPPED READS:
	    #
	    1035_1577_840_F3
	    1035_1587_820_F3
	    1035_1594_1563_F3
	    ...

	To extract the list of unmapped reads from  the output file, simply list
	every line following "#UNMAPPED READS:".  This can be accomplished  with
	the utils/extractunmapped.py script:

		    python utils/extractunmapped.py /path/to/results.out

	Alternatively, one can use the following grep command: 

		    grep -A 2147483647 "^#UNMAPPED READS:" /path/to/results.out

        `2147483647'  is the  number  of  lines after  the   matching context to
        output. In the  case where we want them  all, 2147483647 is  the largest
        acceptable value for GNU grep.

    rmapper-cs-specific parameters:

	[ -x crossover_penalty ]

	This specifies  the  penalty     applied when   transitioning    between
	Smith-Waterman matrices during the  full scan phase. While the vectored
	scan applies  to  colour-space,  the  final full  alignment is  done  in
	letter-space.   Since each next letter  in   letter-space depends on the
	previous letter  and colour,  any error on   the colour space  read will
	affect  all following  letters when converting   to text space. For this
	reason, we must perform our alignment of all  four possible letter space
	translations of the read  and permit jumping  between matrices (at  the
	crossover_penalty) cost, when errors occur.

	[ -v sw_vector_hit_threshold ]

	Unlike in   letter-space, where  the  vectored Smith-Waterman   and full
	Smith-Waterman  alignments are done both    in letter-space and  present
	identical   scores, in colour-space  the   vectored score represents the
	original colour-space  read aligned to   the colour-space translation of
	the  genome. This will  differ from the  final  alignment, which is done
	purely in letter-space.  Since the  function of the vectored pass exists
	merely to prune insufficiently good  alignments and the vectored pass is
	not a  true  textual alignment, scores  for  both passes are   likely to
	differ. Generally, since a single SNP in letter-space will result in two
	colour  changes, the threshold  for the colour-space alignment should be
	less than that of letter-space.

    rmapper-ls-specific parameters:

	None for now.

'prettyprint' currently takes only one optional parameter.

    prettyprint-cs and prettyprint-ls parameters (common parameters):

	[ -m sw_match_value ]

	See the 'rmapper' -m description.

	[ -i sw_mismatch_value ]

	See the 'rmapper' -i description.

	[ -g sw_gap_open_penalty_reference ]

	See the 'rmapper' -g description.

	[ -q sw_gap_open_penalty_query ]

	See the 'rmapper' -q description.

	[ -e sw_gap_extend_penalty_reference ]

	See the 'rmapper' -e description.

	[ -f sw_gap_extend_penalty_query ]

	See the 'rmapper' -f description.

	[ -R ]

	Include the entire read sequence in the output generated. This will only
	have an effect for input lines containing the 'r_seq' key.

	[ -T ]

	Reverse the  order in which tie-breaks  are resolved in the full aligner
	when doing the negative  strand.  This  should  help line up  gaps  when
	negative  matches  are reverse-complemented and  compared  with positive
	matches.

    prettyprint-cs-specific parameters:

	[ -x crossover_penalty ]

	See the 'rmapper-cs' -x description.

    prettyprint-ls-specific parameters:

	None for now.


'probcalc' takes a few optional parameters as well:

	[-g rates_file]

	Rather  than calculating  rates  of alignment  events using the provided
	output file, assume the ones provided in 'rates_file'.  This may be used
	to run probcalc in parallel by  first running all  instances with the -G
	flag, concatenating  those outputs into a  single 'rates_file', and then
	running once more with the -g parameter to generate probabilities.- .

	[-n normodds_cutoff]

	Set a threshold for normodds. Any values lower  than this threshold will
	be suppressed.
	
	[-o pgenome_cutoff]

	Set  a threshold for pgenome. Any  values lower than this threshold will
	be suppressed.
	
	[-p pchance_cutoff]

	Set a threshold for pchance. Any values higher  than this threshold will
	be suppressed.

	[-r erate,srate,irate,mrate]

	Rather than calculate rates  of various events, provide them  explicitly
	and use  those values   in the  calculations. Here,   'erate',  'srate',
	'irate' and 'mrate' mean errors (miscalls), substitutions (SNPs), indels
	and matches, respectively.

	Values should be in the range 0.0 to 1.0.
	
	[-s normodds|pgenome|pchance]

	Sort  based  on normodds,  pgenome,   or pchance given  the  appropriate
	string.

	[-t top_matches]

	Save only top_matches best matches.

 	[ -B ]
 
 	Print a progress bar to stderr during various phases of computation.

	[ -G ]

	Only generate rates of various alignment events and send them to stdout.
	This can be used to run probcalc in parallel. See also '-g'.

	[ -R ]
 
	Include the entire read sequence in the output generated. This will only
	have an effect for input lines containing the 'r_seq' key.

 	[ -S ]
 
 	Do everything in  a single pass. Typically  probcalc  will run  over all
 	rmapper results files  twice in order to  save memory. This option  will
 	use one pass only  at the expense of  using far more memory than  usual,
 	however, a significant speed advantage   is gained. Additionally,  while
 	probcalc normally mandates that results from rmapper for a specific read
 	appear  in at  most  one  file, this   option  does  not have the   same
 	restriction.


'probcalc_mp' takes quite a few parameters. 

    Mandatory parameters:

	[-m mapping_filename] 
		
	the mapping file (normal text or binary)
		
	[-f forward_suffix]
		
	the suffix for forward reads. For e.g.: "_F3"
		
	[-b reverse_suffix] 
		
	the suffix for backwards (reverse) reads. For e.g.: "_R3"
		
	[-g genome_length] 
		
	the genome length, for purposes of pchance computation
		
	[-M hard_distance_limit] 
		
	for computing the mean statistics, the hard M limit 
		
    Optional parameters:
	
	[-L nr_mate_pairs] 
		
	the  number  of  *good* mate    pairs  to include   in  the   statistics
	calculations. do -L 0 to include *all* good mate pairs.  default: 50,000
		
	[-i file_type]
		
	the type of file, file_type can be 'ascii' or 'binary'
		
	[-R] 
		
	indicator  that the read  sequence is included in  the input, as used by
	probcalc
		
	[-x max_reads_to_expect] 
		
	the maximum number of mappings one should expect for one read
		
	[-d]
		
	discordant analysis.
		
	[-u]
		
	in computing the mean, use unique mappings only
		
	[-T max_reads_to_output] 
		
	the maximum(ish) number of mappings to output per mate pair
		
	[-C PCHANCE_CUTOFF]
		
	pchance cutoff
		
	[-G PGENOME_CUTOFF]
		
	pgenome cutoff
		
	[-s nr_stdevs] 
		
	after computing  stats, M  (hard_distance_limit) is set   to the mean  +
	nr_stdev * stdev. Default: nr_stdevs = 2;
		
	[-c ] 
		
	do not print mate-pairs with hits on different chromosomes


'shrimp_var' takes only few parameters:

    Mandatory

	[-r | -p]
		
	tell shrimp_var weather the input files are rmapper or probcalc input.
		
    Optional
	
	[-R ]
		
	indicator that the  read sequence is included  in the input, as  used by
	probcalc

		
--------------------------------------------------------------------------------
5.0 Output Format
--------------------------------------------------------------------------------

rmapper  probcalc, and prettyprint all  adhere to a  common output format: lines
corresponding  to individual hits  with tab-delimited fields.  Such lines always
begin with a '>' character in the first position. All utilities ignore any lines
that do not begin with '>', such as alignments, comments, etc.

Here's an example ('\'  indicates continuation of the same  logical line on  the
next line of this README file and does not appear in the actual output):

    >947_1567_1384_F3       reftig_991      +       22901   22923   3       \
    25      25      2020    18x2x3

Additionally, the beginning  of each output file begins  with a specification of
the tab-delimited fields.  For example, the  following applies to the above read
hit:

    #FORMAT: readname contigname strand contigstart contigend readstart readend\
     readlength score editstring

The #FORMAT: line allows new fields to be unambiguously added, or others removed
or reordered without requiring alteration to the parsing routines.

Descriptions of the columns are as follows:

	'readname'	Read tag name
	'contigname'	Genome (Contig/Chromosome) name
	'strand'	Genome strand ('+' or '-')
	'contigstart'	Start of alignment in genome (beginning with 1, not 0).
	'contigend'	End of alignment in genome (inclusive).
	'readstart'	Start of alignment in read (beginning with 1, not 0).
	'readend'	End of alignment in read (inclusive).
	'readlength'	Length of the read in bases/colours.
	'score'		Alignment score
	'editstring'	Edit string: read to reference summary; see below.

Additionally, probcalc output adds the following columns:

	'pchance'	Probability that a read will align with a genome with 
			as good a score or better by chance.
	'pgenome'	Probability that a hit was generated via common 
			evolutionary events characteristic of the genome.
	'normodds'	Normalized pgenome/pchance.

The  'editstring' column  contains  a short   description of the  alignment with
reference  to the database    sequence.  This allows  at-a-glance   analysis  of
alignments, including identification of miscalls, SNPs, indels, etc.

The edit  string consists of numbers,  characters  and the  following additional
symbols: '-', '(' and ')'. It is constructed as follows:

    <number> = size of a matching substring
    <letter> = mismatch, value is the tag letter
    (<letters>) = gap in the reference, value shows the letters in the tag
    - = one-base gap in the tag (i.e. insertion in the reference)
    x = crossover (inserted between the appropriate two bases)

For example:

    A perfect match for 25-bp tags is: "25"
    A SNP at the 16th base of the tag is: "15A9"
    A four-base insertion in the reference: "3(TGCT)20"
    A four-base deletion in the reference: "5----20"
    Two sequencing errors: "4x15x6"	(i.e. 25 matches with 2 crossovers)
 
The 'probcalc_mp' program has the following output:

	#FORMAT: fwd_name fwd_chr fwd_editstring fwd_strand fwd_start\
	fwd_end fwd_pg rev_name rev_chr rev_editstring rev_strand rev_start\
	rev_end rev_pg distance normodds pgenome pchance
	
	for example:
	0       1000_1003_1062_F3       chr10   25      -       71183122\
	71183146        1.000   1000_1003_1062_R3       chr10   25\
	-       71183699    71183723 1.000   601     1.000   0.967  0.0000000010

	Descriptions of these columns are as follows:
	'fwd_name'		Forward Read tag name
	'fwd_chr'		Forward Genome (Contig/Chromosome) name
	'fwd_editstring'	Forward Edit string: read to reference summary;
				see above.
	'fwd_strand'		Forward Genome strand ('+' or '-')
	'fwd_start'		Forward Start of alignment in genome (beginning
				with 1, not 0).
	'fwd_end'		Forward End of alignment in genome (inclusive).
	'fwd_pg'		Forward pgenome.	
	'rev_name'		Reverse Read tag name
	'rev_chr'		Reverse Genome (Contig/Chromosome) name
	'rev_editstring'	Reverse Edit string: read to reference summary;
				see above.
	'rev_strand'		Reverse Genome strand ('+' or '-')
	'rev_start'		Reverse Start of alignment in genome (beginning
				with 1, not 0).
	'rev_end'		Reverse End of alignment in genome (inclusive).
	'rev_pg'		Reverse pgenome.
	'distance'		Distance between the mate pairs
	'pgenome'		Probability that the mate-pair hits were
				generated via common evolutionary events
				characteristic of the genome.
	'pchance'		Probability that a mate pair will align with a
				genome with as good a score or better by chance.
	'normodds'		Normalized pgenome/pchance.
	
The shrimp_var tool has a significantly different output format:
	>read_name edit_string contig_start #SNPs #ins #dels var_id1 var_id2 ...
	
	for example:
	>example_1 2G1---12(AC)2 101 1 1 1 s-G-103 d-3-105 i-AC-119
	
	Description of these columns are as follows:
	'read_name' 	The name of the read
	'edit_string'	Edit string: read to reference summary; see above
	'contig_start'	Start of alignment in genome (beginning with 1, not 0).
	'#SNPs'		The number of SNPs present in the hit alignment 
	'#ins'		The number of insertions present in the hit alignment
	'#dels' 	The number of deletions	present in the hit alignment
	'var_idX'	The variation id, description, and position. Details:

		if  variation is a  SNP, the  var_idX is s-L-pos,  where s means
		SNP, L is a  letter that the residue  is changed to, and  pos is
		the contig position of the SNP; if variation is a insertion, the
		var_idX is i-INS-pos where i  signals an insertion, INS includes
		the inserted   residues and pos  is  the contig position  of the
		insertion; if variation  is a deletion, the  var_idX is of  form
		d-len-pos,  with d signaling   a deletion, len representing  the
		length of the  deletion, and pos is the  contig position  of the
		deletion. More examples  and a   discussion on negative   strand
		handling is available in pdf via email.


--------------------------------------------------------------------------------
6. SHRiMP Algorithms
--------------------------------------------------------------------------------

SHRiMP  uses  a fast k-mer   scan in order  to  locate the  regions of potential
homology. One  important feature is  that we  index the  reads,  rather than the
reference genome.  This makes  our memory usage  independent of  the size of the
genome (but proportional to the  size of the  largest contig). All of the k-mers
in all of the source reads are  generated and a lookup  table mapping the k-mers
to the reads where they are present is built. We use spaced k-mers (also used in
the PatternHunter as well as many other tools) and allow the user to specify the
particular pattern of ones (positions that must match) and zeros (positions that
may differ).

For each read containing the kmer, a notation is added  that a kmer hit was just
made. If 'n' kmer hits can be fit in a genomic region of length 'm', the hit and
the  region are  fed into a   vectored  Smith-Waterman algorithm to generate  an
alignment score.  If   the score is  sufficient,  the score and  genomic  offset
(index) is remembered.  Once this process  has completed for the entire  genome,
top scoring saved read/index pairs are fed into a full Smith-Waterman algorithm,
and a detailed output is generated.

In order to deal with reads that match  the genome many  times, 'rmapper' uses a
hash-and-cache mechanism.  Every  time  a genomic region  is  investigated for a
match  against a given read (by  the vectored SW algorithm),   we compute a hash
value  for  that region, and  we  save it along  with its  resulting  score in a
per-read cache.  Subsequently, before  feeding another  genomic region into  the
vectored SW   algorithm, we compute its  hash  value and we look   it up  in the
cache. In the case of a cache hit, we use the saved score, bypassing the call to
the vector SW algorithm.  Hash collisions are possible  and they might (or might
not) result in   missed  matches, but  we trade  those   for non-negligible time
savings. This mechanism can be entirely disabled by giving the '-Z' option.  The
cache is implemented as a simple queue.

In order  to  reduce the cost of   the full, non-vectored,  final Smith-Waterman
alignment,  'rmapper' uses "anchors" or "necks"  to limit the  region  of the 2D
matrix that is computed.  For   each match, the  locations  of the kmers   which
triggered that match  during the  genomic scan are   saved; later, if  the match
makes it to  the  final stage, we   compute the minimum  rectangular region that
encloses all the kmer  matches, we add 'anchor_width'/2  space to its south-west
and north-east, and we only  compute the DP  match scores  for that region.  The
'anchor_width' is given  using '-A'. The  default is 8.  A  value of -1 disables
the use of anchors.

The speed  of execution depends largely on  the spaced seed employed, the number
of desired matches within a window,  the window length, and Smith-Waterman score
thresholds.   For example, aligning   22e6  25-colour  reads against  the  Ciona
Savignyi genome (173.67 Mbases) takes approximately 4.5 hours on a cluster of 50
4-way P4 Xeon  3.2GHz  Hyperthreaded CPUs  (200 cores) using   a spaced seed  of
length  8,  weight 7,  a   window length of  30,  2  matches per window,   and a
Smith-Waterman threshold of 1875 (penalties: 100 for  matches, -70 mismatch, -75
gap open, -25 gap extend, -200 crossover).

The closer   the expected matches are   to perfect, the   larger the spaced seed
weight can be, and  the faster the k-mer scan  runs. By decreasing the weight by
one, execution time increases by approximately  a factor of  four. Due to memory
limitations of our implementation, weights quickly become impractical around 16.
Longer weights are therefore handled is a  slightly different manner to overcome
this. Similarly, exceedingly small weights will offer  little or no benefit over
a full Smith-Waterman run.

Please  note that the  vectored Smith-Waterman algorithm is written specifically
for the SSE2 instruction set. Hence this  program will not run on non-x86/x86_64
processors. Porting those algorithms  to   similar vector processors  should  be
relatively straightforward.


--------------------------------------------------------------------------------
7. SHRiMP Statistics
--------------------------------------------------------------------------------

The 'probcalc' utility computes the probability that a hit would be generated by
chance in a random genome, the probability that it is generated by the reference
genome, and the normalized odds.

To compute the probability that a hit happened  by chance (p_chance), we compute
the probability that a match  equally good or better would  occur in a genome of
equal length where  at every position  each nucleotide can be randomly  selected
with  probability 0.25. To  compute this we  compute the number of k-mer strings
that have fewer then the observed number changes. In  the case of mismatches and
(for colour-space) crossovers, the number of such  strings is: \sum(i=0 -> subs)
{ (length  choose  i) *  3^i   }. For indels,  the  number  of such strings  is:
\sum(k=len_ref  -> len_read)  {  (k-1    choose  len_read-1) *  1^(len_ref)    *
3^(k-len_ref) * 4^(len_read-k)  }. Explanations of  these formulae will be  made
available in the  statistics section of  the SHRiMP paper. If  the hit does  not
span  the whole read,  we  further  multiply  by (read_length-hit_length  +1) to
account for the increased number of  matching strings. Now  let the total number
of strings be Z; then we define  p_chance as 1-(1-(Z/(4^k)))^genome_length. This
is the application of the Bayesian "noisy or"  to the probability that a certain
position matched one of the strings  as close or closer to  the read in question
than the real hit.

The second value  computed  by probcalc is the  probability  that each  hit  was
generated by the genome (p_genome). For this we use  a bootstrapping approach to
estimate the rates of matches, mismatches (SNPs), indels, and, for colour space,
crossovers, taking only the best hit  of every read  if it is below the p-chance
threshold. The probability that a certain region of the genome generated a given
read is exactly the product of the frequencies of  all of the events observed in
the alignment.

Finally   the    normalized  odds are   computed  by     summing   up  the  odds
(p_genome/p_chance) for every  hit for every read,  and  dividing each value  by
their sum.  This value represents how  much more likely is this  hit to be right
than all the other ones. For example  having two equally good  hits with lead to
normalized odd values of 0.5 for both, while having an almost  exact match and a
more distant one will lead to large discrepancies.


--------------------------------------------------------------------------------
8. Mate Pairs
--------------------------------------------------------------------------------

The probcalc_mp project  takes  in a  sorted file (ascii  or binary  filled with
mapping_t structures as defined in dbtypes.h). This is a file with read mappings
as outputted by the 'probcalc' tool. Assuming the binary  file is sorted by read
id,    probcalc_mp  will  find  all   the    mate  pair   mappings, assign  them
probability-based confidence values (see the SHRiMP paper), and output them. How
and which computations are done, and which mate-pairs are  output depends on the
given parameters. Due to the  fact that this software is  aimed at being used in
current NGS research, many parameters are provided.


--------------------------------------------------------------------------------
8. Contact
--------------------------------------------------------------------------------

The program website is 
    	http://compbio.cs.toronto.edu/shrimp

The authors of this software may be contacted at the following e-mail address:
	shrimp at cs dot toronto dot edu


--------------------------------------------------------------------------------
9. Acknowledgements
--------------------------------------------------------------------------------

Development  was performed at the University  of Toronto's Computational Biology
lab in collaboration with the Stanford University Sidow Lab.

The   development of this   distribution was made possible  in  part by National
Engineering and Research Council of Canada Undergraduate Student Research Awards
(NSERC USRAs).

We'd like to thank Dr. Alessandro Guffanti (Bioinformatics, Genomnia srl, Milan,
Italy),  Christine Voellenkle  and Jeroen  van Rooij  ( Policlinico  San Donato,
Milan, Italy) for their feedback on using SHRiMP for mapping micro RNA data.