Skip to content
grenaud edited this page Feb 14, 2013 · 12 revisions

freeIbis Manual


Command Line Options
Test data set
Paired End sequencing and Index Reads
Starting analysis from RTA output folder/unpacked SRF files
Cluster coordinates
How to specify control sequences ?
What is quality score recalibration?
Amount of training data
Using freeIbis with on a run without phiX
Training on a non-control read library / SOAP vs. Bowtie

We here a list of the command line options for Ibis:
Usage: runBaseCalling.py [options]

Options:
  -h, --help            show this help message and exit

  General:
    General options

    -c CORES, --cores=CORES
                        Number of CPU cores to be used (default 1)
    --corespred=CORESPRED
                        Number of CPU cores for prediction only (default 1)
    -e EXPID, --expID=EXPID
                        Name of experiment to use for sequence names
    -o OUTPATH, --outpath=OUTPATH
                        Path for output files
    -b BUSTARD, --bustard_path=BUSTARD
                        Path to bustard folder
    --NoFinishCheck     Do not check for s_[1-8]_finished.txt files in Bustard
                        folder.
    --coordianteType=COORDTYPE
                        Type of cluster coordinate transformation: round =
                        Pipeline <= 1.4: round(x), floor = Pipeline 1.5.*:
                        floor(abs(x)), shift_round (default) = Pipeline 1.6.*:
                        round(x*10+1000)
    --temp=TMP          Path to temporary folder (default /tmp/)
    --NoTestDataSet     Don't separate data for a test data set
    --mock              Do a mock run for testing
    --lock              Use a .lock file to avoid basecalling a run twice
    --quick             Use logistic regression for prediction instead of
                        SVMs, quicker but less accurate

  Filter/Paired End:
    Excluding parts of sequences/Handle paired end runs

    --start=START       First base to include/first base in each read (comma
                        separated, not including index)
    --end=END           Last base to include/last base in each read (comma
                        separated, not including index)
    --numberN=NUMBERN   Maximum number of missing bases to be accepted
                        (default 3)
    --indexlength=INDEXLENGTH
                        Length of index read (default 0)
    --2nd_indexlength=INDEXLENGTH2
                        Length of a second index read following the reverse
                        read (default 0)
    --control_index=CONTROL_INDEX
                        Sequence of index (only first index!) identifying
                        control reads used for training/test data set (default
                        '' = no filter)

  Training:
    Parameters for training

    -l LANES, --lanes=LANES
                        Lanes, example: 1,3,4-8 (default 4)
    -t TILES, --tiles=TILES
                        Tiles, example: 1-13,100 (default 1-120)
    -a ALIGN, --align=ALIGN
                        Path to SOAP/Bowtie binary
    -r REFERENCE, --reference=REFERENCE
                        Reference genome file
    -m MISMATCH, --mismatch=MISMATCH
                        Number of mismatches allowed (default 5/bowtie max 3)
    --maskMM            Mask mismatches from mapping output in training data
    --nomask            Do not mask variable positions on the control reads

  Prediction:
    Parameters for prediction

    --NoPrediction      Don't do prediction, just do training
    --prediction_lanes=PREDICTION_LANES
                        Lanes, example: 1,3,4-8 (default all)
    --prediction_tiles=PREDICTION_TILES
                        Tiles, example: 1-13,100 (default all)
    --fastq             Create output in fastq format
    --out4Q             Create output in 4Q format
    --predictionprog=PREDPROGRAM
                        Use this program for prediction

  Quality Scores:
    Parameters for quality scores

    --recalibration     Trigger quality score recalibration using control
                        reads
    --plotqual          Plot quality score recalibration lines (requires R)
    --maxqual=MAXQUAL   Set maximum allowed quality score (PHRED scale
                        default=40)
    --cgroup=CGROUP     Group consecutive cycles together for quality score
                        recalibration (useful for low amounts of control
                        sequences ex: miseq)

Here is a more verbose explanation of each argument:

Option Description Example of use
-c CORES, --cores=CORES By setting the number of cores here, Ibis will not spawn more than CORES processes. Please note that for large intensity files like those produced by HiSeq, we highly recommend using fewer cores (2-5) to avoid filling up the RAM. It is recommended to periodically monitor memory usage during the basecalling. -c 4 or --cores=4
-corespred=CORESPRED Run using CORESPRED when doing the prediction. This is done since the intensities have to be loaded at once for an entire tile. This can cause problems with low memory machines or with sequencers with very high throughput (e.g. HiSeq).
-e EXPID, --expID=EXPID This will append EXPID to the definition line in the resulting FASTQ file in the following manner: EXPID:lane number:tile number:x coodinate:y coodinate -e MYRUN_2012-01-01 or --expID=MYRUN_2012-01-01
-o OUTPATH, --outpath=OUTPATH The software will write final files in this directory, please note that this will not be the only directory used for writing files (see the --temp option) -o /home/user/mysequences or -outpath=/home/user/mysequences
-b BUSTARD, --bustard_path=BUSTARD The full path to the Bustard basecalls. The intensity files should be located in the directory above this one (the structure of the intensities varies a bit according to the version of the Illumina software). -b /mnt/ngsdata/MYRUN_2012-01-01/Data/Intensities/Bustard1.4.0_01-01-2012_sbsuser/ or --bustard_path=/mnt/ngsdata/MYRUN_2012-01-01/Data/Intensities/Bustard1.4.0_01-01-2012_sbsuser/
--NoFinishCheck Will not check if the *finished.txt exist (see Starting analysis from RTA output folder/unpacked SRF files)
--coordianteType=COORDTYPE To specify the function used by Ibis to revert the raw cluster coordinates stored by the sequencer to actual coordinates (see section on clusters coordinates) --coordianteType=shift_round
--temp=TMP The directory where the temporary files will be written to. --temp=/var/tmp/
--NoTestDataSet If used, Ibis will not use half of the controls as testing data. This is usually not recommended as there won't be any quality metric of the training but could bbe useful on runs with very limited amounts of training data
--mock Will not launch any actual training or basecalling but will only print the commands that would be used
--lock This will create a .lock in the output directory. If the basecalling is launched from another instance of freeIbis, the second will sleep until the .lock is removed. This is to facilitate basecalling within sequencing pipelines.
--quick Instead of using the SVM library to achieve prediction, Ibis will use the logistic regression in LIBLINEAR which will train more quickly but can slightly reduce the accuracy of the basecalling
--start=START To provide the index (or indices) of the first cycle(s) for the reads. See this section for a greater explanation of this parameter for paired-end reads. --start='1' --start='1,76'
--end=END To provide the index (or indices) of the last cycle(s) for the reads. See this section for a greater explanation of this parameter for paired-end reads. --start='75' --start='75,150'
--numberN=NUMBERN Control clusters will be kept if they have less than this number of unresolved base pairs ('N') --numberN=3
--indexlength=INDEXLENGTH Length (total number of cycles) of the first index coming immediately after the forward read. See this section for further explanation. --indexlength=7
--2nd_indexlength=INDEXLENGTH2 Length (total number of cycles) of the second index coming immediately after the reverse read. See this section for further explanation. --2nd_indexlength=7
--control_index=CONTROL_INDEX The index used to specify control sequences. See this section about specifying control sequences --control_index=TTGCCGC
-l LANES Which lanes to use to look for control sequences to build the training/testing data set, see specifying control sequences -l 1-8 or -l 4
-t TILES Which tiles to use to look for control sequences to build the training/testing data set, see specifying control sequences -t 1-100 or -t 4,45,82
-a ALIGN, --align=ALIGN To override the default path to the SOAP executables defined in params.py -a /usr/bin/soap
-r REFERENCE, --reference=REFERENCE To override the default path defined in params.py to the file containing the genome of the control organism. This should be changed as the one defined in params.py is tailored to our servers at MPI. -r /mnt/data/genomes/phix174.fa
-m MISMATCH, --mismatch=MISMATCH Maximum number of mismatches allowed by SOAP during the alignment to the control reference -m 3
--maskMM If used, Ibis will ignore bases from reads with a mismatch to the control reference genomes
--nomask When Ibis encounters a position on the control reference with a high number of mismatches for a given position with a strong bias to a given nucleotide, is masks this position and does not use training data from those positions. This options disables that and will use every position of the control reference regardless of likely mutations in the control used
--NoPrediction Will construct the training sets and train but will not predict the bases
--prediction_lanes=PREDICTION_LANES Limit prediction to only these lanes --prediction_lanes=1-3
--prediction_tiles=PREDICTION_TILES Limit prediction to only these tiles --prediction_tiles=1-50
--out4Q Produce output in the 4Q format which gives you the quality scores for every 4 nucleotide
--recalibration Use recalibration for the quality scores, takes longer but gives a higher correlation between observed error rate and the predicted one. Make sure you have enought training. See the section on recalibration
--plotqual Plot the calibration curves for each cycle
--maxqual=MAXQUAL Used to specify the maximum quality score (in PHRED scale) being produced by the basecalling procedure. See the section on recalibration --maxqual=40
--cgroup=CGROUP When there is a lack of control sequences, recalibration can fail and give a poor results due to the impossibility of evaluating error rates. This option will cluster consecutive cycles together to evaluate the error rate. This will increase the bias in cycle for quality scores See the section on recalibration --cgroup=5

The best way to get started is a test data set. Therefore, we extracted clusters from the control lane of a 51 cycle single read run. We extracted 2000 clusters per tile, resulting in 200,000 clusters in total (1.7% of the complete lane) and available for download (size 104MB). The control (PhiX RF1) was sequenced in lane 4 of the run. All other lanes and files not used by Ibis are missing in this archive. Eventhough this is not the typical way to use Ibis, this will allow for a quick demonstration. So download the archive via the link above or use wget; and unpack the archive:

$ wget http://bioinf.eva.mpg.de/Ibis/51cycle_phix_2000pT.tgz
$ tar vxzf 51cycle_phix_2000pT.tgz
$ cd Test_Data

Let us assume we downloaded the archive to the same folder that the Ibis installation was unpacked and installed in (otherwise use complete paths instead of the '..' notation). Most of the parameters you do not need to set, as we have a single read run and the training is already set to use all tiles of the lane 4. The script will automatically use every second tile as test data for quality score estimation and the other tiles for training. One should set the name of the experiment, the number of cores (in case of a multi-core machine), the input path for Bustard (Firecrest/IPAR files are assumed to be in the parent folder of Bustard), the output path and the reference sequence used. The output path does not have to exist, it is created if neccessary:
./runBaseCalling.py --coordianteType=round --start 1 --end 51 -c 6 -e TEST_RUN -o Ibis_Basecall -b C1-51_Firecrest1.9.5/Bustard1.9.5/ -r solexaPhix.fa

Depending on the system this may take several minutes (on our eight core system about 10 minutes). Due to the missing lanes, we may get error messages at the very last step from cat, however these we can ignore. The file of interest, the 200'000 newly called PhiX sequences can now be found in Ibis_Basecall/s_4_sequence.txt as a FastQ file (having alternating lines of sequences and quality scores separated by header lines) with Sanger offset (33) for quality score encoding:

$ head Ibis_Basecall/s_4_sequence.txt
@TEST_RUN:4:1:1539:1398
ACGAGCACGAGAGCGGTCAGTAGCAATCCAAACTTTGTTACTCGTCAGAAA
+
89>HB8;7??=@@6<<87?<3<A4><775?==3633<229322;2169567
@TEST_RUN:4:1:1610:778
ATCACCTTGAATGCCACCGGAGGCGGCTTTTTGACCGCCTCCAAACAATTT
+
7<8G8769=>?6>58>66::<:=3<=1165687714722124253.43-1,
@TEST_RUN:4:1:1493:727
CCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTT
Such a FastQ output file could be directly used with programs like bowtie or maq/bwa - including their tools for SNP calling (e.g. samtools pileup). When using SOAP you either have to set the correct offset (possible with SOAP v1 with '-z !') or use fasta as input with SOAP v2. With the test data, we also provided scripts that allow you to convert FastQ to Fasta and to extract the raw sequences in Fasta format from the Bustard folder. We will now use these together with SOAP for checking whether using Ibis was successful.

First extract the raw sequences from Bustard (resulting in s_4_sequence.txt):

$ ./Bustard2Fa.py -p C1-51_Firecrest1.9.5/Bustard1.9.5/ -e TEST_DATA -l 4
$ mv s_4_sequence.txt s_4_sequence.Bustard.fa
Convert the FastQ file from Ibis to Fasta (resulting in s_4_sequence.fa):
$ ./FastQ2Fa.py Ibis_Basecall/s_4_sequence.txt -o .
$ mv s_4_sequence.fa s_4_sequence.Ibis.fa
Now we can run SOAP v1.11 (provided with Ibis) and map the raw sequences of both base callers to the PhiX reference sequence provided (allowing up to 5 mismatches and using 8 cores):
$ ./soap_1.11_patched/soap -d solexaPhix.fa -a s_4_sequence.Ibis.fa -o s_4_sequence.Ibis.soap -v 5 -p 8
Start at: Thu Apr 9 09:56:28 2009

Load in 1 db seqs, total size 5386 bp. 1 secs passed
total_kmers: 1048576
Create seed table. 1 secs passed
Single read alignment:
Query: s_4_sequence.Ibis.fa Reference: solexaPhix.fa Output: s_4_sequence.Ibis.soap
10000 reads finished. 1 secs passed
20000 reads finished. 1 secs passed
30000 reads finished. 1 secs passed
40000 reads finished. 1 secs passed
60000 reads finished. 1 secs passed
50000 reads finished. 1 secs passed
80000 reads finished. 2 secs passed
70000 reads finished. 2 secs passed
90000 reads finished. 2 secs passed
110000 reads finished. 2 secs passed
100000 reads finished. 2 secs passed
120000 reads finished. 2 secs passed
130000 reads finished. 2 secs passed
160000 reads finished. 2 secs passed
140000 reads finished. 2 secs passed
170000 reads finished. 2 secs passed
180000 reads finished. 2 secs passed
190000 reads finished. 2 secs passed
150000 reads finished. 2 secs passed
200000 reads finished. 2 secs passed
Total number of aligned reads: 176253 (88%)
Done.
Finished at Thu Apr 9 09:56:30 2009
Total time consumed: 2 secs

$ ./soap_1.11_patched/soap -d solexaPhix.fa -a s_4_sequence.Bustard.fa -o s_4_sequence.Bustard.soap -v 5 -p 8
Start at: Thu Apr 9 09:57:03 2009

Load in 1 db seqs, total size 5386 bp. 0 secs passed
total_kmers: 1048576
Create seed table. 0 ses passed
Single read alignment:
Query: s_4_sequence.Bustard.fa Reference: solexaPhix.fa Output: s_4_sequence.Bustard.soap
10000 reads finished. 0 secs passed
30000 reads finished. 1 secs passed
20000 reads finished. 1 secs passed
40000 reads finished. 1 secs passed
60000 reads finished. 1 secs passed
80000 reads finished. 1 secs passed
90000 reads finished. 1 secs passed
70000 reads finished. 1 secs passed
50000 reads finished. 1 secs passed
100000 reads finished. 1 secs passed
110000 reads finished. 1 secs passed
120000 reads finished. 1 secs passed
130000 reads finished. 1 secs passed
150000 reads finished. 1 secs passed
140000 reads finished. 1 secs passed
160000 reads finished. 1 secs passed
190000 reads finished. 2 secs passed
170000 reads finished. 2 secs passed
180000 reads finished. 2 secs passed
200000 reads finished. 2 secs passed
Total number of aligned reads: 167401 (84%)
Done.
Finished at Thu Apr 9 09:57:05 2009
Total time consumed: 2 secs

What we can already see from the SOAP output, of the raw sequences called with Ibis ~4% more sequences can be aligned to the reference. We can also take look on the number of mismatches, which is available in column 10 of the SOAP output. Now we can count how often we have seen a read matching with 0,1,.. up to 5 mismatches.

$ cut -f 10 s_4_sequence.Ibis.soap | sort | uniq -c
 131945 0
  22991 1
   9541 2
   5559 3
   3643 4
   2574 5
$ cut -f 10 s_4_sequence.Bustard.soap | sort | uniq -c
  88834 0
  38708 1
  18138 2
  10703 3
   6583 4
   4435 5
Using these numbers we can calculate a lower error rate of this sequencing run by assuming that the not mapped sequences would be mapped when allowing one more mismatch:

Ibis: (22991*1+9541*2+5559*3+3643*4+2574*5+23747*6)/(51*200000) = 2.24%

Bustard: (38708*1+18138*2+10703*3+6583*4+4435*5+6*32599)/(51*200000) = 3.44%

So we were able to reduce the error rate of this small test run by a factor of 1.5 (to 65.1%).

The above example shows how to base call single read runs using Ibis. If you want to use Ibis for Paired End runs, the approach is very similar. Only two more parameters have to be set: the start and the end of the first and second read.

$ ./runBaseCalling.py -c 8 -e <Run_name> -o Ibis_Basecall -b <Bustard_folder> -r solexaPhix.fa --start='1,52' --end='51,102'

There is a special behavior if there are gaps in the two parts of the sequence defined by start and end. If for example the following parameters are set --start='1,60' --end='51,110' then there are eight bases that do not seem to belong to read1 but are also not yet read2. In this case Ibis will ignore those eight bases for creating a training data set, however for later prediction it will train a model on cycle 60 and apply it to 52, it will also train a model for cycle 61 and apply it to cycles 53 to 61. This behavior was designed to handle tag sequences at the beginning of a read.

For indexed samples the behavior should be slighlty different, in this case we would like to extrapolate from base calling the first bases of the first read to base calling of the index read. Starting with Ibis 1.0.1 there exists a index flag to runBaseCalling.py which implements this behavior when defining the number of cycles of the index read (--indexlength=INDEXLENGTH). Using the index flag makes is neccessary to define the length of the first read (if PE, the length of both reads). For an indexed PE sequencing run of 2x51 cycles and an index read of 6 cycles the parameters would be the following: --start=1,58 --end=51,108 --indexlength=6 For an double-indexed PE sequencing run of 2x51 cycles and an index read of 6 cycles and an second index of 6 cycles (occuring after the reverse read), the parameters would be the following: --start=1,58 --end=51,108 --indexlength=6 --2nd_indexlength=6

For Paired Ends the output file will also be one single FastQ file containing both (all three) reads together in one sequence line without any separators (reflecting the behaviour of the Illumina Pipeline up to version 1.3.2). To separate the reads into two files (index read being merged with the first read), you can use the script provided here.

When running Ibis on RTA output folders or folders unpacked from SRF files, the BaseCalls/Bustard folder does not contain s_[1-8]_finished.txt files for which the runBaseCalling.py script checks. In this case runBaseCalling.py terminates with the message: "Need path to finished bustard folder. The given folder misses [...]". In this case, please run runBaseCalling.py with the --NoFinishCheck option or change to the Bustard/Intensities folder and create the files using the following command line:

$ for i in 1 2 3 4 5 6 7 8; do touch s_${i}_finished.txt; done

The assignment of intensity values to cluster X:Y coordinates used for sequences identifiers is done differently for the different versions of the Illumina Pipeline. While the original Firecrest intensity tables contained the very same coordinates that were then used for the sequence identifiers, IPAR software created separated text files defining floating point positions of each cluster. These were then rounded to obtain the numbers used in the sequence identifier. With RTA and pipelines after 1.3.2 the rounding was replaced by taking the absolut value and truncating the float numbers at the decimal point (floor). In the most recent version (OLB/RTA 1.6) the numbers are multiplied by 10, 1000 is added and the resulting value is rounded to the next integer. This makes it necessary to define the correct transformation with each Ibis call - otherwise Ibis will not be able to find the intensities values of sequences in the training data set and model training will fail. In a multiplexed run, the index sequence control reads must be specified using the "--control_index" parameter. Please note that this parameter is solely for the first index. If indices are not used and certain lanes are dedicated to control sequences, the "-l LANES, --lanes=LANES" option must be used to specify them. In the case where there more control reads than needed by Ibis (e.g. a HiSeq run with an entire lane dedicated to controls), the "-t TILES, --tiles=TILES" can be used to limit the amount of training data so avoid slowing down training (see the Amount of training data) Quality scores are the probability of error for individual errors. Quality score recalibration attempts to model the observed error rate in control reads as a function of the SVM decision score to subsequently use this correlation to produce quality scores that model more accurately the observed error rate for the remaining sequences. This option is triggered using the "--recalibration" option. Make sure a sufficient amount of training data is available as Ibis will evaluate the error on a per cycle, per nucleotide. If a small amount of training data is available, consider using "--cgroup=CGROUP" to cluster consecutive cycles together to evaluate the error rate. However, please note that this will increase cycle bias in the resulting quality scores (observed error rate according to cycle). The maximum quality score (PHRED scale) predicted by Ibis can be specified using the "--maxqual=MAXQUAL" parameter. This has to be lowered for runs using older version of the chemistry kit (e.g. 37,38) and can be increased for more recent sequencers MiSeq (e.g. 40,41). In tests we have seen that Ibis does not get considerably better the more training data is used and that as little as 50'000 reads already allow for 97% of the improvement. We saw the performence increase leveling off at something like 1 million reads. So we do recommend having at least a million reads for training and another million for testing, i.e. 2 million reads in each training_seqs and training_seqs_r2. Taking more data will mostly slow down the whole process for a very small gain in prediction performance. Thus we recommend using only a subset of tiles for obtaining the training data set from a dedicated control lane (e.g. --training_tiles='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,120'). As indicated in the example, tiles should be spread over the whole lane. What do to if you have a run without phiX control ? Can you still use freeIbis ? Yes but the procedure will be slower and we cannot guarantee you get good better results than Bustard. However, we highly recommend not to sequence without some PhiX spiked in, this is instrumental to evaluate how good your run was. A few pointers:
you need a mature genome build for the organism
Try to evaluate the number of clusters per tile and estimate the percentage of aligned reads, use this to estimate how many tiles you will need to reach at least 1 million training examples
To run freeIbis:
Specify the lane (using -l) and the tiles (using -t) of the tiles you chose for training
Specify --nomask
Do not attempt to calibrate quality scores, using PhiX we assume the divergence to the reference is 0 and any mismatch is a sequencing error (once masking is performed), this assumption is violated for non-control targets
If you have very poor nucleotide diversity for a given cycle and cross-talk estimates failed for Bustard, freeIbis will be unlikely to improve the accuracy. Ibis was designed with the idea of having some type of control reads with known reference sequence when creating the training data set. These control reads shall allow, when aligning them against the known reference sequence, to consider every single mismatch as a putative sequencing/base calling error and to use the reference base as the correct base in training. This will allow to have cases in the training data set which are very informative because it is so hard to tell which base it actually is (since Bustard was wrong about it). For this reason the reference should not have any biological divergence from the control reads and it should allow unambiguous alignments with many mismatches for getting most control reads aligned/obtaining a big training data set. These requirements (especially the unambiguous placement with many mismatches) are typically only fulfilled by small reference genomes, like the PhiX control used by Illumina (either as dedicated control lane or as spiked in control). SOAPv1 is able to very quickly create a kmer/seed table for such a small genome and it allows up to 5 mismatches when searching in this table. Bowtie has a more clever "seeding" strategy, however has its limits with the number of mismatches and even trims reads back if mismatches get to frequent at the end of the read. Further bowtie is not faster when the genome is really small, bowtie basically only scales with the number of reads aligned and the number of mismatches allowed. As a result the highest number of mismatches you can have with bowtie is 3 (when requiring that the whole read to be aligned). So bowtie does not perform better on the original task Ibis was designed for.

However, if one does not have control reads, but a close-by reference (which could be big mammalian genome) for the reads then one would like to do something else. Now one has to assume that since reads are close to the reference something like 99% of the bases should be identical. Actually, if one now sees a difference to the reference one does not know whether it is from divergence or from sequence error (depending on the position in the read the one is more likely than the other). This will cause issues in Ibis's quality score calibration. In cases where error is lower than divergence one might not even want to consider the reference to be correct at all (use --maskMM to use only bases where read and reference match). Further if the reference is big, alignment might now introduce another problem - one might find the position closest to the read sequence but not necessarily find the correct position that the read would go to if its sequence was determined correctly or if it did not have any divergence to the reference. Unfortunately there seems no good way around it, except trusting that due to the low divergence from the reference 99% of the bases will still be correct and hoping that the statistical learner deals well with mislabeled samples - i.e. generalizes well. The SVM used by Ibis seems to generalize well, especially as its simple kernel (linear) does not allow to over train these samples.

Only in a situation where instead of small control genome a big regular genome is used, SOAPv1 gets really slow and bowtie gets useful. There is another factor that makes SOAPv1 a pain on big genomes. SOAPv1 ignores Ns in the query and does not report the reference base for an N in its output. That is why Ibis extracts training sequence directly from the reference genome instead of reconstructing it from the SOAP output. However bowtie, handles Ns as mismatching bases and thus allows to reconstruct the reference sequence without requiring the reference genome. So when using bowtie one can save this step. On the other hand, bowtie has a reduced sensitivity (due to the maximum of three mismatches). Especially for long reads this might not be useful. So we needed to allow bowtie to match more mismatches and the easiest way around this problem was splitting up reads as it also done for SOAPv1 when reads above 60nt shall be processed. In the current version, reads longer 40nt will be split up (increasing bowtie runtime). In case one would like to change this behavior it is controlled by the max_length_bowtie variable.

In summary, running without real control reads does not perform as good as running with them and one normally will not need bowtie for the control read situation as there is no big sized control genome without redundancy.