-
Notifications
You must be signed in to change notification settings - Fork 4
Home
Command Line Options
Test data set
Paired End sequencing and Index Reads
Starting analysis from RTA output folder/unpacked SRF files
Cluster coordinates
How to specify control sequences ?
What is quality score recalibration?
Amount of training data
Using freeIbis with on a run without phiX
Training on a non-control read library / SOAP vs. Bowtie
We here a list of the command line options for Ibis:
Usage: runBaseCalling.py [options] Options: -h, --help show this help message and exit General: General options -c CORES, --cores=CORES Number of CPU cores to be used (default 1) --corespred=CORESPRED Number of CPU cores for prediction only (default 1) -e EXPID, --expID=EXPID Name of experiment to use for sequence names -o OUTPATH, --outpath=OUTPATH Path for output files -b BUSTARD, --bustard_path=BUSTARD Path to bustard folder --NoFinishCheck Do not check for s_[1-8]_finished.txt files in Bustard folder. --coordianteType=COORDTYPE Type of cluster coordinate transformation: round = Pipeline <= 1.4: round(x), floor = Pipeline 1.5.*: floor(abs(x)), shift_round (default) = Pipeline 1.6.*: round(x*10+1000) --temp=TMP Path to temporary folder (default /tmp/) --NoTestDataSet Don't separate data for a test data set --mock Do a mock run for testing --lock Use a .lock file to avoid basecalling a run twice --quick Use logistic regression for prediction instead of SVMs, quicker but less accurate Filter/Paired End: Excluding parts of sequences/Handle paired end runs --start=START First base to include/first base in each read (comma separated, not including index) --end=END Last base to include/last base in each read (comma separated, not including index) --numberN=NUMBERN Maximum number of missing bases to be accepted (default 3) --indexlength=INDEXLENGTH Length of index read (default 0) --2nd_indexlength=INDEXLENGTH2 Length of a second index read following the reverse read (default 0) --control_index=CONTROL_INDEX Sequence of index (only first index!) identifying control reads used for training/test data set (default '' = no filter) Training: Parameters for training -l LANES, --lanes=LANES Lanes, example: 1,3,4-8 (default 4) -t TILES, --tiles=TILES Tiles, example: 1-13,100 (default 1-120) -a ALIGN, --align=ALIGN Path to SOAP/Bowtie binary -r REFERENCE, --reference=REFERENCE Reference genome file -m MISMATCH, --mismatch=MISMATCH Number of mismatches allowed (default 5/bowtie max 3) --maskMM Mask mismatches from mapping output in training data --nomask Do not mask variable positions on the control reads Prediction: Parameters for prediction --NoPrediction Don't do prediction, just do training --prediction_lanes=PREDICTION_LANES Lanes, example: 1,3,4-8 (default all) --prediction_tiles=PREDICTION_TILES Tiles, example: 1-13,100 (default all) --fastq Create output in fastq format --out4Q Create output in 4Q format --predictionprog=PREDPROGRAM Use this program for prediction Quality Scores: Parameters for quality scores --recalibration Trigger quality score recalibration using control reads --plotqual Plot quality score recalibration lines (requires R) --maxqual=MAXQUAL Set maximum allowed quality score (PHRED scale default=40) --cgroup=CGROUP Group consecutive cycles together for quality score recalibration (useful for low amounts of control sequences ex: miseq)
Here is a more verbose explanation of each argument:
Option | Description | Example of use |
---|---|---|
-c CORES, --cores=CORES | By setting the number of cores here, Ibis will not spawn more than CORES processes. Please note that for large intensity files like those produced by HiSeq, we highly recommend using fewer cores (2-5) to avoid filling up the RAM. It is recommended to periodically monitor memory usage during the basecalling. | -c 4 or --cores=4 |
-corespred=CORESPRED | Run using CORESPRED when doing the prediction. This is done since the intensities have to be loaded at once for an entire tile. This can cause problems with low memory machines or with sequencers with very high throughput (e.g. HiSeq). | |
-e EXPID, --expID=EXPID | This will append EXPID to the definition line in the resulting FASTQ file in the following manner: EXPID:lane number:tile number:x coodinate:y coodinate | -e MYRUN_2012-01-01 or --expID=MYRUN_2012-01-01 |
-o OUTPATH, --outpath=OUTPATH | The software will write final files in this directory, please note that this will not be the only directory used for writing files (see the --temp option) | -o /home/user/mysequences or -outpath=/home/user/mysequences |
-b BUSTARD, --bustard_path=BUSTARD | The full path to the Bustard basecalls. The intensity files should be located in the directory above this one (the structure of the intensities varies a bit according to the version of the Illumina software). | -b /mnt/ngsdata/MYRUN_2012-01-01/Data/Intensities/Bustard1.4.0_01-01-2012_sbsuser/ or --bustard_path=/mnt/ngsdata/MYRUN_2012-01-01/Data/Intensities/Bustard1.4.0_01-01-2012_sbsuser/ |
--NoFinishCheck | Will not check if the *finished.txt exist (see Starting analysis from RTA output folder/unpacked SRF files) | |
--coordianteType=COORDTYPE | To specify the function used by Ibis to revert the raw cluster coordinates stored by the sequencer to actual coordinates (see section on clusters coordinates) | --coordianteType=shift_round |
--temp=TMP | The directory where the temporary files will be written to. | --temp=/var/tmp/ |
--NoTestDataSet | If used, Ibis will not use half of the controls as testing data. This is usually not recommended as there won't be any quality metric of the training but could bbe useful on runs with very limited amounts of training data | |
--mock | Will not launch any actual training or basecalling but will only print the commands that would be used | |
--lock | This will create a .lock in the output directory. If the basecalling is launched from another instance of freeIbis, the second will sleep until the .lock is removed. This is to facilitate basecalling within sequencing pipelines. | |
--quick | Instead of using the SVM library to achieve prediction, Ibis will use the logistic regression in LIBLINEAR which will train more quickly but can slightly reduce the accuracy of the basecalling | |
--start=START | To provide the index (or indices) of the first cycle(s) for the reads. See this section for a greater explanation of this parameter for paired-end reads. | --start='1' --start='1,76' |
--end=END | To provide the index (or indices) of the last cycle(s) for the reads. See this section for a greater explanation of this parameter for paired-end reads. | --start='75' --start='75,150' |
--numberN=NUMBERN | Control clusters will be kept if they have less than this number of unresolved base pairs ('N') | --numberN=3 |
--indexlength=INDEXLENGTH | Length (total number of cycles) of the first index coming immediately after the forward read. See this section for further explanation. | --indexlength=7 |
--2nd_indexlength=INDEXLENGTH2 | Length (total number of cycles) of the second index coming immediately after the reverse read. See this section for further explanation. | --2nd_indexlength=7 |
--control_index=CONTROL_INDEX | The index used to specify control sequences. See this section about specifying control sequences | --control_index=TTGCCGC |
-l LANES | Which lanes to use to look for control sequences to build the training/testing data set, see specifying control sequences | -l 1-8 or -l 4 |
-t TILES | Which tiles to use to look for control sequences to build the training/testing data set, see specifying control sequences | -t 1-100 or -t 4,45,82 |
-a ALIGN, --align=ALIGN | To override the default path to the SOAP executables defined in params.py | -a /usr/bin/soap |
-r REFERENCE, --reference=REFERENCE | To override the default path defined in params.py to the file containing the genome of the control organism. This should be changed as the one defined in params.py is tailored to our servers at MPI. | -r /mnt/data/genomes/phix174.fa |
-m MISMATCH, --mismatch=MISMATCH | Maximum number of mismatches allowed by SOAP during the alignment to the control reference | -m 3 |
--maskMM | If used, Ibis will ignore bases from reads with a mismatch to the control reference genomes | |
--nomask | When Ibis encounters a position on the control reference with a high number of mismatches for a given position with a strong bias to a given nucleotide, is masks this position and does not use training data from those positions. This options disables that and will use every position of the control reference regardless of likely mutations in the control used | |
--NoPrediction | Will construct the training sets and train but will not predict the bases | |
--prediction_lanes=PREDICTION_LANES | Limit prediction to only these lanes | --prediction_lanes=1-3 |
--prediction_tiles=PREDICTION_TILES | Limit prediction to only these tiles | --prediction_tiles=1-50 |
--out4Q | Produce output in the 4Q format which gives you the quality scores for every 4 nucleotide | |
--recalibration | Use recalibration for the quality scores, takes longer but gives a higher correlation between observed error rate and the predicted one. Make sure you have enought training. See the section on recalibration | |
--plotqual | Plot the calibration curves for each cycle | |
--maxqual=MAXQUAL | Used to specify the maximum quality score (in PHRED scale) being produced by the basecalling procedure. See the section on recalibration | --maxqual=40 |
--cgroup=CGROUP | When there is a lack of control sequences, recalibration can fail and give a poor results due to the impossibility of evaluating error rates. This option will cluster consecutive cycles together to evaluate the error rate. This will increase the bias in cycle for quality scores See the section on recalibration | --cgroup=5 |
The best way to get started is a test data set. Therefore, we extracted clusters from the control lane of a 51 cycle single read run. We extracted 2000 clusters per tile, resulting in 200,000 clusters in total (1.7% of the complete lane) and available for download (size 104MB). The control (PhiX RF1) was sequenced in lane 4 of the run. All other lanes and files not used by Ibis are missing in this archive. Eventhough this is not the typical way to use Ibis, this will allow for a quick demonstration. So download the archive via the link above or use wget; and unpack the archive:
$ wget http://bioinf.eva.mpg.de/Ibis/51cycle_phix_2000pT.tgz
$ tar vxzf 51cycle_phix_2000pT.tgz
$ cd Test_Data
./runBaseCalling.py --coordianteType=round --start 1 --end 51 -c 6 -e TEST_RUN -o Ibis_Basecall -b C1-51_Firecrest1.9.5/Bustard1.9.5/ -r solexaPhix.fa
Depending on the system this may take several minutes (on our eight core system about 10 minutes). Due to the missing lanes, we may get error messages at the very last step from cat, however these we can ignore. The file of interest, the 200'000 newly called PhiX sequences can now be found in Ibis_Basecall/s_4_sequence.txt as a FastQ file (having alternating lines of sequences and quality scores separated by header lines) with Sanger offset (33) for quality score encoding:
$ head Ibis_Basecall/s_4_sequence.txt @TEST_RUN:4:1:1539:1398 ACGAGCACGAGAGCGGTCAGTAGCAATCCAAACTTTGTTACTCGTCAGAAA + 89>HB8;7??=@@6<<87?<3<A4><775?==3633<229322;2169567 @TEST_RUN:4:1:1610:778 ATCACCTTGAATGCCACCGGAGGCGGCTTTTTGACCGCCTCCAAACAATTT + 7<8G8769=>?6>58>66::<:=3<=1165687714722124253.43-1, @TEST_RUN:4:1:1493:727 CCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTSuch a FastQ output file could be directly used with programs like bowtie or maq/bwa - including their tools for SNP calling (e.g. samtools pileup). When using SOAP you either have to set the correct offset (possible with SOAP v1 with '-z !') or use fasta as input with SOAP v2. With the test data, we also provided scripts that allow you to convert FastQ to Fasta and to extract the raw sequences in Fasta format from the Bustard folder. We will now use these together with SOAP for checking whether using Ibis was successful.
First extract the raw sequences from Bustard (resulting in s_4_sequence.txt):
$ ./Bustard2Fa.py -p C1-51_Firecrest1.9.5/Bustard1.9.5/ -e TEST_DATA -l 4 $ mv s_4_sequence.txt s_4_sequence.Bustard.faConvert the FastQ file from Ibis to Fasta (resulting in s_4_sequence.fa):
$ ./FastQ2Fa.py Ibis_Basecall/s_4_sequence.txt -o . $ mv s_4_sequence.fa s_4_sequence.Ibis.faNow we can run SOAP v1.11 (provided with Ibis) and map the raw sequences of both base callers to the PhiX reference sequence provided (allowing up to 5 mismatches and using 8 cores):
$ ./soap_1.11_patched/soap -d solexaPhix.fa -a s_4_sequence.Ibis.fa -o s_4_sequence.Ibis.soap -v 5 -p 8 Start at: Thu Apr 9 09:56:28 2009 Load in 1 db seqs, total size 5386 bp. 1 secs passed total_kmers: 1048576 Create seed table. 1 secs passed Single read alignment: Query: s_4_sequence.Ibis.fa Reference: solexaPhix.fa Output: s_4_sequence.Ibis.soap 10000 reads finished. 1 secs passed 20000 reads finished. 1 secs passed 30000 reads finished. 1 secs passed 40000 reads finished. 1 secs passed 60000 reads finished. 1 secs passed 50000 reads finished. 1 secs passed 80000 reads finished. 2 secs passed 70000 reads finished. 2 secs passed 90000 reads finished. 2 secs passed 110000 reads finished. 2 secs passed 100000 reads finished. 2 secs passed 120000 reads finished. 2 secs passed 130000 reads finished. 2 secs passed 160000 reads finished. 2 secs passed 140000 reads finished. 2 secs passed 170000 reads finished. 2 secs passed 180000 reads finished. 2 secs passed 190000 reads finished. 2 secs passed 150000 reads finished. 2 secs passed 200000 reads finished. 2 secs passed Total number of aligned reads: 176253 (88%) Done. Finished at Thu Apr 9 09:56:30 2009 Total time consumed: 2 secs $ ./soap_1.11_patched/soap -d solexaPhix.fa -a s_4_sequence.Bustard.fa -o s_4_sequence.Bustard.soap -v 5 -p 8 Start at: Thu Apr 9 09:57:03 2009 Load in 1 db seqs, total size 5386 bp. 0 secs passed total_kmers: 1048576 Create seed table. 0 ses passed Single read alignment: Query: s_4_sequence.Bustard.fa Reference: solexaPhix.fa Output: s_4_sequence.Bustard.soap 10000 reads finished. 0 secs passed 30000 reads finished. 1 secs passed 20000 reads finished. 1 secs passed 40000 reads finished. 1 secs passed 60000 reads finished. 1 secs passed 80000 reads finished. 1 secs passed 90000 reads finished. 1 secs passed 70000 reads finished. 1 secs passed 50000 reads finished. 1 secs passed 100000 reads finished. 1 secs passed 110000 reads finished. 1 secs passed 120000 reads finished. 1 secs passed 130000 reads finished. 1 secs passed 150000 reads finished. 1 secs passed 140000 reads finished. 1 secs passed 160000 reads finished. 1 secs passed 190000 reads finished. 2 secs passed 170000 reads finished. 2 secs passed 180000 reads finished. 2 secs passed 200000 reads finished. 2 secs passed Total number of aligned reads: 167401 (84%) Done. Finished at Thu Apr 9 09:57:05 2009 Total time consumed: 2 secs
What we can already see from the SOAP output, of the raw sequences called with Ibis ~4% more sequences can be aligned to the reference. We can also take look on the number of mismatches, which is available in column 10 of the SOAP output. Now we can count how often we have seen a read matching with 0,1,.. up to 5 mismatches.
$ cut -f 10 s_4_sequence.Ibis.soap | sort | uniq -c 131945 0 22991 1 9541 2 5559 3 3643 4 2574 5 $ cut -f 10 s_4_sequence.Bustard.soap | sort | uniq -c 88834 0 38708 1 18138 2 10703 3 6583 4 4435 5Using these numbers we can calculate a lower error rate of this sequencing run by assuming that the not mapped sequences would be mapped when allowing one more mismatch:
Ibis: (22991*1+9541*2+5559*3+3643*4+2574*5+23747*6)/(51*200000) = 2.24%
Bustard: (38708*1+18138*2+10703*3+6583*4+4435*5+6*32599)/(51*200000) = 3.44%
So we were able to reduce the error rate of this small test run by a factor of 1.5 (to 65.1%).
$ ./runBaseCalling.py -c 8 -e <Run_name> -o Ibis_Basecall -b <Bustard_folder> -r solexaPhix.fa --start='1,52' --end='51,102'
There is a special behavior if there are gaps in the two parts of the sequence defined by start and end. If for example the following parameters are set --start='1,60' --end='51,110' then there are eight bases that do not seem to belong to read1 but are also not yet read2. In this case Ibis will ignore those eight bases for creating a training data set, however for later prediction it will train a model on cycle 60 and apply it to 52, it will also train a model for cycle 61 and apply it to cycles 53 to 61. This behavior was designed to handle tag sequences at the beginning of a read.
For indexed samples the behavior should be slighlty different, in this case we would like to extrapolate from base calling the first bases of the first read to base calling of the index read. Starting with Ibis 1.0.1 there exists a index flag to runBaseCalling.py which implements this behavior when defining the number of cycles of the index read (--indexlength=INDEXLENGTH). Using the index flag makes is neccessary to define the length of the first read (if PE, the length of both reads). For an indexed PE sequencing run of 2x51 cycles and an index read of 6 cycles the parameters would be the following: --start=1,58 --end=51,108 --indexlength=6 For an double-indexed PE sequencing run of 2x51 cycles and an index read of 6 cycles and an second index of 6 cycles (occuring after the reverse read), the parameters would be the following: --start=1,58 --end=51,108 --indexlength=6 --2nd_indexlength=6
For Paired Ends the output file will also be one single FastQ file containing both (all three) reads together in one sequence line without any separators (reflecting the behaviour of the Illumina Pipeline up to version 1.3.2). To separate the reads into two files (index read being merged with the first read), you can use the script provided here.
$ for i in 1 2 3 4 5 6 7 8; do touch s_${i}_finished.txt; done
The assignment of intensity values to cluster X:Y coordinates used for sequences identifiers is done differently for the different versions of the Illumina Pipeline. While the original Firecrest intensity tables contained the very same coordinates that were then used for the sequence identifiers, IPAR software created separated text files defining floating point positions of each cluster. These were then rounded to obtain the numbers used in the sequence identifier. With RTA and pipelines after 1.3.2 the rounding was replaced by taking the absolut value and truncating the float numbers at the decimal point (floor). In the most recent version (OLB/RTA 1.6) the numbers are multiplied by 10, 1000 is added and the resulting value is rounded to the next integer. This makes it necessary to define the correct transformation with each Ibis call - otherwise Ibis will not be able to find the intensities values of sequences in the training data set and model training will fail. In a multiplexed run, the index sequence control reads must be specified using the "--control_index" parameter. Please note that this parameter is solely for the first index. If indices are not used and certain lanes are dedicated to control sequences, the "-l LANES, --lanes=LANES" option must be used to specify them. In the case where there more control reads than needed by Ibis (e.g. a HiSeq run with an entire lane dedicated to controls), the "-t TILES, --tiles=TILES" can be used to limit the amount of training data so avoid slowing down training (see the Amount of training data) Quality scores are the probability of error for individual errors. Quality score recalibration attempts to model the observed error rate in control reads as a function of the SVM decision score to subsequently use this correlation to produce quality scores that model more accurately the observed error rate for the remaining sequences. This option is triggered using the "--recalibration" option. Make sure a sufficient amount of training data is available as Ibis will evaluate the error on a per cycle, per nucleotide. If a small amount of training data is available, consider using "--cgroup=CGROUP" to cluster consecutive cycles together to evaluate the error rate. However, please note that this will increase cycle bias in the resulting quality scores (observed error rate according to cycle). The maximum quality score (PHRED scale) predicted by Ibis can be specified using the "--maxqual=MAXQUAL" parameter. This has to be lowered for runs using older version of the chemistry kit (e.g. 37,38) and can be increased for more recent sequencers MiSeq (e.g. 40,41). In tests we have seen that Ibis does not get considerably better the more training data is used and that as little as 50'000 reads already allow for 97% of the improvement. We saw the performence increase leveling off at something like 1 million reads. So we do recommend having at least a million reads for training and another million for testing, i.e. 2 million reads in each training_seqs and training_seqs_r2. Taking more data will mostly slow down the whole process for a very small gain in prediction performance. Thus we recommend using only a subset of tiles for obtaining the training data set from a dedicated control lane (e.g. --training_tiles='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,120'). As indicated in the example, tiles should be spread over the whole lane. What do to if you have a run without phiX control ? Can you still use freeIbis ? Yes but the procedure will be slower and we cannot guarantee you get good better results than Bustard. However, we highly recommend not to sequence without some PhiX spiked in, this is instrumental to evaluate how good your run was. A few pointers:- you need a mature genome build for the organism
- Try to evaluate the number of clusters per tile and estimate the percentage of aligned reads, use this to estimate how many tiles you will need to reach at least 1 million training examples
- To run freeIbis:
- Specify the lane (using -l) and the tiles (using -t) of the tiles you chose for training
- Specify --nomask
- Do not attempt to calibrate quality scores, using PhiX we assume the divergence to the reference is 0 and any mismatch is a sequencing error (once masking is performed), this assumption is violated for non-control targets
However, if one does not have control reads, but a close-by reference (which could be big mammalian genome) for the reads then one would like to do something else. Now one has to assume that since reads are close to the reference something like 99% of the bases should be identical. Actually, if one now sees a difference to the reference one does not know whether it is from divergence or from sequence error (depending on the position in the read the one is more likely than the other). This will cause issues in Ibis's quality score calibration. In cases where error is lower than divergence one might not even want to consider the reference to be correct at all (use --maskMM to use only bases where read and reference match). Further if the reference is big, alignment might now introduce another problem - one might find the position closest to the read sequence but not necessarily find the correct position that the read would go to if its sequence was determined correctly or if it did not have any divergence to the reference. Unfortunately there seems no good way around it, except trusting that due to the low divergence from the reference 99% of the bases will still be correct and hoping that the statistical learner deals well with mislabeled samples - i.e. generalizes well. The SVM used by Ibis seems to generalize well, especially as its simple kernel (linear) does not allow to over train these samples.
Only in a situation where instead of small control genome a big regular genome is used, SOAPv1 gets really slow and bowtie gets useful. There is another factor that makes SOAPv1 a pain on big genomes. SOAPv1 ignores Ns in the query and does not report the reference base for an N in its output. That is why Ibis extracts training sequence directly from the reference genome instead of reconstructing it from the SOAP output. However bowtie, handles Ns as mismatching bases and thus allows to reconstruct the reference sequence without requiring the reference genome. So when using bowtie one can save this step. On the other hand, bowtie has a reduced sensitivity (due to the maximum of three mismatches). Especially for long reads this might not be useful. So we needed to allow bowtie to match more mismatches and the easiest way around this problem was splitting up reads as it also done for SOAPv1 when reads above 60nt shall be processed. In the current version, reads longer 40nt will be split up (increasing bowtie runtime). In case one would like to change this behavior it is controlled by the max_length_bowtie variable.
In summary, running without real control reads does not perform as good as running with them and one normally will not need bowtie for the control read situation as there is no big sized control genome without redundancy.