Toullig is (since January 17, 2017) a reader and parser of ONT Fast5 files (basecalled or not) and a writer of '.fastq' files. Since March 1, Toullig can also trim ONT fastq reads based on the RT adaptors.
This project is now closed. Most of this work has been merged into Eoulsan.
Aurélien Birer, [email protected]
- Requirements
- Installation
- General options
- How it works
- Fast5tofastq
- TrimFastq
- Gtftogpd
- Development environment
- Repository
- License
You only need Java 8 and Maven installed on your computer. This alpha version works on Ubuntu (a Linux distribution).
sudo apt-get install maven
git clone https://github.com/GenomicParisCentre/toullig.git
cd toullig
mvn clean install
#Information
-help | -h #display help
-version #display version of toullig
-about #display information of toullig
-license #display license of toullig
Toullig has 3 tools:
- fast5tofastq: reads the root directory of your ONT run after the basecalling step (Metrichor/Albacore) and produces a '.fastq' file.
- trim: trims the reads of an ONT fastq using a sam file, based on the RT adaptors.
- gtftogpd: converts a GTF annotation file to the GPD format.
Not Barcoded Directories Tree
├── downloads
│ ├── fail
│ └── pass
└── uploaded
Barcoded Directories Tree (with 6 barcodes)
├── downloads
│ ├── fail
│ │ └── unclassified
│ └── pass
│   ├── BC01
│   ├── BC02
│   ├── BC03
│   ├── BC04
│   ├── BC05
│   └── BC06
└── uploaded
**Barcoded Directories Tree**
├── unclassified
├── barcode01
├── barcode02
├── barcode03
├── barcode04
├── barcode05
├── barcode06
└── ...
Chemistry Kit / Type of sequencing | 1D | 2D | 1D² |
---|---|---|---|
R7.3 | Not tested | Tested | Not available |
R9 | Not tested | Tested | Not available |
R9.4 | Tested | Tested | Not available |
R9.5 | Not tested | Not tested | Not tested |
The module works with the Metrichor (now discontinued) and Albacore basecallers.
When toullig Fast5tofastq runs, the program performs these steps:
- List the '.fast5' files.
- Read a '.fast5' file.
- Write the sequence(s) in a '.fastq' file.
- Make some log files.
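The steps above can be sketched in a few lines of Python. This is only an illustration, not Toullig's Java implementation: the function names are hypothetical, and the Fast5 reading itself is left to a caller-supplied `read_sequence` function because the internal HDF5 layout depends on the basecaller version.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple

# (read name, bases, qualities), or None when the file holds no sequence.
Record = Optional[Tuple[str, str, str]]


def list_fast5(root: Path) -> List[Path]:
    """Step 1: list every '.fast5' file under the run directory, sorted."""
    return sorted(root.rglob("*.fast5"))


def format_fastq(name: str, bases: str, quals: str) -> str:
    """Format one four-line FASTQ record."""
    if len(bases) != len(quals):
        raise ValueError("sequence and quality strings differ in length")
    return f"@{name}\n{bases}\n+\n{quals}\n"


def convert(files: Iterable[Path],
            read_sequence: Callable[[Path], Record],
            out_path: Path) -> Tuple[int, int]:
    """Steps 2-4: read each file, write its record, count written and null sequences."""
    written = null = 0
    with out_path.open("w") as out:
        for path in files:
            record = read_sequence(path)  # hypothetical reader supplied by the caller
            if record is None:
                null += 1  # nothing to write for this file
                continue
            out.write(format_fastq(*record))
            written += 1
    return written, null
```

The two counters returned by `convert` correspond to the kind of figures a conversion log would report (sequences written and null sequences).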
Currently, we use Metrichor in development to basecall our '.fast5' files.
It is important to clearly understand the 4 types of fastq sequence (see Sequence available for each run configuration) given by the basecaller (in our case, Metrichor).
The template sequence is the first sequence basecalled; it corresponds to the read itself, sequenced in 1D. This sequence contains the following sections:
The complement sequence is the second sequence basecalled (in 2D); it corresponds to the reverse of the template sequence. This sequence contains the following sections:
/!\ The template and complement sequences can be reversed (this depends on how the leader adaptor is attached).
The consensus sequence is the result of aligning the template and complement sequences (in 2D). This sequence contains the following sections:
/!\ The consensus sequence can be based on either the template sequence or the complement sequence.
The transcript sequence is the consensus sequence (in 2D) with the barcode sequence trimmed (/!\ the barcode can still be present). This sequence contains the following sections:
/!\ The transcript sequence is only available for barcoded ONT runs.
The table below lists, for each sequencing configuration, the sequences available for the fast5tofastq conversion.
In bold, the type of sequencing that is of most interest.
Barcoding | Type of Sequencing | Type of sequence available with Metrichor | Type of sequence available with Albacore |
---|---|---|---|
YES | 1D | Template, Transcript | Template |
YES | 2D | Template, Complement, Consensus, Transcript | Template, Complement, Consensus |
YES | 1D² | Not tested | Not tested |
NO | 1D | Template | Template |
NO | 2D | Template, Complement, Consensus | Template, Complement, Consensus |
NO | 1D² | Not tested | Not tested |
#Options
-status pass|fail|unclassified (default: pass) # The status of the '.fast5' files
-type template|complement|consensus|transcript (default: transcript) # The type of sequence
-mergeSequence true|false (default: false) # Merge all types of sequence, whatever the status
-compress GZIP|BZIP2 (default: none) # The type of compression for the output '.fastq' files
#Arguments
-rootDirectoryFast5run /home/user/yourRootDirectoryFast5run
-outputDirectoryFastq /home/user/yourOutputDirectoryFastq
I have a directory from a barcoded 2D MinION run, and I only want the 'template', 'complement' and 'consensus' fastq sequences for the fast5 files in the 'fail' status/directory.
bash ./target/dist/toullig-0.2-alpha/toullig.sh fast5tofastq -status fail -type template,complement,consensus /home/user/myRootDirectoryFast5run /home/user/myOutputDirectoryFastq
I have a directory from a non-barcoded 2D MinION run, and I only want the default fastq sequence for the fast5 files in the 'pass' status/directory.
bash ./target/dist/toullig-0.2-alpha/toullig.sh fast5tofastq -status pass /home/user/myRootDirectoryFast5run /home/user/myOutputDirectoryFastq
WARNING: This last example is a trap! The default type of sequence is transcript, but the run is not barcoded, and the transcript sequence is not available for non-barcoded ONT runs (see Sequence available for each run configuration).
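To avoid the trap described in the warning, the requested sequence types can be checked against the availability table before launching a conversion. The sketch below simply encodes that table as a Python dictionary; the function name and the dictionary are illustrative and are not part of Toullig. The 1D² rows are "Not tested" in the table, so they are deliberately left out.

```python
# Sequence types per (barcoded, sequencing type, basecaller), copied from the
# availability table. 1D² configurations are "Not tested" and therefore absent.
AVAILABLE = {
    (True, "1D", "metrichor"): {"template", "transcript"},
    (True, "1D", "albacore"): {"template"},
    (True, "2D", "metrichor"): {"template", "complement", "consensus", "transcript"},
    (True, "2D", "albacore"): {"template", "complement", "consensus"},
    (False, "1D", "metrichor"): {"template"},
    (False, "1D", "albacore"): {"template"},
    (False, "2D", "metrichor"): {"template", "complement", "consensus"},
    (False, "2D", "albacore"): {"template", "complement", "consensus"},
}


def unavailable_types(barcoded, sequencing, basecaller, requested):
    """Return, sorted, the requested types that this run configuration cannot provide."""
    key = (barcoded, sequencing, basecaller.lower())
    if key not in AVAILABLE:
        raise KeyError(f"configuration not covered by the table: {key}")
    return sorted(set(requested) - AVAILABLE[key])
```

For the second example above (2D, not barcoded, default type), `unavailable_types(False, "2D", "metrichor", ["transcript"])` reports `transcript` as unavailable, which is exactly the trap.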
This log contains the main information about the conversion of the '.fast5' files to '.fastq' files. The file is structured in 4 sections:
- The start date of the conversion.
- The number of files read for each folder created by the basecalling.
- The number of sequences written to the fastq and the number of null sequences (not written).
- The end date of the conversion.
This log contains the list of paths of the corrupt '.fast5' files. These files can be read and information can be extracted from them, but the HDF5 library used to open them detects a corruption.
This log contains the final status of each basecalling workflow for each folder created by the basecalling.
With cDNA protocols such as SQK-MAP006, one problem with MinION reads in fastq format is that the read is not the transcript we expect. The read still has the RT adaptor and, in some cases, the barcode with the sequencing adaptors (leader/hairpin).
To find the true start/stop positions of the mRNA, it is important to trim the reads to remove these unwanted sequences.
When toullig Trim runs, the program performs these steps:
- Mark the index between the outliers and the mRNA (mode P (Perfect) or SW (Side-Window)).
For Cutadapt:
- Write '.fasta' file for each outlier.
- Trim with Cutadapt
- Write the Cutadapt output merged in a '.fastq' file.
For Trimmomatic:
- Read a fastq sequence.
- Trim with Trimmomatic.
- Write the sequence trimmed into a '.fastq' file.
For no trimmer:
- Read a fastq sequence.
- Cut the sequence at the marked index between the outlier and the transcript.
- Write the sequence trimmed into a '.fastq' file.
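The "no trimmer" path can be illustrated as follows. This sketch assumes that the outliers correspond to the clipped ends (CIGAR operations S or H) of the read's alignment in the sam file; that is an assumption made for the example, not a description of Toullig's actual code, and the function names are hypothetical. It also ignores reverse-strand alignments, for which the sam sequence is reverse-complemented relative to the fastq read.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_ends(cigar):
    """Return (left, right): the number of clipped bases (S or H) at each end."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    left, i = 0, 0
    while i < len(ops) and ops[i][1] in "SH":
        left += ops[i][0]
        i += 1
    right, j = 0, len(ops) - 1
    while j >= i and ops[j][1] in "SH":
        right += ops[j][0]
        j -= 1
    return left, right


def cut_read(bases, quals, left, right):
    """Remove `left` bases from the start and `right` bases from the end of a read."""
    if len(bases) != len(quals):
        raise ValueError("sequence and quality strings differ in length")
    if left < 0 or right < 0 or left + right > len(bases):
        raise ValueError("outlier lengths do not fit in the read")
    end = len(bases) - right
    return bases[left:end], quals[left:end]
```

The quality string is cut with the same indices as the bases, so the trimmed record stays a valid fastq entry.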
#Options
-trimmer cutadapt|trimmomatic|no (default: cutadapt) # The trimmer tool used for trimming ('no' for none)
-mode P|SW (default: P) # The trimming mode for the transcript reads (P: Perfect, SW: Side-Window)
-stats true|false (default: false) # Output statistical information on the Cutadapt trimming
#Options Trimming by Side-window mode
-addIndexOutlier (default: 0) # Number of bases from the transcript sequence added to the outlier, to better catch the adaptor when adaptor bases match the reference genome
-thresholdSideWindow (default: 0.8) # The threshold for the Side-Window algorithm
-lengthWindowsSW (default: 15) # The window length for the Side-Window algorithm
#Options Cutadapt
-errorRateCutadapt (default: 0.5) # The error rate for Cutadapt (mismatch + deletion)
#Options Trimmomatic
-seedMismatchesTrimmomatic (default: 17) # The maximum number of mismatched bases for Trimmomatic
-palindromeClipThresholdTrimmomatic (default: 30) # The palindrome clip threshold for Trimmomatic
-simpleClipThreshold (default: 7) # The simple clip threshold for Trimmomatic
#Arguments
-samFile /home/user/yourSamFile
-fastqFile /home/user/yourFastqFile
-fastqOutputFile /home/user/yourFastqTrimOutput
-adaptorFile /home/user/yourAdaptorFile
-workDir /home/user/yourTmpRepertoryOfWork
bash ./target/dist/toullig-0.2-alpha/toullig.sh trim /home/user/samFile.sam /home/user/fastqONTFile.fastq /home/user/myFastqTrim.fastq ~/toullig/config_files/adaptor_RT_sequence_modify_for_nanopore.txt /home/user/yourTmpRepertoryOfWork
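The Side-Window options (-thresholdSideWindow and -lengthWindowsSW) suggest a sliding-window scan, but this document does not describe the algorithm itself. The sketch below is therefore only one plausible reading, under stated assumptions: each base of the read is flagged as matching the reference or not, and the transcript is assumed to start at the first window whose fraction of matching bases reaches the threshold. The function name and the boolean input are hypothetical.

```python
def side_window_start(matches, window=15, threshold=0.8):
    """Return the start of the first window whose match fraction reaches the threshold.

    `matches` is a list of booleans, one per base (True = matches the reference).
    Returns None when the read is shorter than the window or no window qualifies.
    """
    if window <= 0:
        raise ValueError("window length must be positive")
    n = len(matches)
    if n < window:
        return None
    count = sum(matches[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one base: add the entering base, drop the leaving one.
            count += matches[start + window - 1] - matches[start - 1]
        if count / window >= threshold:
            return start
    return None
```

With the defaults (window of 15, threshold of 0.8), a window qualifies once at least 12 of its 15 bases match.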
AlignQC is a QC tool for alignments. It can run an analysis with annotation, but only with an annotation file in GPD format. Toullig therefore provides a tool to convert the common GTF format to the GPD format.
#Arguments
-gtfFile /home/user/yourGTFFile
-gpdOutputFile /home/user/yourGPDFile
bash ./target/dist/toullig-0.2-alpha/toullig.sh gtftogpd /home/user/GTFFile.gtf /home/user/GPDFile.gpd
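To give an idea of what such a conversion involves, here is a minimal Python sketch. It assumes an 11-column GPD layout (gene name, transcript name, chromosome, strand, transcript start and end, CDS start and end, exon count, exon starts, exon ends, with 0-based starts); check this against the AlignQC documentation before relying on it. The function is illustrative and is not Toullig's implementation. As a simplification, it only reads 'exon' lines and sets the CDS bounds to the transcript bounds.

```python
import re

# Matches GTF attributes of the form: key "value"
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_gpd(gtf_lines):
    """Convert GTF exon lines into one GPD line per transcript (assumed 11-column layout)."""
    transcripts = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(_ATTR.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        entry = transcripts.setdefault(tid, {
            "gene": attrs.get("gene_id", tid),
            "chrom": fields[0],
            "strand": fields[6],
            "exons": [],
        })
        # GTF coordinates are 1-based inclusive; starts become 0-based here.
        entry["exons"].append((int(fields[3]) - 1, int(fields[4])))

    gpd_lines = []
    for tid, t in transcripts.items():
        exons = sorted(t["exons"])
        tx_start = exons[0][0]
        tx_end = max(end for _, end in exons)
        starts = ",".join(str(s) for s, _ in exons) + ","
        ends = ",".join(str(e) for _, e in exons) + ","
        gpd_lines.append("\t".join([
            t["gene"], tid, t["chrom"], t["strand"],
            str(tx_start), str(tx_end),
            str(tx_start), str(tx_end),  # simplification: CDS bounds = transcript bounds
            str(len(exons)), starts, ends,
        ]))
    return gpd_lines
```

Exons are grouped by `transcript_id` and sorted by start position, so the input lines do not need to be ordered.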
Ubuntu version: '16.04.4'
Java version: '1.8.0_121'
Maven version: '3.2.3'
Currently the Git reference repository is https://github.com/GenomicParisCentre/toullig.