An accurate and ultra-fast genome assembler
- SYNOPSIS
- Description
- Short-read assembly
- Long-read presets
- Wengan demo
- Wengan benchmark
- Wengan components
- Getting the latest source code
- Limitations
- About the name
- Citation
# Assembling Oxford Nanopore and Illumina reads with WenganM
wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000
# Assembling PacBio reads and Illumina reads with WenganA
wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000
# Assembling ultra-long Nanopore reads and BGI reads with WenganM
wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000
# Non-hybrid assembly of PacBio Circular Consensus Sequence data with WenganM
wengan.pl -x pacccs -a M -l ccs.fastq.gz -p asm4 -t 20 -g 3000
# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, ~600Gb of RAM)
wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000
# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa
# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa
# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa
Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information. Wengan identifies mis-assemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.
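As a toy illustration of what a transitive reduction does (a general graph sketch, not the algorithm implemented in Wengan): if contig A links to B, B links to C, and a longer-range link from A to C also exists, the A to C edge is implied by the path through B and can be dropped, leaving one longer path. The sketch assumes an unweighted, acyclic edge list on standard input; the function name `transitive_reduction` is hypothetical.

```shell
# Toy transitive reduction of a directed acyclic graph (illustration only;
# this is NOT Wengan's implementation). Reads one "from to" edge per line.
transitive_reduction() {
    awk '
    {
        edge[$1, $2] = 1; reach[$1, $2] = 1   # direct edges are reachable
        node[$1] = 1; node[$2] = 1
        from[NR] = $1; to[NR] = $2            # remember input order
    }
    END {
        # Warshall closure: i reaches j if i reaches k and k reaches j.
        for (k in node) for (i in node) for (j in node)
            if (((i, k) in reach) && ((k, j) in reach)) reach[i, j] = 1
        # Drop edge u->w when some direct successor v of u also reaches w.
        for (e = 1; e <= NR; e++) {
            u = from[e]; w = to[e]; keep = 1
            for (v in node)
                if (v != u && v != w && ((u, v) in edge) && ((v, w) in reach)) keep = 0
            if (keep) print u, w
        }
    }'
}

# Example: the long-range edge "A C" is implied by A->B->C and is removed.
printf 'A B\nB C\nA C\n' | transitive_reduction
```

The real graph carries orientations, distances and weights that this sketch ignores; see the preprint for the actual procedure.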
Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscovarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.
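To check a read set against the recommended coverage, divide the total number of sequenced bases by the genome size. A minimal sketch, assuming 4-line FASTQ records and a genome size given in Mb (which is how the `-g 3000` value in the synopsis reads for a human genome, though the unit is an assumption here); `count_bases` and `coverage` are hypothetical helpers, not Wengan commands.

```shell
# Hypothetical helpers, not part of Wengan.

# Total number of bases in one or more gzipped FASTQ files
# (assumes 4-line records: the sequence is the 2nd line of each record).
count_bases() {
    gzip -cd "$@" | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Coverage = total bases / genome size.
# $1: total bases, $2: genome size in Mb (assumed unit, as in "-g 3000").
coverage() {
    awk -v bases="$1" -v mb="$2" 'BEGIN { printf "%.1f\n", bases / (mb * 1e6) }'
}
```

For example, `coverage "$(count_bases lib1.fwd.fastq.gz lib1.rev.fastq.gz)" 3000` reports the short-read coverage of the first synopsis library. At 50X on a 3000 Mb genome this corresponds to 1.5 x 10^11 bases, or about 500 million 2 x 150bp read pairs.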
WenganM uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50Gb of RAM).
WenganA uses the Abyss2 short-read assembler. This is the lowest-memory mode of Wengan and can assemble a complete human genome in less than 40Gb of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
WenganD uses the DiscovarDenovo short-read assembler. This is the most memory-intensive mode of Wengan and needs about 600Gb of RAM to assemble a complete human genome (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
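The ~2 day figure quoted for the Abyss2 and DiscovarDenovo modes follows from dividing CPU hours by the number of CPUs. A small sketch of that arithmetic, assuming ideal parallel scaling (real runs will deviate from this); `wall_days` is a hypothetical helper.

```shell
# wall_days CPU_HOURS CPUS -> estimated wall-clock days,
# assuming perfect scaling across CPUs (an idealisation).
wall_days() {
    awk -v h="$1" -v t="$2" 'BEGIN { printf "%.1f\n", h / t / 24 }'
}

wall_days 900 20   # prints 1.9 (900 CPU hours on 20 CPUs = 45 hours)
```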
The presets define several variables of the wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.
ontlon: preset for raw ultra-long reads from Oxford Nanopore, typically with an N50 > 50kb.
ontraw: preset for raw Nanopore reads, typically with an N50 of ~15kb-40kb.
pacraw: preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 of ~8kb-60kb.
pacccs: preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 of ~15kb. This type of data is not fully supported yet.
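The presets are characterised by the read-length N50, that is, the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch for computing it from a gzipped FASTQ file, assuming 4-line records; `read_n50` is a hypothetical helper, not a Wengan command.

```shell
# read_n50 READS.fastq.gz -> read-length N50 in bp
# (hypothetical helper; assumes 4-line FASTQ records).
read_n50() {
    gzip -cd "$1" \
      | awk 'NR % 4 == 2 { print length($0) }' \
      | sort -rn \
      | awk '{ len[NR] = $1; total += $1 }
             END {
                 # Walk reads from longest to shortest until half the bases are covered.
                 for (i = 1; i <= NR; i++) {
                     acc += len[i]
                     if (acc >= total / 2) { print len[i]; exit }
                 }
             }'
}
```

Running `read_n50 ont.fastq.gz` and comparing the result with the ranges above gives a quick indication of which preset fits the data.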
The repository wengan_demo contains a small dataset and instructions to test Wengan v0.1.
#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git
| Genome | Long reads | Short reads | Wengan Mode | NG50 (Mb) | CPU (h) | RAM (Gb) | Fasta file |
|---|---|---|---|---|---|---|---|
| | | 2x150bp 50X (GIAB:rs1, rs2) | WenganA | 23.08 | 671 | 45 | asm |
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganM | 16.67 | 185 | 53 | asm |
| | | 2x250bp 60X (ENA:rs1, rs2) | WenganD | 33.13 | 550 | 622 | asm |
| HG00073 | PAC 90X (ENA:rl1) | 2x250bp 63X (ENA:rs1, rs2) | WenganD | 29.2 | 800 | 644 | asm |
| NA24385 | ONT 60X (GIAB:rl1) | 2x250bp 70X (GIAB:rs1) | WenganD | 48.8 | 910 | 650 | asm |
| CHM13 | ONT 50X (T2T:rel2) | 2x250bp 66X (ENA:rs1, rs2) | WenganD | 57.4 | 1027 | 647 | asm |
The assemblies generated using Wengan can be downloaded from Zenodo. All the assemblies were run as described in the Wengan preprint. NG50 was computed using a genome size of 3.14Gb.
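NG50 is the contig length L such that contigs of length L or longer cover at least half of the assumed genome size (rather than half of the assembly size, as in N50). A minimal sketch for computing it from an assembly FASTA file; `ng50` and the file name in the usage note are hypothetical, and this is not necessarily the tool used to produce the benchmark figures.

```shell
# ng50 ASSEMBLY.fasta GENOME_SIZE_BP -> NG50 in bp, or "NA" if the
# assembly covers less than half of the genome size (hypothetical helper).
ng50() {
    # 1) one length per contig (handles multi-line FASTA records)
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$1" \
      | sort -rn \
      | awk -v g="$2" '
          # 2) accumulate from the longest contig until half the genome is covered
          { acc += $1; if (acc >= g / 2) { print $1; found = 1; exit } }
          END { if (!found) print "NA" }'
}
```

With the genome size used above, a call would look like `ng50 assembly.fa 3140000000`; the result is in bp, so divide by 10^6 to compare with the NG50 (Mb) column.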
- A de Bruijn graph assembler (Minia, Abyss or DiscovarDenovo)
- FastMIN-SG
- IntervalMiss
- Liger
It is recommended to download and use the latest binary release (Linux) from https://github.com/adigenova/wengan/releases
To compile Wengan run the following command:
#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan
There are specific instructions for each Wengan component. After compilation, copy the binaries to wengan-dir/bin.
Requirements: a C++ compiler and cmake 3.2+. Compilation was tested with gcc version GCC/7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX).
- abyss commit d4b4b5d
- discovarexp-51885 commit f827bab
- minia commit 017d23e
- fastmin-sg commit 710aea0
- intervalmiss commit bb884c4
- liger commit 82658bc
- seqtk commit 2efd0c8
1.- Genomes larger than 4Gb are not supported yet.
Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous inhabitants of south-central Chile. Wengan means "Making the path".
Alex Di Genova, Elena Buena-Atienza, Stephan Ossowski, Marie-France Sagot. Wengan: Efficient and high quality hybrid de novo assembly of human genomes. BioRxiv, link