Wengan

An accurate and ultra-fast genome assembler

SYNOPSIS

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Non-hybrid assembly of PacBio Circular Consensus Sequence data with WenganM
 wengan.pl -x pacccs -a M -l ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, ~600Gb of RAM)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description

Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information: it identifies misassemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.

Short-read assembly

Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscovarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.

WenganM [M]

This Wengan mode uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50Gb of RAM).

WenganA [A]

This Wengan mode uses the Abyss2 short-read assembler. It is the lowest-memory mode of Wengan and can assemble a complete human genome in less than 40Gb of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

WenganD [D]

This Wengan mode uses the DiscovarDenovo short-read assembler. It is the most memory-intensive mode of Wengan: assembling a complete human genome needs about 600Gb of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
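The memory figures quoted for the three modes can be turned into a quick pre-flight check. The sketch below is only an illustration: `modes_fitting_ram` is a hypothetical helper (not part of Wengan), and the thresholds are the approximate human-genome figures from this README, so real runs may need more headroom.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Wengan): list the modes whose approximate
# human-genome RAM figure quoted in this README fits in the given RAM (in Gb).
modes_fitting_ram() {
  local ram_gb=$1 fits=""
  # Approximate figures from the text: A ~40Gb, M ~50Gb, D ~600Gb.
  [ "$ram_gb" -ge 40 ]  && fits="$fits A"
  [ "$ram_gb" -ge 50 ]  && fits="$fits M"
  [ "$ram_gb" -ge 600 ] && fits="$fits D"
  # Strip the leading space; prints an empty line if no mode fits.
  echo "${fits# }"
}

modes_fitting_ram 64    # prints: A M
```

The printed letters are the values accepted by the `-a` option in the SYNOPSIS.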

Long-read presets

The presets define several variables of the wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.
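To check a dataset against the recommended coverage, depth can be estimated as total sequenced bases divided by genome size. The following is a minimal sketch, not part of Wengan: `fastq_total_bases` and `coverage_x` are hypothetical helper names, four-line FASTQ records are assumed, and the genome size is assumed to be given in megabases, as the `-g 3000` value in the SYNOPSIS suggests for a ~3Gb genome.

```shell
#!/usr/bin/env bash
# Sum the sequence bases of FASTQ records read from stdin.
# Assumes four-line records (sequence on every 2nd line of 4, no wrapping).
fastq_total_bases() {
  awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# Depth of coverage = total bases / genome size.
# genome_size_mb is in megabases (assumption: same unit as the -g option).
coverage_x() {
  local total_bases=$1 genome_size_mb=$2
  awk -v b="$total_bases" -v g="$genome_size_mb" \
    'BEGIN { printf "%.1f\n", b / (g * 1e6) }'
}

# With a real file this could look like:
#   coverage_x "$(gzip -dc ont.fastq.gz | fastq_total_bases)" 3000
coverage_x 90000000000 3000   # prints: 30.0
```

The same two helpers apply to short reads by summing the bases of both the forward and reverse files.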

ontlon

Preset for raw ultra-long reads from Oxford Nanopore, typically with an N50 > 50kb.

ontraw

preset for raw Nanopore reads typically with an N50 ~[15kb-40kb].

pacraw

Preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 ~[8kb-60kb].

pacccs (experimental)

Preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 ~[15kb]. This type of data is not fully supported yet.
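The preset descriptions above can be summarised as a small lookup from technology and read N50 to a value for `-x`. This is a sketch under stated assumptions: `suggest_preset` and its technology labels (`ont`, `pacbio`, `ccs`) are hypothetical names, and treating every Nanopore dataset with an N50 of 50kb or less as `ontraw` is an assumption, since the text leaves the 40kb-50kb range unspecified.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Wengan): map a technology label and a
# read N50 in bp to one of the preset names described in this README.
suggest_preset() {
  local tech=$1 n50=$2
  case "$tech" in
    ont)
      # ontlon is described for N50 > 50kb; anything else falls back to
      # ontraw (assumption: the README does not cover 40kb-50kb explicitly).
      if [ "$n50" -gt 50000 ]; then echo ontlon; else echo ontraw; fi ;;
    pacbio) echo pacraw ;;
    ccs)    echo pacccs ;;   # experimental preset
    *)      echo "unknown technology: $tech" >&2; return 1 ;;
  esac
}

suggest_preset ont 62000   # prints: ontlon
```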

Wengan demo

The repository wengan_demo contains a small dataset and instructions to test Wengan v0.1.

#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git

Wengan benchmark

| Genome | Long reads | Short reads | Wengan Mode | NG50 (Mb) | CPU (h) | RAM (Gb) | Fasta file |
|---|---|---|---|---|---|---|---|
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB: rs1, rs2) | WenganA | 23.08 | 671 | 45 | asm |
| NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB: rs1, rs2) | WenganM | 16.67 | 185 | 53 | asm |
| NA12878 | ONT 35X (rel5) | 2x250bp 60X (ENA: rs1, rs2) | WenganD | 33.13 | 550 | 622 | asm |
| HG00073 | PAC 90X (ENA: rl1) | 2x250bp 63X (ENA: rs1, rs2) | WenganD | 29.2 | 800 | 644 | asm |
| NA24385 | ONT 60X (GIAB: rl1) | 2x250bp 70X (GIAB: rs1) | WenganD | 48.8 | 910 | 650 | asm |
| CHM13 | ONT 50X (T2T: rel2) | 2x250bp 66X (ENA: rs1, rs2) | WenganD | 57.4 | 1027 | 647 | asm |

The assemblies generated using Wengan can be downloaded from Zenodo. All the assemblies were run as described in the Wengan preprint. NG50 was computed using a genome size of 3.14Gb.
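For readers who want to compute the same metric on their own assemblies, NG50 is the length of the contig at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the assumed genome size. The sketch below is an illustration, not the script used for the benchmark: `ng50` is a hypothetical helper that takes the genome size in bp and reads contig lengths, one per line, from stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print NG50 for contig lengths read from stdin
# (one length per line), given the genome size in bp as first argument.
ng50() {
  local genome_size=$1
  # Sort lengths from longest to shortest, then accumulate until half of
  # the genome size is covered; print NA if the assembly never gets there.
  sort -rn | awk -v g="$genome_size" '
    { cum += $1
      if (!found && cum >= g / 2) { print $1; found = 1 } }
    END { if (!found) print "NA" }'
}

printf '%s\n' 40 30 20 10 | ng50 120   # prints: 30
```

For the benchmark above, the genome size argument would be 3140000000 (3.14Gb).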

Wengan components

Getting the latest source code

Instructions

It is recommended to download and use the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases

Building Wengan from source

To build Wengan from source, first fetch the code together with its components:

#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan

There are specific instructions for each Wengan component. After compilation you have to copy the binaries to wengan-dir/bin.

Requirements

A C++ compiler and cmake 3.2+. Compilation was tested with gcc GCC/7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX).

Specific component source code versions used to build Wengan v0.1

  1. abyss commit d4b4b5d
  2. discovarexp-51885 commit f827bab
  3. minia commit 017d23e
  4. fastmin-sg commit 710aea0
  5. intervalmiss commit bb884c4
  6. liger commit 82658bc
  7. seqtk commit 2efd0c8

Limitations

  1. Genomes larger than 4Gb are not supported yet.

About the name

Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous inhabitants of south-central Chile. Wengan means "Making the path".

Citation

Alex Di Genova, Elena Buena-Atienza, Stephan Ossowski, Marie-France Sagot. Wengan: Efficient and high quality hybrid de novo assembly of human genomes. BioRxiv, link
