Insights needed for comparison with other assemblers #46

GeoMicroSoares · 2024-04-16T16:29:51Z

Hi there,

Congratulations on your tool - I'm really excited about PenguiN, as this could be an interesting alternative to explore. As such, I've set out to compare it with our group's gold standard for environmental metagenomics, metaSPAdes, and am getting some really interesting data. Could you help me interpret it, so we can decide whether or not to switch to PenguiN?

Here's how everything has been run so far on an example environmental metagenome:

metaSPAdes:

python3 metaspades.py -m 1150 -1 $read1 -2 $read2 -t 10 -o ${assembly}_metaspades3.15.5

PenguiN:

penguin guided_nuclassemble $read1 $read2 --threads 10 1 ${assembly}_PenguiN.fasta tmp

PenguiN_wmods:

penguin guided_nuclassemble $read1 $read2 --threads 10 --max-seq-len 1000000 --contig-output-mode 0 --num-iterations 10 --min-length 35 --use-all-table-starts 1 ${assembly}_PenguiN_wmods.fasta tmp

1) Mapping statistics really vary across methods
Using bowtie2, I mapped the reads back to each assembly as below, after filtering each assembly to contain only scaffolds/contigs >1000 bp:

bowtie2-build $assembly bt2/$assembly > bt2/$assembly.log
bowtie2 -p 10 --sensitive -x bt2/$assembly -1 $read1 -2 $read2 2> $assembly.sam.log | shrinksam-master/shrinksam > $assembly.sam
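The >1000 bp length filter mentioned above is not shown in the commands. One way it could be done is sketched here with awk, assuming a plain (possibly line-wrapped) FASTA file. The function name `filter_fasta_min_len` is hypothetical, and this is not necessarily how the filtering was done for this comparison.

```shell
# filter_fasta_min_len FASTA MIN_LEN   (hypothetical helper name)
# Prints only records whose sequence is strictly longer than MIN_LEN bases.
# Wrapped sequence lines are joined, so each kept record is written
# as a header line followed by one sequence line.
filter_fasta_min_len() {
  awk -v min="$2" '
    function emit() { if (name != "" && length(seq) > min) print name "\n" seq }
    /^>/ { emit(); name = $0; seq = ""; next }   # new record: write out the previous one
    { seq = seq $0 }                             # accumulate sequence lines
    END { emit() }                               # write out the last record
  ' "$1"
}
```

For example, `filter_fasta_min_len scaffolds.fasta 1000 > scaffolds.gt1000.fasta` would keep only records above 1000 bp (both file names are placeholders).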

Looking at the log files here's what I see:

Is this something you see a lot in your experience? In principle, I'd say that higher percentages of 'aligned concordantly >1 times' should be indicative of multimapping and thus not a good sign?
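To compare the assemblies side by side, the concordance percentages can be pulled out of each bowtie2 log. The sketch below assumes the standard paired-end alignment summary that bowtie2 writes to stderr (captured in `$assembly.sam.log` by the command above). The function name `bt2_concordant_summary` is hypothetical.

```shell
# bt2_concordant_summary LOG   (hypothetical helper name)
# Prints "<category><TAB><percent>" for each "aligned concordantly" summary
# line, plus the overall alignment rate.
bt2_concordant_summary() {
  awk '
    # Summary lines look like: "527 (5.27%) aligned concordantly >1 times".
    # Requiring a "(NN.NN%)" second field skips the later
    # "N pairs aligned concordantly 0 times; of these:" sub-header.
    $2 ~ /^\([0-9.]+%\)$/ && /aligned concordantly/ {
      pct = $2; gsub(/[()%]/, "", pct)
      label = $0; sub(/^.*aligned concordantly /, "", label)
      print label "\t" pct
    }
    /overall alignment rate/ { pct = $1; sub(/%/, "", pct); print "overall\t" pct }
  ' "$1"
}
```

Running `bt2_concordant_summary "$assembly.sam.log"` for each assembly gives one tab-separated line per category, which is easy to paste into a table.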

2) Quite a lot of difference in mean/median contig lengths:

Here's a quick plot of median/mean contig lengths (scaffolds for metaSPAdes), with standard deviations as vertical lines from each point:

This makes sense when looking at length frequency distributions for each assembly:
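The length summary statistics can be recomputed directly from each FASTA file. This sketch assumes the population standard deviation (the post does not say which definition was used for the plot). The function name `contig_length_stats` is hypothetical.

```shell
# contig_length_stats FASTA   (hypothetical helper name)
# Prints the number of contigs plus mean, median and population SD of their lengths.
contig_length_stats() {
  # Step 1: one length per record (wrapped sequence lines are summed).
  awk '/^>/ { if (n) print n; n = 0; next }
       { n += length($0) }
       END { if (n) print n }' "$1" |
    # Step 2: sort numerically so the median can be read by position.
    sort -n |
    # Step 3: summary statistics over the sorted lengths.
    awk '{ len[NR] = $1; sum += $1 }
         END {
           if (NR == 0) exit 1
           mean = sum / NR
           if (NR % 2) median = len[(NR + 1) / 2]
           else        median = (len[NR / 2] + len[NR / 2 + 1]) / 2
           for (i = 1; i <= NR; i++) { d = len[i] - mean; ss += d * d }
           printf "n=%d mean=%.1f median=%.1f sd=%.1f\n", NR, mean, median, sqrt(ss / NR)
         }'
}
```

Calling it once per assembly FASTA gives directly comparable numbers, as long as the same length filter was applied to each file first.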

Do you contemplate adding a scaffolding module to PenguiN? I wonder how these values could change with that!

Finally, here's some more general stats on each assembly:

I think there's a lot of potential in PenguiN - I'm still reading up on it, but will take any insights you're willing to offer as you look at this data! I can also share rps3 taxonomic profiles I've run on each assembly if you'd want.

Thanks in advance for the attention!


AnnSeidel · 2024-04-16T20:39:56Z

Happy to see that you are considering PenguiN as a potential assembly tool.

From the command calls you used, I guess the final contigs from PenguiN's default parameters are too short, so increasing the --num-iterations parameter is a good idea. I don't know your data, but I would first point out that it might make sense to increase the number of iterations only at the nucleotide level. Proteins are usually already assembled to full length within a few iterations, so running more iterations in both stages brings no advantage and increases the risk of redundancy. You can set the number of iterations separately using --num-iterations aa:5,nucl:10 (or even higher values, depending on your data).
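Applied to the call from the original post, the suggestion could look like the sketch below. It only wraps the command in a function so the arguments are explicit. The function name `run_penguin_split_iters` is hypothetical, the thread count is copied from the original post, and the iteration values are the ones suggested above, not tuned for this dataset.

```shell
# run_penguin_split_iters READ1 READ2 OUT_FASTA TMP_DIR   (hypothetical helper name)
# Same call shape as in the original post, but with separate iteration counts
# for the protein (aa) and nucleotide (nucl) stages.
run_penguin_split_iters() {
  penguin guided_nuclassemble "$1" "$2" \
    --threads 10 \
    --num-iterations aa:5,nucl:10 \
    "$3" "$4"
}
```

For example: `run_penguin_split_iters "$read1" "$read2" "${assembly}_PenguiN_iters.fasta" tmp` (the output file name is a placeholder).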

Regarding 1)
You are right, 'aligned concordantly >1 times' is indicative of multimapping. However, whether it is a good or bad sign is difficult to say. With PenguiN we aim to also resolve very closely related strains, whereas metaSPAdes reconstructs a consensus assembly of a strain mixture. Depending on the mapping sensitivity, reads might map to multiple correctly assembled strain contigs in the PenguiN assembly.

On the other hand, PenguiN's approach carries the risk of producing redundant contigs. To overcome dead ends in low-coverage regions during the greedy iterative assembly, PenguiN (and Plass) re-uses reads. More precisely, different contigs can be extended with the same read, so in principle the same genomic region can be built multiple times in parallel. We introduced a few ideas to minimize this effect, but it cannot be prevented completely. This is why we integrated the Linclust algorithm [Steinegger and Söding, 2018] into PenguiN as the last step and only output the cluster representatives as final contigs. However, Linclust's speed comes at the expense of some sensitivity. In cases where redundancy is problematic, I suggest a more sensitive all-against-all clustering as a post-processing step after the assembly. In the benchmarks in our paper, for example, we added a clustering step using the nucleotide clustering workflow of the MMseqs2 software suite.
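A post-assembly clustering step along these lines might look like the sketch below. It assumes that `mmseqs easy-cluster` is a suitable entry point for the MMseqs2 clustering mentioned here. The identity and coverage thresholds (0.99) and the coverage mode are illustrative assumptions, not the settings used in the paper benchmarks, and the function name `cluster_contigs` is hypothetical.

```shell
# cluster_contigs CONTIGS_FASTA OUT_PREFIX TMP_DIR [MIN_ID] [MIN_COV]   (hypothetical helper name)
# Runs MMseqs2 cascaded clustering on the assembled contigs.
# The thresholds default to 0.99; these are assumptions, adjust for your data.
cluster_contigs() {
  min_id=${4:-0.99}
  min_cov=${5:-0.99}
  # --cov-mode 1 applies the coverage threshold to the target sequence.
  mmseqs easy-cluster "$1" "$2" "$3" \
    --min-seq-id "$min_id" -c "$min_cov" --cov-mode 1
}
```

With `cluster_contigs "${assembly}_PenguiN.fasta" clustered tmp`, easy-cluster writes the cluster representatives to `clustered_rep_seq.fasta`, which would then replace the original contig set in the mapping comparison.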

Regarding 2)
At the moment we are not working on a scaffolding module. However, we have already thought about it.
