add step to break up misassemblies in de novo contigs #804
Comments
Since we're planning to blast contigs for metagenomic analysis (#795 in progress), maybe we can reuse the hits for both splitting and LCA assignment? At a minimum we could re-run blast for assignment only on the contigs that need to be split.
So which workflow would this be in? Right now metagenomics analysis and reference-assisted assembly are different workflows. For the latter we'd blast against a small db of just the taxon we're assembling, e.g. all known mumps genomes.
Which application is this for, and what kind of problem are we trying to solve? If this is about the assembly process, the simplest approach would be to leave the contigs alone and just make sure the scaffolding code tolerates splitting up contigs. It’s already aligning the contigs to the references the user wants to align to at that point, so that’s probably the right step to solve the problem. It’s possible that the current code already handles this appropriately. If this is about a metagenomic workflow, perhaps the first step is really to determine how often this problem happens and whether metaSPAdes makes it go away. And if not, maybe Chris’s suggestion makes the most sense: have the downstream contig classifier break things up based on what it sees.
I was thinking assembly. For metagenomics, it doesn't necessarily matter where a contig blasts to, only that it does? In assembly, a misassembled contig might not align well to a reference under novoalign: if it glues together two somewhat-far-away parts, it'll look like an unreasonably big insertion.
There are also de novo scaffolding tools I wanted to look at that scaffold based on read pairs rather than a reference (as part of a general effort to make the process less reference-dependent). These tools might get confused by misassembled contigs. Of course, using a collection of references to fix misassemblies is itself a reference-dependent process, but less so than reference-assisted scaffolding.
De novo contigs sometimes glue together non-adjacent pieces of the genome, or repeat the same piece twice. Add a step, as in QUAST and SHIVER, to blast the contigs against known references and, if a contig has more than one local match, break the contig up. If we break up too much, scaffolding and gapfilling should be able to restore contiguity.
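A rough sketch of what the splitting step could look like, assuming the blast hits for one contig have already been parsed into 1-based inclusive query coordinates (the `qstart`/`qend` columns of tabular blast output). The function name `split_contig` and the midpoint cut rule are hypothetical choices for illustration, not what QUAST or SHIVER actually do:

```python
from typing import List, Tuple


def split_contig(seq: str, hits: List[Tuple[int, int]]) -> List[str]:
    """Split a contig wherever it has more than one distinct local match.

    `hits` holds (qstart, qend) pairs in 1-based inclusive contig
    coordinates. Hypothetical helper: the cut rule is an assumption.
    """
    # Normalise orientation, then sort by start (longest first on ties).
    intervals = sorted(
        ((min(s, e), max(s, e)) for s, e in hits),
        key=lambda iv: (iv[0], -iv[1]),
    )

    # Drop hits fully contained in a previously kept hit.
    kept: List[Tuple[int, int]] = []
    for start, end in intervals:
        if kept and end <= kept[-1][1]:
            continue
        kept.append((start, end))

    # Zero or one local match: nothing to break up.
    if len(kept) < 2:
        return [seq]

    # Cut halfway between the end of one hit and the start of the next;
    # this covers both a gap between hits and a partial overlap.
    pieces: List[str] = []
    prev_cut = 0
    for (_, end1), (start2, _) in zip(kept, kept[1:]):
        cut = (end1 + start2) // 2
        pieces.append(seq[prev_cut:cut])
        prev_cut = cut
    pieces.append(seq[prev_cut:])
    return pieces
```

This deliberately over-splits: two collinear hits to the same reference are broken apart just like a real misassembly, on the assumption stated above that scaffolding and gapfilling can restore contiguity afterwards.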