[New Workflow] Flye_denovo to replace DragonFlye #692
Thanks for all the comments
Local Testing: Successfully reran the workflow locally with default settings to confirm all tasks executed as expected. Confirmed the new filter_contigs workflow output and tested with homopolymer and short-read data to confirm trimming.
Terra Testing: Reran the workflow in Terra using default settings. No major task overhauls were performed; results matched expectations for typical use cases.
Just a few more to follow up on from Sage's previous comments.
Thanks for making all of these changes @fraser-combe. This looks much more sound, programmatically. I do have a few remaining questions, more on the biological comparison/assessment side of things. I understand the predicted Gambit Taxon was identical across samples assembled with flye vs dragonflye, which is great. It looks like you have a note in this PR to "Add in comparison results." Would you please do that? I think we want to get granular here in how we assess the flye/dragonflye assemblies. In theory, as we're just unwrapping Dragonflye into its separate components, the assemblies should be largely similar. If you would, please pull together a table with the following (feel free to just update the table that already exists in this PR, if you like):
Once you've got this, we can link up as a team to review & make sure our "hive mind" is in agreement across the board.
Thanks for adding the table of SNP differences here @fraser-combe. This shows what I was hoping for - minimal impact to genome assembly. I'm good with merging this PR. I know @sage-wright was taking a look this morning, as well. So, pending any final comments from her, we can get this thing merged!
Code looks good! Running a final sanity check here for final confirmation. Will merge upon success!
Updated the docker image name. @sage-wright, I reran your submission from before here. I also reran the 8 comparison samples and received identical results, so there is no effect on the comparisons: no contigs were filtered with either the old or the new docker image, because all contigs met the minimum thresholds of 1000 bp length and no homopolymers.
Thanks for the updates. Running a TheiaValidate here for confirmation, will merge upon success 👍
This PR closes #611, closes #585, and closes #565.
🗑️ This dev branch should be deleted after merging to main.
🧠 Summary
This PR introduces a new flye_denovo workflow as a replacement for the Dragonflye workflow. The updated workflow streamlines the assembly and polishing pipeline, focusing on being flexible and modular with the addition of assembly visualization through Bandage plots.
Notable enhancements include:
New tasks, including optional read trimming with Porechop, enhanced assembly visualization with Bandage, and multiple polishing options.
Supports ONT data, hybrid assemblies with Illumina reads, and multiple assembly polishing tools (Medaka, Racon, and Polypolish).
Medaka polishing is set at 1 round as recommended by Rwick, and ONT
⚡ Impacted Workflows/Tasks
flye_denovo workflow.
Dragonflye workflow.
task_porechop.wdl
task_flye.wdl
task_bandageplot.wdl
task_bwa.wdl
task_medaka.wdl
task_racon.wdl
task_dnaapler.wdl
task_polypolish.wdl
task_filtercontigs.wdl
removes task_dragonfly.wdl
This PR may lead to different results in pre-existing outputs: Yes
This PR uses an element that could cause duplicate runs to have different results: Yes
🛠️ Changes
flye_denovo.wdl to replace Dragonflye as a sub workflow.
⚙️ Algorithm
flye_denovo workflow replaces the Dragonflye workflow, with a modular and flexible structure that separates tasks like trimming, assembly, polishing, and final orientation for clarity and maintainability.
Polypolish.
Flye, Porechop, Medaka, Racon, Polypolish, Bandage, and Dnaapler.
Flye, Medaka, Racon, dnaapler and other tasks to their latest stable versions
➡️ Inputs
No
⬅️ Outputs
Added bandage plot png output
version outputs for task level software
medaka models used
Assembly_fasta output from dnaapler for downstream analyses
🧪 Testing
Scenarios tested within TheiaProk. We expected the TheiaProk workflow to complete successfully for each task; for the flye_denovo workflow specifically, we expected successful creation of the assembly FASTA after any filtering or polishing.
Default path: Flye > Medaka polish > Filtercontigs > dnaapler

https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/586af547-03dd-4cb8-8877-8041d0064464
medaka output model and version
Porechop run, i.e. skip_trim_reads = false
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/e5a645db-0e17-4c58-84ce-4e1f44ef9042
Skip polishing skip_polishing = true
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/110a4ffa-c208-4849-a3b2-11d88ffddc90
Racon polishing pathway (polishing_rounds = 2)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/95694506-946f-4ef7-9b3d-657522cc7809
Hybrid assembly ONT data and Illumina (Polypolish and BWA)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/4fa9c915-4572-43b9-bd9b-28bed40c75a4
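The scenarios above can be summarised as a small decision sketch. This is an illustration only, not the WDL logic itself: the `polisher` and `has_illumina` arguments are hypothetical names introduced here, and the choices that hybrid input replaces long-read polishing with BWA and Polypolish, and that `polishing_rounds` repeats the chosen polisher, are assumptions. Bandage plotting is left out to mirror the default path as written.

```python
from typing import List


def planned_steps(
    skip_trim_reads: bool = True,   # default inferred from the test list above
    skip_polishing: bool = False,
    polisher: str = "medaka",       # hypothetical selector name
    polishing_rounds: int = 1,
    has_illumina: bool = False,     # hypothetical flag for hybrid input
) -> List[str]:
    """Return the ordered task names a run would be expected to execute."""
    steps: List[str] = []
    if not skip_trim_reads:
        steps.append("porechop")
    steps.append("flye")
    if not skip_polishing:
        if has_illumina:
            # Assumption: short reads are aligned with BWA, then Polypolish runs.
            steps += ["bwa", "polypolish"]
        else:
            # Assumption: each round reruns the selected long-read polisher.
            steps += [polisher] * polishing_rounds
    steps += ["filter_contigs", "dnaapler"]
    return steps


if __name__ == "__main__":
    print(" > ".join(planned_steps()))
    print(" > ".join(planned_steps(polisher="racon", polishing_rounds=2)))
```

Comparing the returned list against the tasks that actually ran in each Terra job is one quick way to confirm a scenario took the intended path.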
##Comparisons between DragonFlye and New Flye_denovo subworkflow##
Here we are looking for similarities in assemblies, statistics, and downstream analyses. Eight bacterial samples were selected.
Both workflows produce assemblies of similar lengths for each sample, with minor variations (typically within ±1%).
Both workflows achieve high BUSCO completeness scores, generally above 90%.
Both workflows consistently predict the same taxa for each sample.
Comparisons tables
Table 1: Assembly Metrics & BUSCO Scores
SNP Comparison Summary from table: Flye vs. Dragonflye Assemblies
Minimal Differences: Five of the eight samples showed no SNP differences between Flye and Dragonflye assemblies, indicating high consistency between the two methods.
Minimal Variability in Some Samples:
ERR8958704 had 146 SNPs (0.0025% difference).
SAMN05250424 showed the highest SNP count (217 SNPs, 0.0045% difference).
SAMN05596277 had 130 SNPs (0.0027% difference).
Stable Genomes: Samples ERR8958706, ERR8958833, ERR8958835, SAMN23569621, and SAMN23605158 had 0 SNPs, suggesting nearly identical assemblies.
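As a rough sanity check on these figures, the sketch below back-calculates the sequence length each percentage implies. It assumes the percentage is simply the SNP count divided by the compared length, which this PR does not state, and because the reported percentages are rounded, the resulting lengths are approximate.

```python
# SNP counts and percent differences exactly as reported in the summary above.
reported = {
    "ERR8958704": (146, 0.0025),
    "SAMN05250424": (217, 0.0045),
    "SAMN05596277": (130, 0.0027),
}


def implied_length_bp(snps: int, percent: float) -> float:
    """Length implied by percent = snps / length * 100 (an assumed definition)."""
    return snps / (percent / 100.0)


implied = {
    sample: implied_length_bp(snps, pct) for sample, (snps, pct) in reported.items()
}

for sample, length in implied.items():
    print(f"{sample}: ~{length / 1e6:.2f} Mb compared")
```

All three implied lengths fall in the range of a few megabases, which is consistent with the counts and percentages describing whole bacterial genome comparisons.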
Summary
Downstream analyses
-Gambit Taxon: Identical predictions across workflows for each sample.
-Both workflows produce identical results for most downstream analyses, ensuring reliable serotype predictions, taxonomic classifications, and virulence gene identifications.
Finally, the 44 validation ONT raw data samples were run through flye_denovo, and samples were checked manually against previously run Dragonflye submissions; we found comparable results.
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/47b75bb4-2a1b-4e41-b5f3-2f421e4e38ed
Suggested Scenarios for Reviewer to Test
Parameters to test:
skip_trim_reads: true
skip_polishing: false
polishing_rounds: 1
Expected outputs: Final polished assembly in FASTA format.
Metadata output for versions used (e.g., Flye, Medaka).
No trimming or filtering applied.
Successful Bandage plot and GFA graph generation.
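For reviewers who prefer to submit this scenario from a file, the sketch below writes the three parameters out as a workflow inputs JSON. The `flye_denovo.` key prefix is an assumption about how the inputs are namespaced, and the read file input is omitted because its name is not given here.

```python
import json

WORKFLOW = "flye_denovo"  # assumed namespace for the input keys


def build_inputs(**params):
    """Prefix each parameter with the workflow name, as WDL inputs JSON expects."""
    return {f"{WORKFLOW}.{name}": value for name, value in params.items()}


# Parameter names and values are taken from the suggested scenario above.
reviewer_inputs = build_inputs(
    skip_trim_reads=True,
    skip_polishing=False,
    polishing_rounds=1,
)

inputs_json = json.dumps(reviewer_inputs, indent=2)
print(inputs_json)
```

Keeping the scenario in a small script like this makes it easy to flip one parameter at a time and rerun.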
🔬 Final Developer Checklist
workflows_overview tables.
🎯 Reviewer Checklist