[New Workflow] Flye_denovo to replace DragonFlye #692

fraser-combe · 2024-12-13T22:13:34Z

This PR closes #611, closes #585, and closes #565.

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR introduces a new flye_denovo workflow as a replacement for the Dragonflye workflow. The updated workflow streamlines the assembly and polishing pipeline, focusing on being flexible and modular with the addition of assembly visualization through Bandage plots.

Notable enhancements include:

New tasks, including optional read trimming with Porechop, enhanced assembly visualization with Bandage, and multiple polishing options. Supports ONT data, hybrid assemblies with Illumina reads, and multiple assembly polishing tools (Medaka, Racon, and Polypolish).
Medaka polishing is set at 1 round as recommended by Rwick, and ONT

⚡ Impacted Workflows/Tasks

New flye_denovo workflow.
Replaces and enhances functionality previously offered by the Dragonflye workflow.
Tasks impacted:
- task_porechop.wdl
- task_flye.wdl
- task_bandageplot.wdl
- task_bwa.wdl
- task_medaka.wdl
- task_racon.wdl
- task_dnaapler.wdl
- task_polypolish.wdl
- task_filtercontigs.wdl
- Removes task_dragonfly.wdl

This PR may lead to different results in pre-existing outputs: Yes

This PR uses an element that could cause duplicate runs to have different results: Yes

Because this PR introduces optional polishing tools and changes assembly parameters, outputs may vary with the selected configuration.
This includes updated Medaka (including the most recent models), Polypolish, and Racon polishing tools relative to the versions used in Dragonflye.
Updated dnaapler for contig reorientation: faster run time, and tested for similar results by the authors of the tool.

🛠️ Changes

Added flye_denovo.wdl as a subworkflow to replace Dragonflye.
Enhanced modularity and task-level input definitions for flexibility.
Integrated multiple polishing and trimming options.
Introduced better documentation and metadata outputs for transparency and reproducibility.

⚙️ Algorithm

Workflow Redesign: The flye_denovo workflow replaces the Dragonflye workflow, with a modular and flexible structure that separates tasks like trimming, assembly, polishing, and final orientation for clarity and maintainability.
Polishing Enhancements:
- Added support for Medaka, Racon, and Polypolish with configurable rounds of polishing and tool-specific parameters.
- Support for hybrid assemblies using Illumina data with Polypolish.
Medaka Model Selection:
- Introduced automatic Medaka model selection based on the input reads or user-provided overrides.
- Falls back to a default Medaka model if automatic selection fails; the user can also override the model.
- Outputs the Medaka model used
Version Tracking:
- Outputs versions of Flye, Porechop, Medaka, Racon, Polypolish, Bandage, and Dnaapler.
Outputs:
- Outputs now include:
  - Final polished assembly.
  - Bandage plots for graph visualization.
  - Assembly graphs in GFA format.
  - Metadata for task versions
Docker Updates:
- Updated Docker images for Flye, Medaka, Racon, dnaapler and other tasks to their latest stable versions
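
The Medaka model selection described above (user override, automatic detection from the reads, fallback default) can be sketched as a small resolver. This is only an illustration of one plausible precedence, assuming an explicit user override wins over auto-detection. The function name, argument names, and model strings are hypothetical placeholders and are not taken from task_medaka.wdl.

```python
from typing import Optional, Tuple


def resolve_medaka_model(
    user_model: Optional[str],
    auto_detected: Optional[str],
    default_model: str,
) -> Tuple[str, str]:
    """Return (model, source) using override > auto-detected > default."""
    if user_model:
        # Explicit user override wins (assumed precedence).
        return user_model, "user"
    if auto_detected:
        # Model inferred from the input reads.
        return auto_detected, "auto"
    # Auto-detection failed or returned nothing: fall back to the default.
    return default_model, "default"


# Placeholder strings, not real Medaka model names.
print(resolve_medaka_model(None, "model_from_reads", "fallback_model"))
print(resolve_medaka_model(None, None, "fallback_model"))
print(resolve_medaka_model("my_model", "model_from_reads", "fallback_model"))
```

Returning the source alongside the model mirrors the idea of reporting which Medaka model was actually used.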

➡️ Inputs

No

⬅️ Outputs

Added Bandage plot PNG output
Version outputs for task-level software
Medaka model used
Assembly_fasta output from dnaapler for downstream analyses

🧪 Testing

Scenarios tested within TheiaProk. We expected the TheiaProk workflow to complete successfully for each task; for the flye_denovo workflow specifically, we expected successful creation of the assembly FASTA after any filtering or polishing.

Default path: Flye > Medaka polish > filter contigs > dnaapler
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/586af547-03dd-4cb8-8877-8041d0064464
medaka output model and version
Porechop run, i.e. skip_trim_reads = false
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/e5a645db-0e17-4c58-84ce-4e1f44ef9042
Skip polishing (skip_polishing = true)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/110a4ffa-c208-4849-a3b2-11d88ffddc90
Racon polishing pathway (polishing_rounds = 2)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/95694506-946f-4ef7-9b3d-657522cc7809
Hybrid assembly ONT data and Illumina (Polypolish and BWA)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/4fa9c915-4572-43b9-bd9b-28bed40c75a4
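
The tested paths above can be summarized as a function that lists which steps run for a given configuration. This is a rough sketch, not the workflow logic itself. skip_trim_reads and skip_polishing are the input names used in this PR, while polisher and has_illumina are hypothetical names. It also assumes that trimming is skipped by default, that providing Illumina reads replaces long-read polishing with BWA plus Polypolish, and that skip_polishing disables all polishers; none of these details are stated explicitly here. Bandage plotting is left out for brevity.

```python
from typing import List


def plan_steps(
    skip_trim_reads: bool = True,
    skip_polishing: bool = False,
    polisher: str = "medaka",     # hypothetical input name
    has_illumina: bool = False,   # hypothetical flag for hybrid assembly
) -> List[str]:
    """Return the ordered list of steps for one configuration."""
    steps: List[str] = []
    if not skip_trim_reads:
        steps.append("porechop")
    steps.append("flye")
    if not skip_polishing:
        if has_illumina:
            # Assumption: the hybrid path uses short-read polishing only.
            steps += ["bwa", "polypolish"]
        elif polisher == "racon":
            steps.append("racon")
        else:
            steps.append("medaka")
    steps += ["filter_contigs", "dnaapler"]
    return steps


print(plan_steps())                       # default path
print(plan_steps(skip_trim_reads=False))  # with Porechop trimming
print(plan_steps(skip_polishing=True))    # no polishing
print(plan_steps(polisher="racon"))       # Racon pathway
print(plan_steps(has_illumina=True))      # hybrid pathway
```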

Comparisons between Dragonflye and the new flye_denovo subworkflow
Here we look for similarities in assemblies, statistics, and downstream analyses across 8 selected bacterial samples.

Both workflows produce assemblies of similar lengths for each sample, with minor variations (typically within ±1%).
Both workflows achieve high BUSCO completeness scores: above 90% for all samples except the two Shigella sonnei samples (77.3% to 87.9%).
Both workflows consistently predict the same taxa for each sample.

Comparison tables

Table 1: Assembly Metrics & BUSCO Scores

Sample ID	Expected Genome Length (bp)	Workflow	Assembly Length (bp)	BUSCO Completeness (%)	BUSCO Fragmentation (%)	BUSCO Missing (%)	Taxonomic Prediction	N50 (bp)	# Contigs
ERR8958704	5,300,000	Dragonflye	5,750,961	98.2	0.7	1.1	Klebsiella pneumoniae	5,314,253	8
ERR8958704	5,300,000	Flye	5,732,385	98.0	0.9	1.1	Klebsiella pneumoniae	5,314,188	8
ERR8958706	5,300,000	Dragonflye	5,727,717	98.0	0.2	1.8	Klebsiella pneumoniae	5,314,190	8
ERR8958706	5,300,000	Flye	5,718,792	98.5	0.2	1.3	Klebsiella pneumoniae	5,314,197	9
ERR8958833	2,800,000	Dragonflye	2,902,609	100.0	0.0	0.0	Staphylococcus aureus	2,902,609	1
ERR8958833	2,800,000	Flye	2,902,617	100.0	0.0	0.0	Staphylococcus aureus	2,902,596	1
ERR8958835	2,800,000	Dragonflye	2,902,603	99.8	0.2	0.0	Staphylococcus aureus	2,902,603	1
ERR8958835	2,800,000	Flye	2,902,618	99.8	0.2	0.0	Staphylococcus aureus	2,902,601	1
SAMN05250424	4,800,000	Dragonflye	4,774,436	93.8	3.9	2.3	Salmonella enterica	4,685,874	5
SAMN05250424	4,800,000	Flye	4,778,142	92.0	6.1	1.9	Salmonella enterica	4,684,533	5
SAMN05596277	4,800,000	Dragonflye	4,778,588	94.5	3.4	2.1	Salmonella enterica	4,763,458	5
SAMN05596277	4,800,000	Flye	4,773,705	92.9	4.3	2.8	Salmonella enterica	4,762,539	4
SAMN23569621	4,900,000	Dragonflye	5,281,055	83.6	13.6	2.8	Shigella sonnei	4,818,873	10
SAMN23569621	4,900,000	Flye	5,248,069	87.9	9.8	2.3	Shigella sonnei	4,812,769	10
SAMN23605158	4,900,000	Dragonflye	5,207,959	77.3	17.0	5.7	Shigella sonnei	4,863,959	9
SAMN23605158	4,900,000	Flye	5,190,940	81.6	13.0	5.4	Shigella sonnei	4,856,756	9
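
The claim that assembly lengths agree to within about ±1% can be checked directly from Table 1. The sketch below computes the relative length difference per sample, using the Dragonflye length as the baseline. The choice of baseline is an assumption made for illustration; the function name is hypothetical.

```python
# (sample, Dragonflye assembly length, Flye assembly length) from Table 1
LENGTHS = [
    ("ERR8958704", 5_750_961, 5_732_385),
    ("ERR8958706", 5_727_717, 5_718_792),
    ("ERR8958833", 2_902_609, 2_902_617),
    ("ERR8958835", 2_902_603, 2_902_618),
    ("SAMN05250424", 4_774_436, 4_778_142),
    ("SAMN05596277", 4_778_588, 4_773_705),
    ("SAMN23569621", 5_281_055, 5_248_069),
    ("SAMN23605158", 5_207_959, 5_190_940),
]


def pct_length_diff(dragonflye_bp: int, flye_bp: int) -> float:
    """Signed length difference of Flye relative to Dragonflye, in percent."""
    return (flye_bp - dragonflye_bp) / dragonflye_bp * 100


for sample, dragonflye_bp, flye_bp in LENGTHS:
    print(f"{sample}\t{pct_length_diff(dragonflye_bp, flye_bp):+.3f}%")

# Largest absolute difference across the eight samples.
max_abs = max(abs(pct_length_diff(d, f)) for _, d, f in LENGTHS)
print(f"max |difference| = {max_abs:.2f}%")
```

For these eight samples the largest absolute difference is roughly 0.6% (SAMN23569621), so every pair falls inside the ±1% band.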

Table 2: SNP Comparison

Sample ID	SNP Differences (Flye vs. DragonFlye)	% Difference
ERR8958704	146	0.0025%
ERR8958706	0	0%
ERR8958833	0	0%
ERR8958835	0	0%
SAMN05250424	217	0.0045%
SAMN05596277	130	0.0027%
SAMN23569621	0	0%
SAMN23605158	0	0%
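
The percentages in Table 2 are consistent with dividing the SNP count by the assembly length. The PR does not state which denominator was used, so the sketch below assumes the Flye assembly length from Table 1; with that assumption it reproduces the reported values to four decimal places.

```python
# (sample, SNP differences from Table 2, Flye assembly length from Table 1)
SNPS = [
    ("ERR8958704", 146, 5_732_385),
    ("SAMN05250424", 217, 4_778_142),
    ("SAMN05596277", 130, 4_773_705),
]


def snp_pct(snp_count: int, genome_bp: int) -> float:
    """SNP differences as a percentage of the assembly length."""
    return snp_count / genome_bp * 100


for sample, snp_count, genome_bp in SNPS:
    print(f"{sample}\t{snp_count}\t{snp_pct(snp_count, genome_bp):.4f}%")
```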

SNP comparison summary (Table 2): Flye vs. Dragonflye assemblies
- Minimal differences: five of the eight samples showed no SNP differences between the Flye and Dragonflye assemblies, indicating high consistency between the two methods.
- Low variability in the remaining three samples:
  - ERR8958704 had 146 SNPs (0.0025% difference).
  - SAMN05250424 showed the highest SNP count (217 SNPs, 0.0045% difference).
  - SAMN05596277 had 130 SNPs (0.0027% difference).
- Stable genomes: samples ERR8958706, ERR8958833, ERR8958835, SAMN23569621, and SAMN23605158 had 0 SNPs, suggesting nearly identical assemblies.
Summary

Downstream analyses
- Gambit Taxon: Identical predictions across workflows for each sample.
- Both workflows produce identical results for most downstream analyses, ensuring reliable serotype predictions, taxonomic classifications, and virulence gene identifications.

Finally, the 44 validation ONT raw-data samples were run through flye_denovo and checked manually against previous Dragonflye submissions; the results were comparable.
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/47b75bb4-2a1b-4e41-b5f3-2f421e4e38ed

Suggested Scenarios for Reviewer to Test

Parameters to test:
skip_trim_reads: true
skip_polishing: false
polishing_rounds: 1

Expected outputs: Final polished assembly in FASTA format.
Metadata output for versions used (e.g., Flye, Medaka).
No trimming or filtering applied.
Successful Bandage plot and GFA graph generation.

🔬 Final Developer Checklist

The workflow/task has been tested and results, including file contents, are as anticipated
The CI/CD has been adjusted and tests are passing (Theiagen developers)
Code changes follow the style guide
Documentation and/or workflow diagrams have been updated if applicable
- You have updated the latest version for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables.

🎯 Reviewer Checklist

All changed results have been confirmed
You have tested the PR appropriately (see the testing guide for more information)
All code adheres to the style guide
MD5 sums have been updated
The PR author has addressed all comments
The documentation has been updated

fraser-combe · 2025-01-17T16:27:48Z

Thanks for all the comments

Updated filter_contigs.wdl - now uses Biopython instead of awk and updated output file text
Flye: uses the read_type input, with the default set to --nano-hq (open to thoughts on going back to --nano-raw); this seems most relevant for recent ONT data going forward
Ensured runtime parameters and names are standardized across tasks
Updated documentation

Local Testing: Successfully reran the workflow locally with default settings to confirm all tasks executed as expected. Confirmed filter_contigs new workflow output and tested with homopolymer and short read data to confirm trimming

Terra Testing: Reran the workflow in Terra using default settings. No major task overhauls were performed; results matched expectations for typical use cases.

rerun flye default

reran test after merge with main

AndrewLangvt

Just a few more to follow up on from Sage's previous comments.

docs/workflows/genomic_characterization/theiaprok.md

tasks/polishing/task_medaka.wdl

tasks/quality_control/read_filtering/task_filter_contigs.wdl

workflows/utilities/wf_flye_denovo.wdl

AndrewLangvt · 2025-01-21T15:14:31Z

Thanks for making all of these changes @fraser-combe. This looks much more sound, programmatically. I do have a few remaining questions, more on the biological comparison/assessment side of things. I understand the predicted Gambit Taxon was identical across samples assembled with Flye vs. Dragonflye, which is great. It looks like you have a note in this PR to "Add in comparison results." Would you please do that? I think we want to get granular here in how we assess the Flye/Dragonflye assemblies. In theory, as we're just unwrapping Dragonflye into its separate components, the assemblies should be largely similar. If you would, please pull together a table with the following (feel free to just update the table that already exists in this PR, if you like):
SNP distance between flye/dragonflye assembled genomes
genome length
BUSCO Scores
predicted taxon
contig length
# contigs
n50

Once you've got this, we can link up as a team to review & make sure our "hive mind" is in agreement across the board.

AndrewLangvt · 2025-01-29T15:22:55Z

Thanks for adding the table of SNP differences here @fraser-combe. This shows what I was hoping for - minimal impact to genome assembly. I'm good with merging this PR. I know @sage-wright was taking a look this morn, as well. So, pending any final comments from her, we can get this thing merged!

sage-wright

Code looks good! Running a final sanity check here for final confirmation. Will merge upon success!

tasks/quality_control/read_filtering/task_filter_contigs.wdl

workflows/theiaprok/wf_theiaprok_ont.wdl

fraser-combe · 2025-01-29T19:46:29Z

Updated the docker image name

@sage-wright I reran your submission from before here

I reran the 8 comparison samples and received identical results, so there is no effect on the comparisons: no contigs were filtered with either the old or the new Docker image, because all contigs met the minimum length threshold of 1000 bp and none were homopolymers.
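
The two filtering criteria mentioned above (minimum contig length of 1000 bp, and removal of homopolymer contigs) can be illustrated with a short sketch. It is not the code in task_filter_contigs.wdl, which uses Biopython according to the earlier comment; this version parses FASTA with the standard library only so that it stays self-contained. The function names are hypothetical, and treating a "homopolymer" as a contig made of a single repeated base is an assumption.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_homopolymer(seq: str) -> bool:
    """True if the sequence consists of one repeated base (assumed definition)."""
    return len(set(seq.upper())) == 1


def filter_contigs(
    records: Iterable[Tuple[str, str]], min_length: int = 1000
) -> Dict[str, str]:
    """Keep contigs that are long enough and are not homopolymers."""
    kept: Dict[str, str] = {}
    for header, seq in records:
        if len(seq) < min_length:
            continue  # shorter than the minimum length threshold
        if is_homopolymer(seq):
            continue  # single-base contig
        kept[header] = seq
    return kept


example = [
    ">contig_1", "ACGT" * 300,  # 1200 bp, mixed bases: kept
    ">contig_2", "ACGT" * 100,  # 400 bp: too short
    ">contig_3", "A" * 1500,    # single-base run: dropped
]
print(list(filter_contigs(read_fasta(example))))
```

When every contig passes both checks, as reported for the 8 comparison samples, the filtered assembly is identical to the input.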

sage-wright · 2025-01-30T18:56:46Z

Thanks for the updates. Running a TheiaValidate here for confirmation, will merge upon success 👍

sage-wright and others added 30 commits October 4, 2024 16:39

placeholder

38b1caa

make flye task

cdf0913

rename fasta

46629d8

make workflow a workflow

b45b41d

update output files for flye

62b48e2

v1 bandage plot flye assembly visual

787121c

medaka initial commit

e9f29b4

initial commit racon framework

df7f79f

framework for tasks dnaapler porechop and racon

5a3fcd6

update docker images

d6fe1bc

update outdir medaka

89f7d3b

update medaka and dnaapler

35982db

add polypolish and separate bwa mem -a tasks

d738ab1

remove comment cruft

85a68dc

initial commit bash contig filtering

c7db697

initial commit bash contig filtering

0371972

update medaka docker image

facd02f

refactor assembly tasks and workflows for clarity and consistency

d26fc95

add dnaapler to wf

d45481e

update racon

eef27a9

add polisher options to flye_consensus wf

0fee522

update workflow and tasks and altered racon with minimap in docker to…

17289d1

… add

update dnapler and tody flye consensus wf

646b116

update filter contigs task initial attempt aowrking

8e56dc0

updated flye consensus wf and filter contigs

f37bf3b

update docker images porechop and dnaapler

2865a2d

optional trim and polish tasks, update porechop and dnaapler mode

f9cbde4

incporporate hybrid assemblies with polypolish

4b97b30

update meta wf description

1993791

start updating docs, remove run polypolish logic and update po0lypoli…

fa10d0c

…sh task

fraser-combe added 7 commits January 16, 2025 15:58

update fasta name

8ce6b13

update bandage plot output for theiaprok_ont wdl

0750790

update bandage plot output for theiaprok_ont wdl

797882a

update docs and minor updates

8439292

Merge branch 'main' into smw-flye-dev

1403c6a

correct merge

ea4ecd0

update md5sums

330fb60

fraser-combe requested a review from AndrewLangvt January 17, 2025 16:28

AndrewLangvt requested changes Jan 17, 2025

update input defaults and medaka model resolving and defauilt model s…

4cb2f9b

…etting

fraser-combe added 4 commits January 21, 2025 14:35

update docker image for filter contigs biopython

de4a766

remove default values wf level

b7b97cb

remove default values wf level

bf061ce

add back in bandage plot options

8a0daa3

fraser-combe requested a review from AndrewLangvt January 24, 2025 02:47

sage-wright reviewed Jan 29, 2025

sage-wright requested changes Jan 29, 2025

tasks/quality_control/read_filtering/task_filter_contigs.wdl (outdated, resolved)

sage-wright reviewed Jan 29, 2025

workflows/theiaprok/wf_theiaprok_ont.wdl (outdated, resolved)

sage-wright reviewed Jan 29, 2025

workflows/theiaprok/wf_theiaprok_ont.wdl (outdated, resolved)

fraser-combe added 2 commits January 29, 2025 10:51

update

0909205

update docker

5be645c

fraser-combe requested a review from sage-wright January 29, 2025 19:48

AndrewLangvt assigned awh082834 and unassigned andrewjpage Feb 12, 2025

sage-wright marked this pull request as draft February 18, 2025 14:26
