[New Workflow] Flye_denovo to replace DragonFlye #692

Draft
wants to merge 76 commits into
base: main

Conversation

fraser-combe
Contributor

@fraser-combe fraser-combe commented Dec 13, 2024

This PR closes #611, closes #585, and closes #565.

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR introduces a new flye_denovo workflow as a replacement for the Dragonflye workflow. The updated workflow streamlines the assembly and polishing pipeline, focusing on being flexible and modular with the addition of assembly visualization through Bandage plots.

Notable enhancements include:

New tasks, including optional read trimming with Porechop, enhanced assembly visualization with Bandage, and multiple polishing options. The workflow supports ONT data, hybrid assemblies with Illumina reads, and multiple assembly polishing tools (Medaka, Racon, and Polypolish).
Medaka polishing is set at 1 round as recommended by Rwick, and ONT

⚡ Impacted Workflows/Tasks

  • New flye_denovo workflow.
  • Replaces and enhances functionality previously offered by the Dragonflye workflow.
  • Tasks impacted:
    • task_porechop.wdl
    • task_flye.wdl
    • task_bandageplot.wdl
    • task_bwa.wdl
    • task_medaka.wdl
    • task_racon.wdl
    • task_dnaapler.wdl
    • task_polypolish.wdl
    • task_filtercontigs.wdl
  • Task removed: task_dragonfly.wdl

This PR may lead to different results in pre-existing outputs: Yes

This PR uses an element that could cause duplicate runs to have different results: Yes

  • Because polishing tools are now optional and assembly parameters have changed, outputs may vary with the selected configuration.
  • The Medaka (including the most recent models), Polypolish, and Racon versions are newer than those bundled with Dragonflye.
  • Dnaapler has been updated for contig reorientation; the tool's authors report a faster run time with similar results.

🛠️ Changes

  • Added flye_denovo.wdl as a subworkflow to replace Dragonflye.
  • Enhanced modularity and task-level input definitions for flexibility.
  • Integrated multiple polishing and trimming options.
  • Introduced better documentation and metadata outputs for transparency and reproducibility.

⚙️ Algorithm

  1. Workflow Redesign: The flye_denovo workflow replaces the Dragonflye workflow, with a modular and flexible structure that separates tasks like trimming, assembly, polishing, and final orientation for clarity and maintainability.
  2. Polishing Enhancements:
    • Added support for Medaka, Racon, and Polypolish with configurable rounds of polishing and tool-specific parameters.
    • Support for hybrid assemblies using Illumina data with Polypolish.
  3. Medaka Model Selection:
    • Introduced automatic Medaka model selection based on the input reads, with an optional user-provided override.
    • Falls back to a default Medaka model if automatic selection fails.
    • Outputs the Medaka model used.
  4. Version Tracking:
    • Outputs versions of Flye, Porechop, Medaka, Racon, Polypolish, Bandage, and Dnaapler.
  5. Outputs:
    • Outputs now include:
      • Final polished assembly.
      • Bandage plots for graph visualization.
      • Assembly graphs in GFA format.
      • Metadata for task versions
  6. Docker Updates:
    • Updated Docker images for Flye, Medaka, Racon, dnaapler and other tasks to their latest stable versions
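
The Medaka model selection in step 3 can be read as a simple precedence rule. The following is a minimal sketch of that rule, not the task's actual code: the function name, argument names, and the `FALLBACK_MODEL` placeholder are hypothetical, and the precedence (user override, then automatic detection, then default) is an assumption based on the description above.

```python
from typing import Optional

# Hypothetical placeholder: the PR does not state which model is the default.
FALLBACK_MODEL = "<default-medaka-model>"


def resolve_medaka_model(user_model: Optional[str],
                         auto_model: Optional[str]) -> str:
    """Return the Medaka model name that would be reported as an output.

    Assumed precedence: explicit user override, then the automatically
    detected model, then the default when detection found nothing.
    """
    if user_model:
        return user_model
    if auto_model:
        return auto_model
    return FALLBACK_MODEL
```

Passing `None` (or an empty string) for `auto_model` stands in for the case where automatic detection fails.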

➡️ Inputs

No

⬅️ Outputs

  • Added a Bandage plot PNG output.
  • Added version outputs for task-level software.
  • Added an output reporting the Medaka model(s) used.
  • Added an Assembly_fasta output from dnaapler for downstream analyses.

🧪 Testing

Scenarios were tested within TheiaProk. For each scenario, the TheiaProk workflow is expected to complete successfully for every task; for the flye_denovo workflow specifically, an assembly FASTA should be created after any filtering or polishing.

  1. Default path: Flye > Medaka polish > filter contigs > dnaapler
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/586af547-03dd-4cb8-8877-8041d0064464
    Medaka output model and version

  2. Porechop run, i.e., skip_trim_reads = false
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/e5a645db-0e17-4c58-84ce-4e1f44ef9042

  3. Skip polishing: skip_polishing = true
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/110a4ffa-c208-4849-a3b2-11d88ffddc90

  4. Racon polishing pathway (polishing_rounds = 2)
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/95694506-946f-4ef7-9b3d-657522cc7809

  5. Hybrid assembly ONT data and Illumina (Polypolish and BWA)
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/4fa9c915-4572-43b9-bd9b-28bed40c75a4
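
To make the five scenarios easier to compare, here is a rough sketch of the task order each parameter set would be expected to produce. It is illustrative only. `skip_trim_reads`, `skip_polishing`, and `polishing_rounds` are inputs named in this PR; the `polisher` and `hybrid` arguments are hypothetical stand-ins, and the sketch assumes that hybrid runs use BWA plus Polypolish in place of the long-read polisher. Bandage plotting is left out because its position in the workflow is not described here.

```python
from typing import List


def planned_steps(skip_trim_reads: bool = True,
                  skip_polishing: bool = False,
                  polisher: str = "medaka",
                  polishing_rounds: int = 1,
                  hybrid: bool = False) -> List[str]:
    """List the tasks expected to run, in order, for one configuration."""
    steps: List[str] = []
    if not skip_trim_reads:
        steps.append("porechop")            # optional read trimming
    steps.append("flye")                    # de novo assembly
    if not skip_polishing:
        if hybrid:
            # Assumption: Illumina reads are aligned with BWA, then Polypolish runs.
            steps += ["bwa", "polypolish"]
        else:
            # Assumption: the chosen polisher repeats once per round.
            steps += [polisher] * polishing_rounds
    steps += ["filter_contigs", "dnaapler"]  # filtering, then reorientation
    return steps
```

With the defaults this returns the default path from scenario 1; `planned_steps(polisher="racon", polishing_rounds=2)` corresponds to scenario 4.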

Comparisons between Dragonflye and the new flye_denovo subworkflow
Eight bacterial samples were selected to check that assemblies, assembly statistics, and downstream analyses are similar between the two workflows.

Both workflows produce assemblies of similar length for each sample, with all differences within ±1%.
Both workflows achieve high BUSCO completeness: above 90% for six of the eight samples, and between 77% and 88% for the two Shigella sonnei samples.
Both workflows predict the same taxon for every sample.
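
The length claim can be checked directly against the assembly lengths reported in Table 1. The sketch below does that; the choice of the Dragonflye length as the denominator is an assumption, since the PR does not say how the percentage was computed.

```python
# (Dragonflye length, Flye length) in bp, copied from Table 1.
LENGTHS = {
    "ERR8958704": (5_750_961, 5_732_385),
    "ERR8958706": (5_727_717, 5_718_792),
    "ERR8958833": (2_902_609, 2_902_617),
    "ERR8958835": (2_902_603, 2_902_618),
    "SAMN05250424": (4_774_436, 4_778_142),
    "SAMN05596277": (4_778_588, 4_773_705),
    "SAMN23569621": (5_281_055, 5_248_069),
    "SAMN23605158": (5_207_959, 5_190_940),
}


def percent_length_difference(dragonflye_bp: int, flye_bp: int) -> float:
    """Absolute length difference as a percentage of the Dragonflye length."""
    return abs(flye_bp - dragonflye_bp) / dragonflye_bp * 100


differences = {
    sample: percent_length_difference(d, f) for sample, (d, f) in LENGTHS.items()
}
worst = max(differences.values())
print(f"largest difference: {worst:.2f}%")
```

Under that assumption the largest difference is for SAMN23569621, and every sample stays below 1%.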

Comparison tables

Table 1: Assembly Metrics & BUSCO Scores

| Sample ID | Expected Genome Length (bp) | Workflow | Assembly Length (bp) | BUSCO Completeness (%) | BUSCO Fragmentation (%) | BUSCO Missing (%) | Taxonomic Prediction | N50 (bp) | # Contigs |
|---|---|---|---|---|---|---|---|---|---|
| ERR8958704 | 5,300,000 | Dragonflye | 5,750,961 | 98.2 | 0.7 | 1.1 | Klebsiella pneumoniae | 5,314,253 | 8 |
| ERR8958704 | 5,300,000 | Flye | 5,732,385 | 98.0 | 0.9 | 1.1 | Klebsiella pneumoniae | 5,314,188 | 8 |
| ERR8958706 | 5,300,000 | Dragonflye | 5,727,717 | 98.0 | 0.2 | 1.8 | Klebsiella pneumoniae | 5,314,190 | 8 |
| ERR8958706 | 5,300,000 | Flye | 5,718,792 | 98.5 | 0.2 | 1.3 | Klebsiella pneumoniae | 5,314,197 | 9 |
| ERR8958833 | 2,800,000 | Dragonflye | 2,902,609 | 100.0 | 0.0 | 0.0 | Staphylococcus aureus | 2,902,609 | 1 |
| ERR8958833 | 2,800,000 | Flye | 2,902,617 | 100.0 | 0.0 | 0.0 | Staphylococcus aureus | 2,902,596 | 1 |
| ERR8958835 | 2,800,000 | Dragonflye | 2,902,603 | 99.8 | 0.2 | 0.0 | Staphylococcus aureus | 2,902,603 | 1 |
| ERR8958835 | 2,800,000 | Flye | 2,902,618 | 99.8 | 0.2 | 0.0 | Staphylococcus aureus | 2,902,601 | 1 |
| SAMN05250424 | 4,800,000 | Dragonflye | 4,774,436 | 93.8 | 3.9 | 2.3 | Salmonella enterica | 4,685,874 | 5 |
| SAMN05250424 | 4,800,000 | Flye | 4,778,142 | 92.0 | 6.1 | 1.9 | Salmonella enterica | 4,684,533 | 5 |
| SAMN05596277 | 4,800,000 | Dragonflye | 4,778,588 | 94.5 | 3.4 | 2.1 | Salmonella enterica | 4,763,458 | 5 |
| SAMN05596277 | 4,800,000 | Flye | 4,773,705 | 92.9 | 4.3 | 2.8 | Salmonella enterica | 4,762,539 | 4 |
| SAMN23569621 | 4,900,000 | Dragonflye | 5,281,055 | 83.6 | 13.6 | 2.8 | Shigella sonnei | 4,818,873 | 10 |
| SAMN23569621 | 4,900,000 | Flye | 5,248,069 | 87.9 | 9.8 | 2.3 | Shigella sonnei | 4,812,769 | 10 |
| SAMN23605158 | 4,900,000 | Dragonflye | 5,207,959 | 77.3 | 17.0 | 5.7 | Shigella sonnei | 4,863,959 | 9 |
| SAMN23605158 | 4,900,000 | Flye | 5,190,940 | 81.6 | 13.0 | 5.4 | Shigella sonnei | 4,856,756 | 9 |

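
For readers less familiar with the N50 column: it is the length of the contig at which the running total of contig lengths, sorted longest first, reaches at least half of the assembly length. A minimal sketch of that standard definition follows; the contig lengths in the usage note are made up for illustration and are not taken from these assemblies.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (standard definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that takes the running sum to half the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For example, with illustrative lengths `[40, 30, 20, 10]` the running sum passes half of 100 at the second contig, so the N50 is 30.
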
Table 2: SNP Comparison

| Sample ID | SNP Differences (Flye vs. Dragonflye) | % Difference |
|---|---|---|
| ERR8958704 | 146 | 0.0025% |
| ERR8958706 | 0 | 0% |
| ERR8958833 | 0 | 0% |
| ERR8958835 | 0 | 0% |
| SAMN05250424 | 217 | 0.0045% |
| SAMN05596277 | 130 | 0.0027% |
| SAMN23569621 | 0 | 0% |
| SAMN23605158 | 0 | 0% |
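
The % Difference column is consistent with dividing the SNP count by the assembly length. The sketch below reproduces the three non-zero rows under the assumption that the Flye assembly length from Table 1 is the denominator; the PR does not state which length was used, and the Dragonflye lengths give the same values at this precision.

```python
# sample -> (SNP differences from Table 2, Flye assembly length in bp from Table 1)
NONZERO_ROWS = {
    "ERR8958704": (146, 5_732_385),
    "SAMN05250424": (217, 4_778_142),
    "SAMN05596277": (130, 4_773_705),
}


def snp_percent(snp_count: int, genome_length_bp: int) -> float:
    """SNP differences as a percentage of the genome length."""
    return snp_count / genome_length_bp * 100


percentages = {
    sample: round(snp_percent(snps, length), 4)
    for sample, (snps, length) in NONZERO_ROWS.items()
}
for sample, value in percentages.items():
    print(f"{sample}: {value:.4f}%")
```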

SNP comparison summary (Flye vs. Dragonflye assemblies), from Table 2:
  • No differences in most samples: five of the eight samples (ERR8958706, ERR8958833, ERR8958835, SAMN23569621, and SAMN23605158) had 0 SNPs between the Flye and Dragonflye assemblies, suggesting nearly identical assemblies and high consistency between the two methods.
  • Minimal variability in the remaining three samples:
    • ERR8958704 had 146 SNPs (0.0025% difference).
    • SAMN05250424 had the highest SNP count (217 SNPs, 0.0045% difference).
    • SAMN05596277 had 130 SNPs (0.0027% difference).
Summary

Downstream analyses
  • Gambit taxon: identical predictions across workflows for each sample.
  • Both workflows produce identical results for most downstream analyses, giving consistent serotype predictions, taxonomic classifications, and virulence gene identifications.

Finally, the 44 ONT raw-data validation samples were run through flye_denovo and checked manually against the earlier Dragonflye submissions; the results were comparable.
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/47b75bb4-2a1b-4e41-b5f3-2f421e4e38ed

Suggested Scenarios for Reviewer to Test

Parameters to test:
skip_trim_reads: true
skip_polishing: false
polishing_rounds: 1

Expected outputs:
  • Final polished assembly in FASTA format.
  • Metadata output for versions used (e.g., Flye, Medaka).
  • No trimming or filtering applied.
  • Successful Bandage plot and GFA graph generation.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the latest version for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables.

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@fraser-combe
Contributor Author

fraser-combe commented Jan 17, 2025

Thanks for all the comments

  • Updated filter_contigs.wdl: it now uses Biopython instead of awk, and the output file text has been updated.
  • Flye now takes a read_type input, with the default set to --nano-hq, which seems most relevant for recent ONT data going forward (open to thoughts on going back to --nano-raw).
  • Standardized runtime parameters and their names across tasks.
  • Updated documentation.

Local Testing: Successfully reran the workflow locally with default settings to confirm that all tasks executed as expected. Confirmed the new filter_contigs output, and tested with homopolymer and short read data to confirm trimming.

Terra Testing: Reran the workflow in Terra using default settings. No major task overhauls were performed; results matched expectations for typical use cases.

rerun flye default

reran test after merge with main

Contributor

@AndrewLangvt AndrewLangvt left a comment


Just a few more to follow up on from Sage's previous comments.

@AndrewLangvt
Contributor

AndrewLangvt commented Jan 21, 2025

Thanks for making all of these changes @fraser-combe. This looks much more sound, programmatically. I do have a few remaining questions, more on the biological comparison/assessment side of things. I understand the predicted Gambit Taxon was identical across samples assembled with flye vs dragonflye, which is great. It looks like you have a note in this PR to "Add in comparison results." Would you please do that? I think we want to get granular here in how we assess the flye/dragonflye assemblies. In theory, as we're just unwrapping Dragonflye into its separate components, the assemblies should be largely similar. If you would, please pull together a table with the following (feel free to just update the table that already exists in this PR, if you like):
  • SNP distance between flye/dragonflye assembled genomes
  • genome length
  • BUSCO scores
  • predicted taxon
  • contig length
  • # contigs
  • N50

Once you've got this, we can link up as a team to review & make sure our "hive mind" is in agreement across the board.

@AndrewLangvt
Contributor

Thanks for adding the table of SNP differences here @fraser-combe. This shows what I was hoping for - minimal impact to genome assembly. I'm good with merging this PR. I know @sage-wright was taking a look this morn, as well. So, pending any final comments from her, we can get this thing merged!

Member

@sage-wright sage-wright left a comment


Code looks good! Running a final sanity check here for final confirmation. Will merge upon success!

@fraser-combe
Contributor Author

fraser-combe commented Jan 29, 2025

Updated the docker image name

@sage-wright I reran your submission from before here

I reran the 8 comparison samples and received identical results, so the comparisons are unaffected: no contigs were filtered with either the old or the new docker image, because all contigs met the minimum length threshold of 1000 bp and none were homopolymers.
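
For context, the two filtering criteria mentioned here (a 1000 bp minimum length and no homopolymer contigs) can be sketched as follows. This is not the code in filter_contigs.wdl, which uses Biopython; the sketch uses only the standard library so it stays self-contained, the function names are hypothetical, and it assumes "homopolymer" means a contig made up of a single repeated base.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA text into {contig name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_homopolymer(sequence: str) -> bool:
    """Assumed meaning: the contig consists of one repeated base."""
    return len(set(sequence.upper())) == 1


def filter_contigs(records: Dict[str, str], min_length: int = 1000) -> Dict[str, str]:
    """Keep contigs that reach min_length and are not homopolymers."""
    return {
        name: seq
        for name, seq in records.items()
        if len(seq) >= min_length and not is_homopolymer(seq)
    }


example = [
    ">contig_1 long mixed sequence", "ACGT" * 300,   # 1200 bp, kept
    ">contig_2 homopolymer", "A" * 1500,             # dropped: single base
    ">contig_3 short", "ACGT" * 100,                 # dropped: 400 bp
]
kept = filter_contigs(parse_fasta(example))
print(sorted(kept))
```

In the illustrative input, only `contig_1` passes both checks, which matches the situation described above where none of the real contigs were removed.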

@sage-wright
Member

Thanks for the updates. Running a TheiaValidate here for confirmation, will merge upon success 👍

@sage-wright sage-wright marked this pull request as draft February 18, 2025 14:26