Merge branch 'CW-4776' into 'dev'

Add nextclade [CW-4776] Closes CW-4776 See merge request epi2melabs/workflows/wf-mpx!39
epi2me-labs · Sep 4, 2024 · 05d0253 · 05d0253
2 parents 2b6c1db + d7a033d
commit 05d0253
Show file tree

Hide file tree

Showing 23 changed files with 10,074 additions and 78 deletions.
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -10,7 +10,6 @@ variables:
     # (not things such as -profile)
     NF_WORKFLOW_OPTS: >
         --fastq test_data/fastq/barcode01
-        --override_basecaller_cfg [email protected]
         -executor.\$$local.memory 32GB
     CI_FLAVOUR: "new"
     NF_IGNORE_PROCESSES: "noAssembly"
@@ -45,19 +44,15 @@ docker-run:
         NF_WORKFLOW_OPTS: >
           -executor.\$$local.memory 32GB
           --fastq test_data/fastq/barcode01
-          --override_basecaller_cfg [email protected]
           --assembly false
-        NF_IGNORE_PROCESSES: flyeAssembly,medaka_polish,bamtobed
+        NF_IGNORE_PROCESSES: "flyeAssembly,medaka_polish,bamtobed,nextclade_full"
     - if: $MATRIX_NAME == "bam"
       variables:
         NF_WORKFLOW_OPTS: >
           -executor.\$$local.memory 32GB
           --bam test_data/bam
-          --override_basecaller_cfg [email protected]
     - if: $MATRIX_NAME == "ubam"
       variables:
-        # the uBAM test data contains a basecall model (no
-        # `--override_basecaller_cfg` needed)
         NF_WORKFLOW_OPTS: >
           -executor.\$$local.memory 32GB
           --bam test_data/ubam
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -8,7 +8,7 @@ repos:
         always_run: true
         pass_filenames: false
         additional_dependencies:
-          - epi2melabs==0.0.55
+          - epi2melabs==0.0.56
       - id: build_models
         name: build_models
         entry: datamodel-codegen --strict-nullable --base-class workflow_glue.results_schema_helpers.BaseModel --use-schema-description --disable-timestamp --input results_schema.yml --input-file-type openapi --output bin/workflow_glue/results_schema.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+
+## [v1.1.1]
+### Added
+- Nextclade typing for both reference-based and optional _de novo_ consensus assemblies.
+- Clade I reference sequence (`NC_003310.1`) is now available for selection.
+### Changed
+- Default reference sequence changed to `NC_003310.1` (Clade I) following mpox PHEIC announcement.
+
 ## [v1.1.0]
 ### Added
 - Support for uBAM (unaligned BAM) input files.

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-# Mpox metagenomics assembly workflow
+# MPXV metagenomic assembly workflow
 
-A basic workflow for analysing ONT Mpox data to create draft consensus or de-novo assemblies.
+A workflow for analysing ONT monkeypox virus (MPXV) sequences to create draft consensus or de novo assemblies.
 
 
 
@@ -75,6 +75,7 @@ the following command:
 ```
 nextflow pull epi2me-labs/wf-mpx
 ```
+
 A demo dataset is provided for testing of the workflow.
 It can be downloaded and unpacked using the following commands:
 ```
@@ -87,6 +88,7 @@ nextflow run epi2me-labs/wf-mpx \
 	--fastq 'wf-mpx-demo/fastq/barcode01' \
 	-profile standard
 ```
+
 For further information about running a workflow on
 the command line see https://labs.epi2me.io/wfquickstart/
 
@@ -126,7 +128,7 @@ input_reads.fastq.gz        -── input_directory
 |--------------------------|------|-------------|------|---------|
 | bam | string | BAM file to use in the analysis. | Path to a single BAM file, or a single directory containing one or more BAM files. Multiple samples are not currently supported. |  |
 | fastq | string | FASTQ file to use in the analysis. | Path to a single FASTQ file, or a single directory containing one or more FASTQ files. Multiple samples or barcodes are not currently supported. |  |
-| reference | string | The reference genome to use for mapping. | This is used if inputting a FASTQ file. We provide four popular mpox reference sequences with which to map your reads to. More information can be found here: https://labs.epi2me.io/basic-mpox-workflow | MT903344.1 |
+| reference | string | The reference genome to use for mapping. | This is used if inputting a FASTQ file. We provide five MPXV reference sequences for alignment, representing clades I and II. More information can be found in our [blog post introducing this workflow](https://labs.epi2me.io/basic-mpox-workflow) | NC_003310.1 |
 
 
 ### Sample Options
@@ -148,7 +150,7 @@ input_reads.fastq.gz        -── input_directory
 | Nextflow parameter name  | Type | Description | Help | Default |
 |--------------------------|------|-------------|------|---------|
 | override_basecaller_cfg | string | Override auto-detected basecaller model that processed the signal data; used to select an appropriate Medaka model. | Per default, the workflow tries to determine the basecall model from the input data. This parameter can be used to override the detected value (or to provide a model name if none was found in the inputs). However, users should only do this if they know for certain which model was used as selecting the wrong option might give sub-optimal results. A list of recent models can be found here: https://github.com/nanoporetech/dorado#DNA-models. |  |
-| assembly | boolean | Perform assembly with flye. | Take the reads mapped to the mpx reference and perform a de novo assembly using flye. This might be useful to uncover any structural variation not present in your chosen reference genome. Turning off will reduce run times and computational intensity at the expense of not being able to identify structural variants. | True |
+| assembly | boolean | Perform assembly with flye. | Take the reads mapped to the mpx reference and perform a de novo assembly using Flye. This might be useful to uncover any structural variation not present in your chosen reference genome. Turning off will reduce run times and computational intensity at the expense of not being able to identify structural variants. | True |
 | min_coverage | number | Coverage threshold for masking in the consensus step. | Regions with less than the coverage entered here are masked (nucleotide sequences will be replaced with N) when generating the consensus. It might be useful to decrease this value if you have very low coverage, but this will affect the quality of your consensus sequence. | 20 |
 
 
@@ -163,8 +165,8 @@ Output files may be aggregated including information for all samples or provided
 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|
 | Workflow report | ./wf-mpx-report.html | The report for the workflow | aggregated |
-| Consensus assembly FASTA | ./consensus.fasta | De-novo consensus assembly sequence from flye and polished by medaka. | per-sample |
-| Draft consensus FASTA | ./{{ alias }}.draft.consensus.fasta | Draft consensus sequence from bcftools. | per-sample |
+| De novo consensus assembly FASTA | ./denovo.consensus.fasta | De novo consensus assembly sequence from Flye and polished by Medaka. | per-sample |
+| Reference-based consensus assembly FASTA | ./{{ alias }}.ref.consensus.fasta | Reference-based consensus sequence from Bcftools. | per-sample |
 | Read stats | ./{{ alias }}.per-read-stats.tsv.gz | A simple text file providing a summary of sequencing reads. | per-sample |
 | Read alignment | ./{{ alias }}.bam | Read alignments in BAM format. | per-sample |
 | Alignment index file | ./{{ alias }}.bam.bai | Index file of BAM file. | per-sample |

diff --git a/bin/workflow_glue/check_bam_headers_in_dir.py b/bin/workflow_glue/check_bam_headers_in_dir.py
@@ -60,6 +60,6 @@ def main(args):
 
 def argparser():
     """Argument parser for entrypoint."""
-    parser = wf_parser("check_bam_headers")
+    parser = wf_parser("check_bam_headers_in_dir")
     parser.add_argument("input_path", type=Path, help="Path to target directory")
     return parser
diff --git a/bin/workflow_glue/check_xam_index.py b/bin/workflow_glue/check_xam_index.py
@@ -38,6 +38,6 @@ def main(args):
 
 def argparser():
     """Argument parser for entrypoint."""
-    parser = wf_parser("check_bam_headers")
+    parser = wf_parser("check_xam_index")
     parser.add_argument("input_xam", type=Path, help="Path to target XAM")
     return parser
diff --git a/bin/workflow_glue/get_max_depth_locus.py b/bin/workflow_glue/get_max_depth_locus.py
@@ -49,7 +49,7 @@ def main(args):
 
 def argparser():
     """Argument parser for entrypoint."""
-    parser = wf_parser("check_bam_headers")
+    parser = wf_parser("get_max_depth_locus")
     parser.add_argument(
         "depths_bed",
         type=Path,

diff --git a/bin/workflow_glue/report.py b/bin/workflow_glue/report.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 """Create workflow report."""
-
+import json
 import math
 import os
 
@@ -80,9 +80,9 @@ def make_coverage_section(coverage_file, variants_data, report_doc):
     section = report_doc.add_section()
     section._add_item("""<h3>Genome coverage</h3>
 
-    <p>The plot below shows coverage of the whole genome. Dots represent
-    single nucleotide polymorphisms called by medaka. Bars represent
-    insertions and deletions called by medaka.</p>""")
+    <p>The plot below shows read depth across the whole genome, annotated with variants
+    in the reference-based assembly called by Medaka. Dots represent single nucleotide
+    polymorphisms, while bars represent insertions and deletions.</p>""")
 
     max_coverage = max(coverage['depth'])
 
@@ -137,12 +137,12 @@ def get_context(variant, fasta):
 def make_variants_context(variants_data, reference, report):
     """Make variants context section."""
     section = report.add_section()
-    section._add_item("""<h3>Variant context</h3>
+    section._add_item("""<h3>Mutational spectrum</h3>
 
-    <p>APOBEC3 host enzyme action has been implicated in the mutational process
-    of MPX. As described in <a href="https://tinyurl.com/2fawchvu">this</a>
-    <a href="https://virological.org">virological.org</a> post by Áine O’Toole
-    & Andrew Rambaut.<p>""")
+    <p>Evidence of APOBEC3 host enzyme editing has been observed in Monkeypox virus
+    (MPXV), as described in <a href="https://tinyurl.com/2fawchvu">this virological.org
+    post</a> by Áine O’Toole & Andrew Rambaut. The mutational spectrum of the
+    reference-based assembly is shown below.</p>""")
 
     variants_data['context'] = [
         get_context(x, reference) for x in zip(
@@ -168,8 +168,8 @@ def make_variants_table(variants_data, report):
     section = report.add_section()
     section._add_item("""<h3>All variants</h3>
 
-    <p>Variants with respect to the reference sequence as called by medaka
-    .</p>""")
+    <p>The following table shows variants in the reference-based assembly called
+    by Medaka with respect to the chosen reference sequence.</p>""")
     section.table(
             variants_data,
             index=False,
@@ -182,10 +182,10 @@ def make_variants_table(variants_data, report):
 def make_assembly_summary(bed, report):
     """Make a plot for the assembly."""
     section = report.add_section()
-    section._add_item("""<h3>Flye Assembly</h3>
+    section._add_item("""<h3>De novo assembly</h3>
 
-    <p>Quick assembly of the reads. Below the contigs are displayed as mapped
-    to the reference chosen for analysis.</p>""")
+    <p>The plot below shows an alignment of <em>de novo</em> assembly contigs to the
+    chosen reference sequence.</p>""")
 
     contigs = pd.read_csv(bed, sep='\t', header=0).set_axis(
             ['reference', 'start', 'end', 'name', 'qual', 'strand'],
@@ -215,6 +215,8 @@ def make_assembly_summary(bed, report):
         text_color=ont_colors.BRAND_BLUE)
     p.add_layout(labels)
 
+    p.xaxis.formatter = BasicTickFormatter(use_scientific=False)
+
     section.plot(p)
 
 
@@ -241,6 +243,12 @@ def argparser():
     parser.add_argument(
         "--params", default=None, required=True,
         help="A JSON file containing the workflow parameter key/values")
+    parser.add_argument(
+        "--nextclade_ref", default=None, required=True,
+        help="A JSON file containing the nextclade result")
+    parser.add_argument(
+        "--nextclade_denovo", default=None,
+        help="A JSON file containing the nextclade result")
     parser.add_argument(
         "--revision", default='unknown',
         help="git branch/tag of the executed workflow")
@@ -259,24 +267,67 @@ def main(args):
     section = report.add_section()
     section.markdown(""" ### Preamble
 
-    This workflow is intended to produce a __draft__ consensus Monkeypox virus
+    This workflow is intended to assemble a __draft__ consensus Monkeypox virus (MPXV)
     genome from Oxford Nanopore Technologies Sequencing data.
 
     The rough outline of this workflow:
 
     * Map reads to a reference genome
-    * Call variants, using medaka, with respect to this reference
-    * Create a consensus using these called variants
-    * Attempt assembly with flye polsihed with medaka
+    * Call variants with respect to this reference using Medaka
+    * Create a consensus sequence by applying these variants to the reference
+    * Attempt *de novo* assembly with Flye, polished with Medaka (optional)
 
-    Consensus is masked ('N') in regions of <20x coverage, deletions are
-    represented as "-" and insertions are in lower case. The consensus __will
-    require manual review__.""")
+    Assembled consensus sequences are masked ('N') in regions of <20x coverage,
+    deletions are represented as "-", and insertions are in lower case. Resulting
+    assemblies __will require manual review__.""")
 
     if os.path.exists(args.summaries[0]):
         report.add_section(
             section=fastcat.full_report(args.summaries))
 
+    with open(args.nextclade_ref, 'r') as file:
+        data = json.load(file)
+
+    section = report.add_section()
+    section.markdown("""
+### Nextclade results
+#### Reference-based assembly
+""")
+    if len(data['results']) > 0:
+        section.table(pd.DataFrame(
+            {
+                'clade': [data['results'][0]['clade']],
+                'coverage': [data['results'][0]['coverage']],
+                'overall_qc': [data['results'][0]['qc']['overallStatus']]}))
+    else:
+        section.markdown("""
+        Clade typing was unsuccessful.
+        """)
+
+    if len(data['errors']) > 0:
+        section.table(pd.DataFrame(data['errors']))
+
+    if args.nextclade_denovo:
+        with open(args.nextclade_denovo, 'r') as file:
+            data = json.load(file)
+
+        section.markdown("""
+        #### *De novo* assembly
+        """)
+        if len(data['results']) > 0:
+            section.table(pd.DataFrame(
+                {
+                    'clade': [data['results'][0]['clade']],
+                    'coverage': [data['results'][0]['coverage']],
+                    'overall_qc': [data['results'][0]['qc']['overallStatus']]}))
+        else:
+            section.markdown("""
+            Clade typing was unsuccessful.
+            """)
+
+        if len(data['errors']) > 0:
+            section.table(pd.DataFrame(data['errors']))
+
     variants = load_vcf(args.variants)
 
     make_coverage_section(args.coverage, variants, report)