Skip to content

Commit

Permalink
Merge branch 'CW-4776' into 'dev'
Browse files Browse the repository at this point in the history
Add nextclade [CW-4776]

Closes CW-4776

See merge request epi2melabs/workflows/wf-mpx!39
  • Loading branch information
bede committed Sep 4, 2024
2 parents 2b6c1db + d7a033d commit 05d0253
Show file tree
Hide file tree
Showing 23 changed files with 10,074 additions and 78 deletions.
7 changes: 1 addition & 6 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@ variables:
# (not things such as -profile)
NF_WORKFLOW_OPTS: >
--fastq test_data/fastq/barcode01
--override_basecaller_cfg [email protected]
-executor.\$$local.memory 32GB
CI_FLAVOUR: "new"
NF_IGNORE_PROCESSES: "noAssembly"
Expand Down Expand Up @@ -45,19 +44,15 @@ docker-run:
NF_WORKFLOW_OPTS: >
-executor.\$$local.memory 32GB
--fastq test_data/fastq/barcode01
--override_basecaller_cfg [email protected]
--assembly false
NF_IGNORE_PROCESSES: flyeAssembly,medaka_polish,bamtobed
NF_IGNORE_PROCESSES: "flyeAssembly,medaka_polish,bamtobed,nextclade_full"
- if: $MATRIX_NAME == "bam"
variables:
NF_WORKFLOW_OPTS: >
-executor.\$$local.memory 32GB
--bam test_data/bam
--override_basecaller_cfg [email protected]
- if: $MATRIX_NAME == "ubam"
variables:
# the uBAM test data contains a basecall model (no
# `--override_basecaller_cfg` needed)
NF_WORKFLOW_OPTS: >
-executor.\$$local.memory 32GB
--bam test_data/ubam
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ repos:
always_run: true
pass_filenames: false
additional_dependencies:
- epi2melabs==0.0.55
- epi2melabs==0.0.56
- id: build_models
name: build_models
entry: datamodel-codegen --strict-nullable --base-class workflow_glue.results_schema_helpers.BaseModel --use-schema-description --disable-timestamp --input results_schema.yml --input-file-type openapi --output bin/workflow_glue/results_schema.py
Expand Down
8 changes: 8 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,14 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [v1.1.1]
### Added
- Nextclade typing for both reference-based and optional _de novo_ consensus assemblies.
- Clade I reference sequence (`NC_003310.1`) is now available for selection.
### Changed
- Default reference sequence changed to `NC_003310.1` (Clade I) following mpox PHEIC announcement.

## [v1.1.0]
### Added
- Support for uBAM (unaligned BAM) input files.
Expand Down
14 changes: 8 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Mpox metagenomics assembly workflow
# MPXV metagenomic assembly workflow

A basic workflow for analysing ONT Mpox data to create draft consensus or de-novo assemblies.
A workflow for analysing ONT monkeypox virus (MPXV) sequences to create draft consensus or de novo assemblies.



Expand Down Expand Up @@ -75,6 +75,7 @@ the following command:
```
nextflow pull epi2me-labs/wf-mpx
```

A demo dataset is provided for testing of the workflow.
It can be downloaded and unpacked using the following commands:
```
Expand All @@ -87,6 +88,7 @@ nextflow run epi2me-labs/wf-mpx \
--fastq 'wf-mpx-demo/fastq/barcode01' \
-profile standard
```

For further information about running a workflow on
the command line see https://labs.epi2me.io/wfquickstart/

Expand Down Expand Up @@ -126,7 +128,7 @@ input_reads.fastq.gz -── input_directory
|--------------------------|------|-------------|------|---------|
| bam | string | BAM file to use in the analysis. | Path to a single BAM file, or a single directory containing one or more BAM files. Multiple samples are not currently supported. | |
| fastq | string | FASTQ file to use in the analysis. | Path to a single FASTQ file, or a single directory containing one or more FASTQ files. Multiple samples or barcodes are not currently supported. | |
| reference | string | The reference genome to use for mapping. | This is used if inputting a FASTQ file. We provide four popular mpox reference sequences with which to map your reads to. More information can be found here: https://labs.epi2me.io/basic-mpox-workflow | MT903344.1 |
| reference | string | The reference genome to use for mapping. | This is used if inputting a FASTQ file. We provide five MPXV reference sequences for alignment, representing clades I and II. More information can be found in our [blog post introducing this workflow](https://labs.epi2me.io/basic-mpox-workflow) | NC_003310.1 |


### Sample Options
Expand All @@ -148,7 +150,7 @@ input_reads.fastq.gz -── input_directory
| Nextflow parameter name | Type | Description | Help | Default |
|--------------------------|------|-------------|------|---------|
| override_basecaller_cfg | string | Override auto-detected basecaller model that processed the signal data; used to select an appropriate Medaka model. | Per default, the workflow tries to determine the basecall model from the input data. This parameter can be used to override the detected value (or to provide a model name if none was found in the inputs). However, users should only do this if they know for certain which model was used as selecting the wrong option might give sub-optimal results. A list of recent models can be found here: https://github.com/nanoporetech/dorado#DNA-models. | |
| assembly | boolean | Perform assembly with flye. | Take the reads mapped to the mpx reference and perform a de novo assembly using flye. This might be useful to uncover any structural variation not present in your chosen reference genome. Turning off will reduce run times and computational intensity at the expense of not being able to identify structural variants. | True |
| assembly | boolean | Perform assembly with flye. | Take the reads mapped to the mpx reference and perform a de novo assembly using Flye. This might be useful to uncover any structural variation not present in your chosen reference genome. Turning off will reduce run times and computational intensity at the expense of not being able to identify structural variants. | True |
| min_coverage | number | Coverage threshold for masking in the consensus step. | Regions with less than the coverage entered here are masked (nucleotide sequences will be replaced with N) when generating the consensus. It might be useful to decrease this value if you have very low coverage, but this will affect the quality of your consensus sequence. | 20 |


Expand All @@ -163,8 +165,8 @@ Output files may be aggregated including information for all samples or provided
| Title | File path | Description | Per sample or aggregated |
|-------|-----------|-------------|--------------------------|
| Workflow report | ./wf-mpx-report.html | The report for the workflow | aggregated |
| Consensus assembly FASTA | ./consensus.fasta | De-novo consensus assembly sequence from flye and polished by medaka. | per-sample |
| Draft consensus FASTA | ./{{ alias }}.draft.consensus.fasta | Draft consensus sequence from bcftools. | per-sample |
| De novo consensus assembly FASTA | ./denovo.consensus.fasta | De novo consensus assembly sequence from Flye and polished by Medaka. | per-sample |
| Reference-based consensus assembly FASTA | ./{{ alias }}.ref.consensus.fasta | Reference-based consensus sequence from Bcftools. | per-sample |
| Read stats | ./{{ alias }}.per-read-stats.tsv.gz | A simple text file providing a summary of sequencing reads. | per-sample |
| Read alignment | ./{{ alias }}.bam | Read alignments in BAM format. | per-sample |
| Alignment index file | ./{{ alias }}.bam.bai | Index file of BAM file. | per-sample |
Expand Down
2 changes: 1 addition & 1 deletion bin/workflow_glue/check_bam_headers_in_dir.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,6 @@ def main(args):

def argparser():
"""Argument parser for entrypoint."""
parser = wf_parser("check_bam_headers")
parser = wf_parser("check_bam_headers_in_dir")
parser.add_argument("input_path", type=Path, help="Path to target directory")
return parser
2 changes: 1 addition & 1 deletion bin/workflow_glue/check_xam_index.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,6 @@ def main(args):

def argparser():
"""Argument parser for entrypoint."""
parser = wf_parser("check_bam_headers")
parser = wf_parser("check_xam_index")
parser.add_argument("input_xam", type=Path, help="Path to target XAM")
return parser
2 changes: 1 addition & 1 deletion bin/workflow_glue/get_max_depth_locus.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ def main(args):

def argparser():
"""Argument parser for entrypoint."""
parser = wf_parser("check_bam_headers")
parser = wf_parser("get_max_depth_locus")
parser.add_argument(
"depths_bed",
type=Path,
Expand Down
93 changes: 72 additions & 21 deletions bin/workflow_glue/report.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env python
"""Create workflow report."""

import json
import math
import os

Expand Down Expand Up @@ -80,9 +80,9 @@ def make_coverage_section(coverage_file, variants_data, report_doc):
section = report_doc.add_section()
section._add_item("""<h3>Genome coverage</h3>
<p>The plot below shows coverage of the whole genome. Dots represent
single nucleotide polymorphisms called by medaka. Bars represent
insertions and deletions called by medaka.</p>""")
<p>The plot below shows read depth across the whole genome, annotated with variants
in the reference-based assembly called by Medaka. Dots represent single nucleotide
polymorphisms, while bars represent insertions and deletions.</p>""")

max_coverage = max(coverage['depth'])

Expand Down Expand Up @@ -137,12 +137,12 @@ def get_context(variant, fasta):
def make_variants_context(variants_data, reference, report):
"""Make variants context section."""
section = report.add_section()
section._add_item("""<h3>Variant context</h3>
section._add_item("""<h3>Mutational spectrum</h3>
<p>APOBEC3 host enzyme action has been implicated in the mutational process
of MPX. As described in <a href="https://tinyurl.com/2fawchvu">this</a>
<a href="https://virological.org">virological.org</a> post by Áine O’Toole
& Andrew Rambaut.<p>""")
<p>Evidence of APOBEC3 host enzyme editing has been observed in Monkeypox virus
(MPXV), as described in <a href="https://tinyurl.com/2fawchvu">this virological.org
post</a> by Áine O’Toole & Andrew Rambaut. The mutational spectrum of the
reference-based assembly is shown below.</p>""")

variants_data['context'] = [
get_context(x, reference) for x in zip(
Expand All @@ -168,8 +168,8 @@ def make_variants_table(variants_data, report):
section = report.add_section()
section._add_item("""<h3>All variants</h3>
<p>Variants with respect to the reference sequence as called by medaka
.</p>""")
<p>The following table shows variants in the reference-based assembly called
by Medaka with respect to the chosen reference sequence.</p>""")
section.table(
variants_data,
index=False,
Expand All @@ -182,10 +182,10 @@ def make_variants_table(variants_data, report):
def make_assembly_summary(bed, report):
"""Make a plot for the assembly."""
section = report.add_section()
section._add_item("""<h3>Flye Assembly</h3>
section._add_item("""<h3>De novo assembly</h3>
<p>Quick assembly of the reads. Below the contigs are displayed as mapped
to the reference chosen for analysis.</p>""")
<p>The plot below shows an alignment of <em>de novo</em> assembly contigs to the
chosen reference sequence.</p>""")

contigs = pd.read_csv(bed, sep='\t', header=0).set_axis(
['reference', 'start', 'end', 'name', 'qual', 'strand'],
Expand Down Expand Up @@ -215,6 +215,8 @@ def make_assembly_summary(bed, report):
text_color=ont_colors.BRAND_BLUE)
p.add_layout(labels)

p.xaxis.formatter = BasicTickFormatter(use_scientific=False)

section.plot(p)


Expand All @@ -241,6 +243,12 @@ def argparser():
parser.add_argument(
"--params", default=None, required=True,
help="A JSON file containing the workflow parameter key/values")
parser.add_argument(
"--nextclade_ref", default=None, required=True,
help="A JSON file containing the nextclade result")
parser.add_argument(
"--nextclade_denovo", default=None,
help="A JSON file containing the nextclade result")
parser.add_argument(
"--revision", default='unknown',
help="git branch/tag of the executed workflow")
Expand All @@ -259,24 +267,67 @@ def main(args):
section = report.add_section()
section.markdown(""" ### Preamble
This workflow is intended to produce a __draft__ consensus Monkeypox virus
This workflow is intended to assemble a __draft__ consensus Monkeypox virus (MPXV)
genome from Oxford Nanopore Technologies Sequencing data.
The rough outline of this workflow:
* Map reads to a reference genome
* Call variants, using medaka, with respect to this reference
* Create a consensus using these called variants
* Attempt assembly with flye polsihed with medaka
* Call variants with respect to this reference using Medaka
* Create a consensus sequence by applying these variants to the reference
* Attempt *de novo* assembly with Flye, polished with Medaka (optional)
Consensus is masked ('N') in regions of <20x coverage, deletions are
represented as "-" and insertions are in lower case. The consensus __will
require manual review__.""")
Assembled consensus sequences are masked ('N') in regions of <20x coverage,
deletions are represented as "-", and insertions are in lower case. Resulting
assemblies __will require manual review__.""")

if os.path.exists(args.summaries[0]):
report.add_section(
section=fastcat.full_report(args.summaries))

with open(args.nextclade_ref, 'r') as file:
data = json.load(file)

section = report.add_section()
section.markdown("""
### Nextclade results
#### Reference-based assembly
""")
if len(data['results']) > 0:
section.table(pd.DataFrame(
{
'clade': [data['results'][0]['clade']],
'coverage': [data['results'][0]['coverage']],
'overall_qc': [data['results'][0]['qc']['overallStatus']]}))
else:
section.markdown("""
Clade typing was unsuccessful.
""")

if len(data['errors']) > 0:
section.table(pd.DataFrame(data['errors']))

if args.nextclade_denovo:
with open(args.nextclade_denovo, 'r') as file:
data = json.load(file)

section.markdown("""
#### *De novo* assembly
""")
if len(data['results']) > 0:
section.table(pd.DataFrame(
{
'clade': [data['results'][0]['clade']],
'coverage': [data['results'][0]['coverage']],
'overall_qc': [data['results'][0]['qc']['overallStatus']]}))
else:
section.markdown("""
Clade typing was unsuccessful.
""")

if len(data['errors']) > 0:
section.table(pd.DataFrame(data['errors']))

variants = load_vcf(args.variants)

make_coverage_section(args.coverage, variants, report)
Expand Down
Loading

0 comments on commit 05d0253

Please sign in to comment.