Merge branch 'dev' into main
mattheww95 committed Oct 13, 2023
2 parents 9a26a34 + e95a9c5 commit 19e6c8c
Showing 12 changed files with 86 additions and 49 deletions.
11 changes: 10 additions & 1 deletion README.md
@@ -31,7 +31,16 @@ Containers are not perfect, below is a list of some issues you may face using co
- Exit code 137 likely means your Docker container used too much memory.

## Dependencies
Besides the Nextflow run time (which requires Java) and a container engine, the dependencies required by mikrokondo are fairly minimal, requiring only Python 3.10 (more recent Python versions will work as well). Currently mikrokondo has been tested fully with Singularity (partially with Apptainer: all containers work, but not all workflow paths have been tested) and has been partially tested with Docker (not all workflow paths tested). **Dependencies can be installed with Conda (e.g. Nextflow and Python)**. To download the pipeline run:
`git clone https://github.com/phac-nml/mikrokondo.git`

### Dependencies listed

- Python (>= 3.10)
- Nextflow (>= 22.10.1)
- Container service (Docker, Singularity, Apptainer have been tested)
- The source code: `git clone https://github.com/phac-nml/mikrokondo.git`
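
As a minimal sketch of one way to set this up (assuming Conda is available and the `conda-forge`/`bioconda` channels are reachable; the environment name is arbitrary):

```bash
# Create an environment providing the Nextflow runtime and Python (illustrative versions)
conda create -n mikrokondo -c conda-forge -c bioconda "python>=3.10" "nextflow>=22.10.1"
conda activate mikrokondo

# Fetch the pipeline source code
git clone https://github.com/phac-nml/mikrokondo.git
cd mikrokondo
```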



## Resources to download
7 changes: 4 additions & 3 deletions docs/subworkflows/annotate_genomes.md
@@ -3,10 +3,11 @@
## subworkflows/local/annotate_genomes

## Steps
1. **Genome annotation** is performed using [Bakta](https://github.com/oschwengers/bakta) (bakta_annotate.nf). You must download a Bakta database and add its path to the `nextflow.config` file, or pass the path as a command line option. To skip running Bakta, add `--skip_bakta true` to your command line options (an example command is sketched after the note below).
2. **Screening for antimicrobial resistance** with [Abricate](https://github.com/tseemann/abricate). Abricate is run with its default options and default database; however, you can specify a database by updating the `args` for Abricate in the `nextflow.config`. You can also skip running Abricate by adding `--skip_abricate true` to your command line options.

>NOTE:
>A custom database for Bakta can be downloaded via the command line using `bakta_download_db.nf`.
>The `bakta_db` setting can be changed in the `nextflow.config` file; see 'Changing Pipeline settings'. <!-- need to link that page here, also check the name of that setting -->
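
For illustration only, the options above might be combined as in the sketch below. The launch command, `--input`, `--outdir`, the profile name, and the `--bakta_db` parameter name are assumptions based on the `bakta_db` setting mentioned in the note, not confirmed by this page:

```bash
# Hypothetical: point the pipeline at a downloaded Bakta database
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results \
  --bakta_db /path/to/bakta/db

# Hypothetical: skip annotation and AMR screening (flags documented above)
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results \
  --skip_bakta true \
  --skip_abricate true
```
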
## Input
12 changes: 6 additions & 6 deletions docs/subworkflows/assemble_reads.md
@@ -4,16 +4,16 @@

## Steps

1. **Assembly** proceeds differently depending on whether the reads are short paired-end or long reads. **If the samples are marked as metagenomic, then metagenomic assembly flags will be added** to the corresponding assembler.
- **Paired end assembly** is performed using [Spades](https://github.com/ablab/spades) (spades_assemble.nf)
- **Long read assembly** is performed using [Flye](https://github.com/fenderglass/Flye) (flye_assemble.nf)


2. **Bandage plots** are generated using [Bandage](https://rrwick.github.io/Bandage/). These may not be useful for every user, but they can be informative of assembly quality in some situations (bandage_image.nf).

>NOTE:
>Hybrid assembly of long and short reads uses a different workflow that can be found [here](hybrid_assembly.md)
3. **Polishing** (OPTIONAL) can be performed on either short or long/hybrid assemblies. [Minimap2](https://github.com/lh3/minimap2) is used to create a contig index (minimap2_index.nf) and then maps reads to that index (minimap2_map.nf). Lastly, [Racon](https://github.com/isovic/racon) uses this output to perform contig polishing (racon_polish.nf). To turn off polishing, add `--skip_polishing` to your command line parameters.
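
As a sketch (only `--skip_polishing` is documented above; the rest of the command is assumed):

```bash
# Run assembly without the optional Racon polishing step
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results \
  --skip_polishing
```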

## Input
- cleaned reads and metadata
4 changes: 2 additions & 2 deletions docs/subworkflows/bin_contigs.md
@@ -3,10 +3,10 @@
## subworkflows/local/split_metagenomic.nf
## Steps

1. **Kraken2** is run to generate output reports and separate classified contigs from unclassified.
2. **A Python script** is run that separates each classified group of contigs into separate files at a specified taxonomic level (the default level is genus). Quite a few outputs can be generated by this process, as each file ID is updated to be labelled as {Sample Name}_{Genus}.

## Input
- contigs, reads and metadata
## Outputs
- metadata, binned contigs
26 changes: 12 additions & 14 deletions docs/subworkflows/clean_reads.md
@@ -4,27 +4,25 @@

## Steps
1. **Reads are decontaminated** using **minimap2** against a 'sequencing off-target' index. This index contains:
- Reads associated with Humans (de-hosting)
- Known sequencing controls (phiX)
- A **new index can be swapped in, or created** (see minimap2_index subworkflow). <!-- ADD LINK TO THIS SUBWORKFLOW -->
2. **FastQC** is run on reads to create summary outputs; **FastQC may not be retained** in later versions of MikroKondo.
3. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp)
- Currently no adapters are specified within FastP when it is run and auto-detection is used.
- FastP parameters can be altered within the nextflow.config file. <!-- ADD LINK TO CHANGING PARAMETERS PAGE -->
    - Long read data is also run through FastP to gather summary data; however, long read (un-paired read) trimming is not performed and only summary metrics are generated. **Chopper** is currently integrated in MikroKondo but has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request.
4. **Genome size estimation** is performed using [Kat](https://github.com/TGAC/KAT); k-mer spectra are also generated.
5. **Read downsampling** (OPTIONAL): if toggled on, an estimated depth threshold can be specified to downsample large read sets. This step can be used to improve genome assembly quality, and is something that can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill).
    - Depth is estimated using the estimated genome size output from [Kat](https://github.com/TGAC/KAT)
    - Total base pairs are taken from [FastP](https://github.com/OpenGene/fastp)
    - Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk)
6. **Metagenomic assessment** using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module, the workflow will assess how many bacterial taxa are present in a sample (default of X percent to positively identify a taxon <!-- WHAT IS THE THRESHOLD OF DECIDING A SAMPLE HAS MORE THAN ONE TAXA?-->; to change this setting, see 'changing parameters') <!-- ADD LINK TO CHANGING PARAMETERS PAGE -->. When more than one taxon is present, the metagenomic tag is set, which has further implications downstream in the 'Post Assembly' workflow. <!-- ADD LINK TO THIS WORKFLOW -->
7. **Nanopore ID screening**: duplicate Nanopore read IDs have been known to cause issues downstream in the pipeline. To bypass this issue, an option can be toggled where a script will read in Nanopore reads and append a unique ID to the header.
4. **Genome size estimation** is performed using [Mash](https://github.com/marbl/Mash); a sketch of the reads is generated and an estimated genome size is output.
5. **Read downsampling** (OPTIONAL): if toggled on, an estimated depth threshold can be specified to downsample large read sets. This step can be used to improve genome assembly quality, and is something that can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable downsampling, add `--skip_depth_sampling true` to your command line (see the example command after this list).
    - Depth is estimated using the estimated genome size output from [Mash](https://github.com/marbl/Mash)
    - Total base pairs are taken from [FastP](https://github.com/OpenGene/fastp)
    - Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk)
6. **Metagenomic assessment** using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module, the workflow will assess how many bacterial genera are present in a sample (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present) with greater than 90% identity (according to Mash). When more than one genus is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 will be run on the metagenomic assemblies later on, and contigs will be binned at a defined taxonomic level (the default is genus level).
7. **Nanopore ID screening**: duplicate Nanopore read IDs have been known to cause issues downstream in the pipeline. To bypass this issue, an option can be toggled where a script will read in Nanopore reads and append a unique ID to each header. This process can be slow, so it can easily be skipped by adding the `--skip_ont_header_cleaning true` option on the command line.
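
A hedged example combining the skip flags described in steps 5 and 7 above (the launch command, `--input` and `--outdir` are assumed placeholders):

```bash
# Disable depth-based read downsampling and Nanopore header cleaning
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results \
  --skip_depth_sampling true \
  --skip_ont_header_cleaning true
```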

## Input
- reads and metadata

## Outputs
- quality trimmed and deconned reads
- estimated genome size
- estimated heterozygosity
- software versions
2 changes: 1 addition & 1 deletion docs/subworkflows/determine_species.md
@@ -3,7 +3,7 @@
## subworkflows/local/determine_species

## Steps
1. **Taxonomic classification** is completed using [Mash](https://github.com/marbl/Mash) (DEFAULT, mash_screen.nf) or [Kraken2](https://github.com/DerrickWood/kraken2) (OPTIONAL, or when samples are flagged as metagenomic; kraken.nf). Species classification and subsequent subtyping can be skipped by passing `--skip_species_classification true` on the command line. To select Kraken2 for speciation rather than Mash, add `--run_kraken true` to your command line arguments (see the example after the note below).

>NOTE:
>If species specific subtyping tools are to be executed by the pipeline, **Mash must be the chosen classifier**
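
For example (only the two flags documented above are taken from this page; everything else in the commands is assumed):

```bash
# Use Kraken2 for speciation instead of Mash (note: species-specific subtyping requires Mash)
nextflow run main.nf -profile singularity --input samplesheet.csv --outdir results --run_kraken true

# Skip species classification (and therefore subsequent subtyping) entirely
nextflow run main.nf -profile singularity --input samplesheet.csv --outdir results --skip_species_classification true
```
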
4 changes: 2 additions & 2 deletions docs/subworkflows/input_check.md
@@ -4,8 +4,8 @@


## Steps
1. Intake the sample sheet CSV and group samples with the same ID. Sample metadata specific to the pipeline is added. A metadata field will additionally be created for each sample containing the read data and sample information, such as the sample's name and whether the sample contains paired reads (Illumina) or long reads (Nanopore or Pacbio). Verification of workflows with Pacbio reads still needs to be performed as of 2023-07-19.
2. If there are samples that contain duplicate IDs, the samples will be combined.
1. Intake the sample sheet CSV and group samples with the same ID (an illustrative sample sheet is sketched below). Sample metadata specific to the pipeline is added. A metadata field will additionally be created for each sample containing the read data and sample information, such as the sample's name and whether the sample contains paired reads (Illumina) or long reads (Nanopore or Pacbio).
2. If there are samples that contain duplicate IDs, the **samples will be combined**.
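
A minimal sketch of what such a sample sheet could look like; the column names used here (`sample`, `fastq_1`, `fastq_2`, `long_reads`) are hypothetical placeholders and should be checked against the pipeline's sample sheet documentation:

```bash
# Hypothetical sample sheet layout -- column names are NOT confirmed by this page
cat > samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2,long_reads
sampleA,sampleA_run1_R1.fastq.gz,sampleA_run1_R2.fastq.gz,
sampleA,sampleA_run2_R1.fastq.gz,sampleA_run2_R2.fastq.gz,
sampleB,,,sampleB_nanopore.fastq.gz
EOF
# Rows sharing an ID (sampleA above) would be combined, as described in step 2
```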


## Input
5 changes: 4 additions & 1 deletion docs/subworkflows/qc_assembly.md
@@ -3,8 +3,11 @@
## subworkflows/local/qc_assembly

## Steps
1. **Generate assembly quality metrics** using **QUAST**. QUAST is used to generate summary assembly metrics such as N50 value, number of contigs, average depth of coverage and genome size.
2. **Assembly filtering**: a script implemented in the Nextflow DSL (Groovy) then filters out assemblies that do not meet quality thresholds, so that only assemblies meeting the given set of criteria are used in downstream processing.
3. **Contamination detection** using CheckM. CheckM is run to report a percent contamination score and build up evidence for signs of contamination in a sample. CheckM can be skipped by adding `--skip_checkm` to your command-line options, as the data it generates may not be needed and it can have a long run time.
4. **Classic seven gene MLST** using **mlst**. [mlst](https://github.com/tseemann/mlst) is run and its outputs are included in the final report. This step can be skipped by adding `--skip_mlst` to the command line options.
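
For instance (only `--skip_checkm` and `--skip_mlst` come from this page; the launch command itself is an assumed sketch):

```bash
# Skip CheckM (long run time) and classic seven-gene MLST during assembly QC
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results \
  --skip_checkm \
  --skip_mlst
```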


## Input
- cleaned reads and metadata
3 changes: 1 addition & 2 deletions docs/subworkflows/subtype_genome.md
@@ -3,8 +3,7 @@
## subworkflows/local/subtype_genome

## Steps
1. **Parsing of Mash report** is done to determine the species present in the sample.
2. **Species specific subtyping** tools are launched, requiring the pipeline's output **Mash** screen report. Currently, subtyping tools for *E. coli*, *Salmonella*, *Listeria spp.* and *Shigella spp.* are supported.
1. **Species specific subtyping** tools are launched, requiring the pipeline's output **Mash** screen report. Currently, subtyping tools for *E. coli*, *Salmonella*, *Listeria spp.*, *Staphylococcus spp.*, *Klebsiella spp.* and *Shigella spp.* are supported. Subtyping can be disabled by passing `--skip_subtyping true` on the command line.
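
As a sketch (only `--skip_subtyping true` is documented here; the remainder of the command is assumed):

```bash
# Disable all species-specific subtyping
nextflow run main.nf -profile singularity --input samplesheet.csv --outdir results --skip_subtyping true
```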

## Note of importance
If a sample cannot be subtyped, it merely passes through the pipeline and is not typed. A log message will be displayed notifying the user that the sample could not be typed.
12 changes: 11 additions & 1 deletion docs/usage/installation.md
@@ -14,7 +14,17 @@ Containers are not perfect, below is a list of some issues you may face using co
- Exit code 137 likely means your Docker container used too much memory.

## Dependencies
Besides the Nextflow run time (which requires Java) and a container engine, the dependencies required by mikrokondo are fairly minimal, requiring only Python 3.10 (more recent Python versions will work as well). Currently mikrokondo has been tested fully with Singularity (partially with Apptainer: all containers work, but not all workflow paths have been tested) and has been partially tested with Docker (not all workflow paths tested). **Dependencies can be installed with Conda (e.g. Nextflow and Python)**. To download the pipeline run:

`git clone https://github.com/phac-nml/mikrokondo.git`

### Dependencies listed

- Python (>= 3.10)
- Nextflow (>= 22.10.1)
- Container service (Docker, Singularity, Apptainer have been tested)
- The source code: `git clone https://github.com/phac-nml/mikrokondo.git`
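
After cloning, a first run might look like the sketch below; the entry script, profile name and `--input`/`--outdir` parameters are assumptions to be checked against the usage documentation:

```bash
cd mikrokondo
# Launch with one of the tested container engines (Singularity, Docker or Apptainer)
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results
```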



## Resources to download
26 changes: 17 additions & 9 deletions docs/workflows/CleanAssemble.md
@@ -1,16 +1,24 @@
# Clean Assemble
## workflows/local/CleanAssemble

## Included sub-workflows

- `input_check.nf`
- `clean_reads.nf`
- `assemble_reads.nf`
- `hybrid_assembly.nf`
- `polish_assemblies.nf`

## Steps <!-- I need to add in the links to the workflow pages once they exist -->
1. **QC reads**: the subworkflow steps are listed in brief below; for further information see (clean_reads.nf)
- Reads are checked for known sequencing contamination
- Quality metrics are calculated
- Reads are trimmed
- Coverage is estimated
- Sample is subsampled to set level (OPTIONAL)
- Read set is assessed to be either an isolate or metagenomic sample (from presence of multiple taxa)

2. **Assemble reads**: using the '<SOMETHING>' flag, read sets will be diverted to either the assemble_reads (short reads) or hybrid_assembly (short and/or long reads) workflow. Though the data is handled differently in each subworkflow, both generate a contigs file and a Bandage image, and both have an option of initial polishing via Racon. See the (assemble_reads.nf) and (hybrid_assembly.nf) subworkflow pages for more details. <!-- ADD IN LINKS TO PAGES -->

3. **Polish assemblies** (OPTIONAL): polishing of contigs can be added (polish_assemblies.nf). To make changes to the default workflow, see the 'optional flags' page. <!-- ADD IN LINK TO PAGE -->
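
A hedged sketch of launching this workflow; apart from `--skip_polishing` (documented in step 3), the parameter and file names are assumptions. `-c` and `-resume` are standard Nextflow options for supplying a custom configuration and resuming a cached run:

```bash
# Run read cleaning and assembly without polishing; override defaults via a custom config and resume a cached run
nextflow run main.nf -profile singularity \
  --input samplesheet.csv \
  --outdir results \
  --skip_polishing \
  -c my_settings.config \
  -resume
```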

@@ -19,12 +27,12 @@
+ Short read - Illumina
+ Long read:
* Nanopore
* :warning:Pacbio (untested)
* Pacbio

## Output
- quality trimmed and deconned reads (fastq)
- estimated genome size
- estimated heterozygosity
- assembled contigs (fasta)
- bandage image (png)
- software versions
