Merge pull request nf-core#157 from ekushele/master

Output demultiplexed fast5 files
qbic-projects · Jan 24, 2022 · 9bcf02f
2 parents 72e4588 + 105669a
commit 9bcf02f
Showing 9 changed files with 204 additions and 116 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,13 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.0.1] - ?
+
+### Major enhancements
+
+* Add `demux_fast5` module to output demultiplexed fast5 files when `--output_demultiplex_fast5` is set
+* Add `--trim_barcodes` in Guppy basecaller to trim the barcodes from the output fastq
+
 ## [3.0.0] - ?
 
 ### Major enhancements

diff --git a/README.md b/README.md
@@ -32,7 +32,7 @@ On release, automated continuous integration tests run the pipeline on a [full-s
 
 ## Pipeline Summary
 
-1. Basecalling and/or demultiplexing ([`Guppy`](https://nanoporetech.com/nanopore-sequencing-data-analysis) or [`qcat`](https://github.com/nanoporetech/qcat); *optional*)
+1. Basecalling and/or demultiplexing ([`Guppy`](https://nanoporetech.com/nanopore-sequencing-data-analysis), [`demux_fast5`](https://github.com/nanoporetech/ont_fast5_api#demux_fast5) or [`qcat`](https://github.com/nanoporetech/qcat); *optional*)
 2. Sequencing QC ([`pycoQC`](https://github.com/a-slide/pycoQC), [`NanoPlot`](https://github.com/wdecoster/NanoPlot))
 3. Raw read DNA cleaning ([NanoLyse](https://github.com/wdecoster/nanolyse); *optional*)
 4. Raw read QC ([`NanoPlot`](https://github.com/wdecoster/NanoPlot), [`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
@@ -96,7 +96,7 @@ An example input samplesheet for performing both basecalling and demultiplexing
 
 nf-core/nanoseq was originally written by [Chelsea Sawyer](https://github.com/csawye01) and [Harshil Patel](https://github.com/drpatelh) from [The Bioinformatics & Biostatistics Group](https://www.crick.ac.uk/research/science-technology-platforms/bioinformatics-and-biostatistics/) for use at [The Francis Crick Institute](https://www.crick.ac.uk/), London. Other primary contributors include [Laura Wratten](https://github.com/lwratten), [Ying Chen](https://github.com/cying111), [Yuk Kei Wan](https://github.com/yuukiiwa) and [Jonathan Goeke](https://github.com/jonathangoeke) from the [Genome Institute of Singapore](https://www.a-star.edu.sg/gis), [Christopher Hakkaart](https://github.com/christopher-hakkaart) from [Institute of Medical Genetics and Applied Genomics](https://www.medizin.uni-tuebingen.de/de/das-klinikum/einrichtungen/institute/medizinische-genetik-und-angewandte-genomik), Germany, [Johannes Alneberg](https://github.com/alneberg) and [Franziska Bonath](https://github.com/FranBonath) from [SciLifeLab](https://www.scilifelab.se/), Sweden.
 
-Many thanks to others who have helped out along the way too, including (but not limited to): [@crickbabs](https://github.com/crickbabs), [@AnnaSyme](https://github.com/AnnaSyme).
+Many thanks to others who have helped out along the way too, including (but not limited to): [@crickbabs](https://github.com/crickbabs), [@AnnaSyme](https://github.com/AnnaSyme), [@ekushele](https://github.com/ekushele).
 
 ## Contributions and Support
 

diff --git a/conf/modules.config b/conf/modules.config
@@ -60,6 +60,19 @@ if (!params.skip_basecalling) {
     }
 }
 
+if (!params.skip_basecalling && params.output_demultiplex_fast5) {
+    process {
+        withName: DEMUX_FAST5 {
+            publishDir = [
+                path: { "${params.outdir}/demux_fast5" },
+                mode: 'copy',
+                enabled: true,
+                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+            ]
+        }
+    }
+}
+
 if (params.skip_basecalling && !params.skip_demultiplexing) {
     process {
         withName: QCAT {
@@ -605,4 +618,4 @@ if (!params.skip_multiqc) {
             ]
         }
     }
-}
+}
diff --git a/docs/output.md b/docs/output.md
@@ -31,6 +31,10 @@ The directories listed below will be created in the output directory after the p
     Sequencing telemetry file generated by *Guppy*.
 * `guppy/basecalling/guppy_basecaller_log-<date>.log`
     Log file for *Guppy* execution.
+* `demux_fast5/demultiplexed_fast5/<barcode*>/`
+    fast5 output files for each barcode.
+* `demux_fast5/demultiplexed_fast5/unclassified/`
+    fast5 files with reads that were not assigned to any barcode.
 * `qcat/fastq/<barcode*>.fastq.gz`
     fastq output files for each barcode.
 * `qcat/fastq/none.fastq.gz`
@@ -39,13 +43,15 @@ The directories listed below will be created in the output directory after the p
 </details>
 
 *Documentation*:
-[Guppy](https://nanoporetech.com/nanopore-sequencing-data-analysis), [qcat](https://github.com/nanoporetech/qcat)
+[Guppy](https://nanoporetech.com/nanopore-sequencing-data-analysis), [demux_fast5](https://github.com/nanoporetech/ont_fast5_api#demux_fast5), [qcat](https://github.com/nanoporetech/qcat)
 
 *Description*:
 The pipeline has been written to deal with the various scenarios where you would like to include/exclude the basecalling and demultiplexing steps. This will be dependent on what type of input data you would like to provide the pipeline. Additionally, if you would like to align your samples to a reference genome there are various options for providing this information. Please see [`usage.md`](usage.md#--input) for more details about the format of the input samplesheet, associated commands and how to provide reference genome data.
 
 *Guppy* will be used to basecall and demultiplex the data. Various options have been provided to customise specific parameters and to be able to run *Guppy* on GPUs.
 
+*demux_fast5* will demultiplex the fast5 files, given the *Guppy* summary file, when `--output_demultiplex_fast5` is set.
+
 If you have a pre-basecalled fastq file then *qcat* will be used to perform the demultiplexing if you provide the `--skip_basecalling` parameter. If you would like to skip both of these steps entirely then you can provide `--skip_basecalling --skip_demultiplexing` when running the pipeline. As a result, the structure of the output folder will depend on which steps you have chosen to run in the pipeline.
 
 ## Removal of DNA contaminants

diff --git a/modules/local/demux_fast5.nf b/modules/local/demux_fast5.nf
@@ -0,0 +1,35 @@
+process DEMUX_FAST5 {
+	label 'process_medium'
+	publishDir "${params.outdir}",
+		mode: params.publish_dir_mode,
+		saveAs: { filename -> saveFiles(filename:filename, options:params.options, publish_dir:getSoftwareName(task.process)) }
+
+
+	conda     (params.enable_conda ? "bioconda::ont-fast5-api=4.0.0" : null)
+	if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
+		container "https://depot.galaxyproject.org/singularity/ont-fast5-api:4.0.0--pyhdfd78af_0" 
+	} else {
+		container "quay.io/biocontainers/ont-fast5-api:4.0.0--pyhdfd78af_0"
+	}
+
+	input:
+	path(input_path), stageAs: 'input_path/*'
+	tuple val(meta), path(input_summary)
+
+	output:
+	path "demultiplexed_fast5/*"   , emit: fast5
+	path "versions.yml"            , emit: versions
+
+	script:
+	"""
+	demux_fast5 \\
+	--input  input_path \\
+	--save_path ./demultiplexed_fast5 \\
+	--summary_file $input_summary
+	
+	cat <<-END_VERSIONS > versions.yml
+	"${task.process}":
+	    demux_fast5: \$(echo \$(python -c\'import ont_fast5_api;print(ont_fast5_api.__version__)\'))
+	END_VERSIONS
+	"""
+}
diff --git a/modules/local/guppy.nf b/modules/local/guppy.nf
@@ -1,62 +1,65 @@
 process GUPPY {
-    label 'process_medium'
-
-    if (params.guppy_gpu) {
-        container = 'genomicpariscentre/guppy-gpu:4.0.14'
-        clusterOptions = params.gpu_cluster_options
-    } else {
-        container = 'genomicpariscentre/guppy:4.0.14'
-    }
-
-    input:
-    path(input_path), stageAs: 'input_path/*'
-    val meta
-    path guppy_config
-    path guppy_model
-
-    output:
-    path "fastq/*.fastq.gz"                    , emit: fastq
-    tuple val(meta), path("basecalling/*.txt") , emit: summary
-    path "basecalling/*"                       , emit: called
-    path "versions.yml"                        , emit: versions
-
-    script:
-    def barcode_kit  = params.barcode_kit ? "--barcode_kits $params.barcode_kit" : ""
-    def barcode_ends = params.barcode_both_ends ? "--require_barcodes_both_ends" : ""
-    def proc_options = params.guppy_gpu ? "--device $params.gpu_device --num_callers $task.cpus --cpu_threads_per_caller $params.guppy_cpu_threads --gpu_runners_per_device $params.guppy_gpu_runners" : "--num_callers 2 --cpu_threads_per_caller ${task.cpus/2}"
-    def config   = "--flowcell $params.flowcell --kit $params.kit"
-    if (params.guppy_config) config = file(params.guppy_config).exists() ? "--config ./$guppy_config" : "--config $params.guppy_config"
-    def model    = ""
-    if (params.guppy_model)  model  = file(params.guppy_model).exists() ? "--model ./$guppy_model" : "--model $params.guppy_model"
-    """
-    guppy_basecaller \\
-        --input_path input_path \\
-        --save_path ./basecalling \\
-        --records_per_fastq 0 \\
-        --compress_fastq \\
-        $barcode_kit \\
-        $proc_options \\
-        $barcode_ends \\
-        $config \\
-        $model
-
-    cat <<-END_VERSIONS > versions.yml
-    "${task.process}":
-        guppy: \$(echo \$(guppy_basecaller --version 2>&1) | sed -r 's/.{81}//')
-    END_VERSIONS
-
-    ## Concatenate fastq files
-    mkdir fastq
-    cd basecalling
-    if [ "\$(find . -type d -name "barcode*" )" != "" ]
-    then
-        for dir in barcode*/
-        do
-            dir=\${dir%*/}
-            cat \$dir/*.fastq.gz > ../fastq/\$dir.fastq.gz
-        done
-    else
-        cat *.fastq.gz > ../fastq/${meta.id}.fastq.gz
-    fi
-    """
+	label 'process_medium'
+
+	if (params.guppy_gpu) {
+		container = 'genomicpariscentre/guppy-gpu:5.0.16'
+		clusterOptions = params.gpu_cluster_options
+	} else {
+		container = 'genomicpariscentre/guppy:5.0.16'
+	}
+
+
+	input:
+	path(input_path), stageAs: 'input_path/*'
+	val meta
+	path guppy_config
+	path guppy_model
+
+	output:
+	path "fastq/*.fastq.gz"                    , emit: fastq
+	tuple val(meta), path("basecalling/*.txt") , emit: summary
+	path "basecalling/*"                       , emit: called
+	path "versions.yml"                        , emit: versions
+
+
+	script:
+	def trim_barcodes = params.trim_barcodes ? "--trim_barcodes" : ""
+	def barcode_kit  = params.barcode_kit ? "--barcode_kits $params.barcode_kit" : ""
+	def barcode_ends = params.barcode_both_ends ? "--require_barcodes_both_ends" : ""
+	def proc_options = params.guppy_gpu ? "--device $params.gpu_device --num_callers $task.cpus --cpu_threads_per_caller $params.guppy_cpu_threads --gpu_runners_per_device $params.guppy_gpu_runners" : "--num_callers 2 --cpu_threads_per_caller ${task.cpus/2}"
+	def config   = "--flowcell $params.flowcell --kit $params.kit"
+	if (params.guppy_config) config = file(params.guppy_config).exists() ? "--config ./$guppy_config" : "--config $params.guppy_config"
+	def model    = ""
+	if (params.guppy_model)  model  = file(params.guppy_model).exists() ? "--model ./$guppy_model" : "--model $params.guppy_model"
+	"""
+	guppy_basecaller \\
+		--input_path input_path \\
+		--save_path ./basecalling \\
+		--records_per_fastq 0 \\
+		--compress_fastq \\
+		$barcode_kit $trim_barcodes \\
+		$proc_options \\
+		$barcode_ends \\
+		$config \\
+		$model
+
+	cat <<-END_VERSIONS > versions.yml
+	"${task.process}":
+	    guppy: \$(echo \$(guppy_basecaller --version 2>&1) | sed -r 's/.{81}//')
+	END_VERSIONS
+
+	## Concatenate fastq files
+	mkdir fastq
+	cd basecalling
+	if [ "\$(find . -type d -name "barcode*" )" != "" ]
+	then
+		for dir in pass/barcode*/
+		do
+			dir=\$(basename \${dir%*/})
+			cat pass/\$dir/*.fastq.gz > ../fastq/\$dir.fastq.gz
+		done
+	else
+		cat *.fastq.gz > ../fastq/${meta.id}.fastq.gz
+	fi
+	"""
 }
diff --git a/nextflow.config b/nextflow.config
@@ -19,13 +19,15 @@ params {
     kit                        = null
     barcode_kit                = null
     barcode_both_ends          = false
+    trim_barcodes              = false
     guppy_config               = null
     guppy_model                = null
     guppy_gpu                  = false
     guppy_gpu_runners          = 6
     guppy_cpu_threads          = 1
     gpu_device                 = 'auto'
     gpu_cluster_options        = null
+    output_demultiplex_fast5   = false
     qcat_min_score             = 60
     qcat_detect_middle         = false
     skip_basecalling           = false

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -88,6 +88,11 @@
                     "fa_icon": "fas fa-barcode",
                     "description": "Require barcode on both ends for Guppy basecaller."
                 },
+                "trim_barcodes": {
+                    "type": "boolean",
+                    "fa_icon": "fas fa-barcode",
+                    "description": "Whether to trim the barcodes from the output sequences in the FastQ files from Guppy basecaller."
+                },
                 "guppy_config": {
                     "type": "string",
                     "help_text": "Cannot be used in conjunction with `--flowcell` and `--kit`. This can be a local file (e.g. `/your/dir/guppy_conf.cfg`) or a string specifying a configuration stored in the `/opt/ont/guppy/data/` directory of Guppy.",
@@ -127,6 +132,11 @@
                     "type": "string",
                     "description": "Cluster options required to use GPU resources (e.g. '--part=gpu --gres=gpu:1').",
                     "fa_icon": "fas fa-fish"
+                },
+                "output_demultiplex_fast5": {
+                    "type": "boolean",
+                    "description": "Output demultiplexed fast5 files with demux_fast5.",
+                    "fa_icon": "fas fa-file-code"
                 },
                 "qcat_min_score": {
                     "type": "integer",
@@ -149,6 +159,7 @@
                     "description": "Skip demultiplexing with Guppy/qcat.",
                     "fa_icon": "fas fa-fast-forward"
                 },
+
                 "run_nanolyse": {
                     "type": "boolean",
                     "description": "Filter reads from FastQ files using NanoLyse",