Merge pull request nf-core#157 from ekushele/master
Output demultiplexed fast5 files
yuukiiwa authored Jan 24, 2022
2 parents 72e4588 + 105669a commit 9bcf02f
Showing 9 changed files with 204 additions and 116 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,13 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [3.0.1] - ?

### Major enhancements

* Add `demux_fast5` module to output demultiplexed fast5 files when `--output_demultiplex_fast5` is set
* Add `--trim_barcodes` in Guppy basecaller to trim the barcodes from the output fastq

## [3.0.0] - ?

### Major enhancements
4 changes: 2 additions & 2 deletions README.md
@@ -32,7 +32,7 @@ On release, automated continuous integration tests run the pipeline on a [full-s

## Pipeline Summary

1. Basecalling and/or demultiplexing ([`Guppy`](https://nanoporetech.com/nanopore-sequencing-data-analysis) or [`qcat`](https://github.com/nanoporetech/qcat); *optional*)
1. Basecalling and/or demultiplexing ([`Guppy`](https://nanoporetech.com/nanopore-sequencing-data-analysis), [`demux_fast5`](https://github.com/nanoporetech/ont_fast5_api#demux_fast5) or [`qcat`](https://github.com/nanoporetech/qcat); *optional*)
2. Sequencing QC ([`pycoQC`](https://github.com/a-slide/pycoQC), [`NanoPlot`](https://github.com/wdecoster/NanoPlot))
3. Raw read DNA cleaning ([NanoLyse](https://github.com/wdecoster/nanolyse); *optional*)
4. Raw read QC ([`NanoPlot`](https://github.com/wdecoster/NanoPlot), [`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
@@ -96,7 +96,7 @@ An example input samplesheet for performing both basecalling and demultiplexing

nf-core/nanoseq was originally written by [Chelsea Sawyer](https://github.com/csawye01) and [Harshil Patel](https://github.com/drpatelh) from [The Bioinformatics & Biostatistics Group](https://www.crick.ac.uk/research/science-technology-platforms/bioinformatics-and-biostatistics/) for use at [The Francis Crick Institute](https://www.crick.ac.uk/), London. Other primary contributors include [Laura Wratten](https://github.com/lwratten), [Ying Chen](https://github.com/cying111), [Yuk Kei Wan](https://github.com/yuukiiwa) and [Jonathan Goeke](https://github.com/jonathangoeke) from the [Genome Institute of Singapore](https://www.a-star.edu.sg/gis), [Christopher Hakkaart](https://github.com/christopher-hakkaart) from [Institute of Medical Genetics and Applied Genomics](https://www.medizin.uni-tuebingen.de/de/das-klinikum/einrichtungen/institute/medizinische-genetik-und-angewandte-genomik), Germany, [Johannes Alneberg](https://github.com/alneberg) and [Franziska Bonath](https://github.com/FranBonath) from [SciLifeLab](https://www.scilifelab.se/), Sweden.

Many thanks to others who have helped out along the way too, including (but not limited to): [@crickbabs](https://github.com/crickbabs), [@AnnaSyme](https://github.com/AnnaSyme).
Many thanks to others who have helped out along the way too, including (but not limited to): [@crickbabs](https://github.com/crickbabs), [@AnnaSyme](https://github.com/AnnaSyme), [@ekushele](https://github.com/ekushele).

## Contributions and Support

15 changes: 14 additions & 1 deletion conf/modules.config
@@ -60,6 +60,19 @@ if (!params.skip_basecalling) {
}
}

if (!params.skip_basecalling && params.output_demultiplex_fast5) {
process {
withName: DEMUX_FAST5 {
publishDir = [
path: { "${params.outdir}/demux_fast5" },
mode: 'copy',
enabled: true,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
}
}

if (params.skip_basecalling && !params.skip_demultiplexing) {
process {
withName: QCAT {
@@ -605,4 +618,4 @@ if (!params.skip_multiqc) {
]
}
}
}
}
8 changes: 7 additions & 1 deletion docs/output.md
@@ -31,6 +31,10 @@ The directories listed below will be created in the output directory after the p
Sequencing telemetry file generated by *Guppy*.
* `guppy/basecalling/guppy_basecaller_log-<date>.log`
Log file for *Guppy* execution.
* `demux_fast5/demultiplexed_fast5/<barcode*>/`
fast5 output files for each barcode.
* `demux_fast5/demultiplexed_fast5/unclassified/`
fast5 files with reads that were not assigned to any given barcode.
* `qcat/fastq/<barcode*>.fastq.gz`
fastq output files for each barcode.
* `qcat/fastq/none.fastq.gz`
@@ -39,13 +43,15 @@ The directories listed below will be created in the output directory after the p
</details>

*Documentation*:
[Guppy](https://nanoporetech.com/nanopore-sequencing-data-analysis), [qcat](https://github.com/nanoporetech/qcat)
[Guppy](https://nanoporetech.com/nanopore-sequencing-data-analysis), [demux_fast5](https://github.com/nanoporetech/ont_fast5_api#demux_fast5), [qcat](https://github.com/nanoporetech/qcat)

*Description*:
The pipeline has been written to deal with the various scenarios where you would like to include/exclude the basecalling and demultiplexing steps. This will be dependent on what type of input data you would like to provide the pipeline. Additionally, if you would like to align your samples to a reference genome there are various options for providing this information. Please see [`usage.md`](usage.md#--input) for more details about the format of the input samplesheet, associated commands and how to provide reference genome data.

*Guppy* will be used to basecall and demultiplex the data. Various options have been provided to customise specific parameters and to be able to run *Guppy* on GPUs.

*demux_fast5* will demultiplex the fast5 files using the *Guppy* summary file. It runs only when basecalling is not skipped and `--output_demultiplex_fast5` is set.

If you have a pre-basecalled fastq file then *qcat* will be used to perform the demultiplexing if you provide the `--skip_basecalling` parameter. If you would like to skip both of these steps entirely then you can provide `--skip_basecalling --skip_demultiplexing` when running the pipeline. As a result, the structure of the output folder will depend on which steps you have chosen to run in the pipeline.
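
As a rough illustration of how the two options added in this change might be combined, the sketch below assembles a pipeline command and prints it as a dry run instead of executing it. The samplesheet name, flowcell, kit, barcode kit and profile are placeholder assumptions for illustration; only `--trim_barcodes` and `--output_demultiplex_fast5` come from this commit.

```shell
#!/bin/sh
set -eu

# Placeholder values (assumptions for illustration, not taken from this commit).
samplesheet="samplesheet.csv"
barcode_kit="SQK-PBK004"

# Build the command as positional parameters so each argument stays one word.
#   --trim_barcodes             trim barcodes from the Guppy fastq output
#   --output_demultiplex_fast5  also publish per-barcode fast5 files
set -- nextflow run nf-core/nanoseq \
    --input "$samplesheet" \
    --protocol DNA \
    --flowcell FLO-MIN106 \
    --kit SQK-LSK109 \
    --barcode_kit "$barcode_kit" \
    --trim_barcodes \
    --output_demultiplex_fast5 \
    -profile docker

# Dry run: print the assembled command rather than launching the pipeline.
echo "$*"
```

Both flags default to `false` in `nextflow.config`, so leaving them out keeps the previous behaviour.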

## Removal of DNA contaminants
35 changes: 35 additions & 0 deletions modules/local/demux_fast5.nf
@@ -0,0 +1,35 @@
process DEMUX_FAST5 {
label 'process_medium'
publishDir "${params.outdir}",
mode: params.publish_dir_mode,
saveAs: { filename -> saveFiles(filename:filename, options:params.options, publish_dir:getSoftwareName(task.process)) }


conda (params.enable_conda ? "bioconda:ont-fast5-api:4.0.0--pyhdfd78af_0" : null)
if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
container "https://depot.galaxyproject.org/singularity/ont-fast5-api:4.0.0--pyhdfd78af_0"
} else {
container "quay.io/biocontainers/ont-fast5-api:4.0.0--pyhdfd78af_0"
}

input:
path(input_path), stageAs: 'input_path/*'
tuple val(meta), path(input_summary)

output:
path "demultiplexed_fast5/*" , emit: fast5
path "versions.yml" , emit: versions

script:
"""
demux_fast5 \\
--input input_path \\
--save_path ./demultiplexed_fast5 \\
--summary_file $input_summary
cat <<-END_VERSIONS > versions.yml
"${task.process}":
demux_fast5: \$(echo \$(python -c\'import ont_fast5_api;print(ont_fast5_api.__version__)\'))
END_VERSIONS
"""
}
123 changes: 63 additions & 60 deletions modules/local/guppy.nf
@@ -1,62 +1,65 @@
process GUPPY {
label 'process_medium'

if (params.guppy_gpu) {
container = 'genomicpariscentre/guppy-gpu:4.0.14'
clusterOptions = params.gpu_cluster_options
} else {
container = 'genomicpariscentre/guppy:4.0.14'
}

input:
path(input_path), stageAs: 'input_path/*'
val meta
path guppy_config
path guppy_model

output:
path "fastq/*.fastq.gz" , emit: fastq
tuple val(meta), path("basecalling/*.txt") , emit: summary
path "basecalling/*" , emit: called
path "versions.yml" , emit: versions

script:
def barcode_kit = params.barcode_kit ? "--barcode_kits $params.barcode_kit" : ""
def barcode_ends = params.barcode_both_ends ? "--require_barcodes_both_ends" : ""
def proc_options = params.guppy_gpu ? "--device $params.gpu_device --num_callers $task.cpus --cpu_threads_per_caller $params.guppy_cpu_threads --gpu_runners_per_device $params.guppy_gpu_runners" : "--num_callers 2 --cpu_threads_per_caller ${task.cpus/2}"
def config = "--flowcell $params.flowcell --kit $params.kit"
if (params.guppy_config) config = file(params.guppy_config).exists() ? "--config ./$guppy_config" : "--config $params.guppy_config"
def model = ""
if (params.guppy_model) model = file(params.guppy_model).exists() ? "--model ./$guppy_model" : "--model $params.guppy_model"
"""
guppy_basecaller \\
--input_path input_path \\
--save_path ./basecalling \\
--records_per_fastq 0 \\
--compress_fastq \\
$barcode_kit \\
$proc_options \\
$barcode_ends \\
$config \\
$model
cat <<-END_VERSIONS > versions.yml
"${task.process}":
guppy: \$(echo \$(guppy_basecaller --version 2>&1) | sed -r 's/.{81}//')
END_VERSIONS
## Concatenate fastq files
mkdir fastq
cd basecalling
if [ "\$(find . -type d -name "barcode*" )" != "" ]
then
for dir in barcode*/
do
dir=\${dir%*/}
cat \$dir/*.fastq.gz > ../fastq/\$dir.fastq.gz
done
else
cat *.fastq.gz > ../fastq/${meta.id}.fastq.gz
fi
"""
label 'process_medium'

if (params.guppy_gpu) {
container = 'genomicpariscentre/guppy-gpu:5.0.16'
clusterOptions = params.gpu_cluster_options
} else {
container = 'genomicpariscentre/guppy:5.0.16'
}


input:
path(input_path), stageAs: 'input_path/*'
val meta
path guppy_config
path guppy_model

output:
path "fastq/*.fastq.gz" , emit: fastq
tuple val(meta), path("basecalling/*.txt") , emit: summary
path "basecalling/*" , emit: called
path "versions.yml" , emit: versions


script:
def trim_barcodes = params.trim_barcodes ? "--trim_barcodes" : ""
def barcode_kit = params.barcode_kit ? "--barcode_kits $params.barcode_kit" : ""
def barcode_ends = params.barcode_both_ends ? "--require_barcodes_both_ends" : ""
def proc_options = params.guppy_gpu ? "--device $params.gpu_device --num_callers $task.cpus --cpu_threads_per_caller $params.guppy_cpu_threads --gpu_runners_per_device $params.guppy_gpu_runners" : "--num_callers 2 --cpu_threads_per_caller ${task.cpus/2}"
def config = "--flowcell $params.flowcell --kit $params.kit"
if (params.guppy_config) config = file(params.guppy_config).exists() ? "--config ./$guppy_config" : "--config $params.guppy_config"
def model = ""
if (params.guppy_model) model = file(params.guppy_model).exists() ? "--model ./$guppy_model" : "--model $params.guppy_model"
"""
guppy_basecaller \\
--input_path input_path \\
--save_path ./basecalling \\
--records_per_fastq 0 \\
--compress_fastq \\
$barcode_kit \\
$trim_barcodes \\
$proc_options \\
$barcode_ends \\
$config \\
$model
cat <<-END_VERSIONS > versions.yml
"${task.process}":
guppy: \$(echo \$(guppy_basecaller --version 2>&1) | sed -r 's/.{81}//')
END_VERSIONS
## Concatenate fastq files
mkdir fastq
cd basecalling
if [ "\$(find . -type d -name "barcode*" )" != "" ]
then
for dir in pass/barcode*/
do
dir=\$(basename \${dir%*/})
cat pass/\$dir/*.fastq.gz > ../fastq/\$dir.fastq.gz
done
else
cat *.fastq.gz > ../fastq/${meta.id}.fastq.gz
fi
"""
}
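
The updated concatenation loop reads per-barcode fastq chunks from a `pass/` subfolder and merges them into one file per barcode. A minimal standalone sketch of that step follows, assuming a scratch directory and made-up chunk file names and reads; it is not the module itself, only the loop run outside Nextflow.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch layout mimicking a Guppy save path with a pass/ folder.
workdir="$(mktemp -d)"
mkdir -p "$workdir/basecalling/pass/barcode01" \
         "$workdir/basecalling/pass/barcode02" \
         "$workdir/fastq"

# Made-up reads: two gzip chunks for barcode01, one for barcode02.
printf '@r1\nACGT\n+\n!!!!\n' | gzip -c > "$workdir/basecalling/pass/barcode01/chunk_0.fastq.gz"
printf '@r2\nTTGA\n+\n!!!!\n' | gzip -c > "$workdir/basecalling/pass/barcode01/chunk_1.fastq.gz"
printf '@r3\nGGCC\n+\n!!!!\n' | gzip -c > "$workdir/basecalling/pass/barcode02/chunk_0.fastq.gz"

cd "$workdir/basecalling"
for dir in pass/barcode*/; do
    # pass/barcode01/ -> barcode01
    name="$(basename "${dir%/}")"
    # Concatenated gzip members still form a valid gzip stream.
    cat "pass/$name"/*.fastq.gz > "../fastq/$name.fastq.gz"
done

# Count reads per merged file (4 lines per FASTQ record).
barcode01_reads=$(( $(gzip -dc ../fastq/barcode01.fastq.gz | wc -l) / 4 ))
barcode02_reads=$(( $(gzip -dc ../fastq/barcode02.fastq.gz | wc -l) / 4 ))
echo "barcode01: $barcode01_reads reads, barcode02: $barcode02_reads reads"
```

Taking the `basename` is what keeps the merged files named `barcode01.fastq.gz` rather than `pass/barcode01.fastq.gz`, which would point at a path that does not exist under `fastq/`.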
2 changes: 2 additions & 0 deletions nextflow.config
@@ -19,13 +19,15 @@ params {
kit = null
barcode_kit = null
barcode_both_ends = false
trim_barcodes = false
guppy_config = null
guppy_model = null
guppy_gpu = false
guppy_gpu_runners = 6
guppy_cpu_threads = 1
gpu_device = 'auto'
gpu_cluster_options = null
output_demultiplex_fast5 = false
qcat_min_score = 60
qcat_detect_middle = false
skip_basecalling = false
11 changes: 11 additions & 0 deletions nextflow_schema.json
@@ -88,6 +88,11 @@
"fa_icon": "fas fa-barcode",
"description": "Require barcode on both ends for Guppy basecaller."
},
"trim_barcodes": {
"type": "boolean",
"fa_icon": "fas fa-barcode",
"description": "Whether to trim the barcodes from the output sequences in the FastQ files from Guppy basecaller."
},
"guppy_config": {
"type": "string",
"help_text": "Cannot be used in conjunction with `--flowcell` and `--kit`. This can be a local file (e.g. `/your/dir/guppy_conf.cfg`) or a string specifying a configuration stored in the `/opt/ont/guppy/data/` directory of Guppy.",
@@ -127,6 +132,11 @@
"type": "string",
"description": "Cluster options required to use GPU resources (e.g. '--part=gpu --gres=gpu:1').",
"fa_icon": "fas fa-fish"
},
"output_demultiplex_fast5": {
"type": "boolean",
"description": "Output demultiplexed fast5 files with demux_fast5.",
"fa_icon": "fas fa-file-code"
},
"qcat_min_score": {
"type": "integer",
@@ -149,6 +159,7 @@
"description": "Skip demultiplexing with Guppy/qcat.",
"fa_icon": "fas fa-fast-forward"
},

"run_nanolyse": {
"type": "boolean",
"description": "Filter reads from FastQ files using NanoLyse",