nf-core · daisymut · Nov 17, 2023 · Oct 29, 2023 · Oct 29, 2023 · Oct 31, 2023
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -1 +1,4 @@
 repository_type: pipeline
+lint:
+  multiqc_config:
+    report_comment: False
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -12,7 +12,7 @@
 
 - [BWA](https://www.ncbi.nlm.nih.gov/pubmed/19451168/)
 
-> Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PubMed PMID: 19451168; PubMed Central PMCID: PMC2705234.
+  > Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PubMed PMID: 19451168; PubMed Central PMCID: PMC2705234.
 
 - [deepTools](https://www.ncbi.nlm.nih.gov/pubmed/27079975/)
 
@@ -32,7 +32,9 @@
 
   > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
 
-- [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
+- [Trimmomatic](https://pubmed.ncbi.nlm.nih.gov/24695404/)
+
+  > Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID: 24695404; PMCID: PMC4103590.
 
 ## R packages
 

diff --git a/README.md b/README.md
@@ -46,9 +46,6 @@ to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/i
 with `-profile test` before running the workflow on actual data.
 :::
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate -->
-
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
@@ -64,11 +61,10 @@ Each row represents a fastq file (single-end) or a pair of fastq files (paired e
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/sammyseq \
    -profile <docker/singularity/.../institute> \
+   --fasta reference_genome.fa \
    --input samplesheet.csv \
    --outdir <OUTDIR>
 ```
@@ -78,19 +74,12 @@ or
 ```bash
 nextflow run nf-core/sammyseq \
    -profile <docker/singularity/.../institute> \
+   --fasta reference_genome.fa \
    --input samplesheet.csv \
    --outdir <OUTDIR> \
    --comparisonFile comparisons.csv
 ```
 
-`comparisons.csv`:
-
-```csv
-sample1,sample2
-CTRL004_S2,CTRL004_S3
-CTRL004_S2,CTRL004_S4
-```
-
 :::warning
 Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
 provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,8 +1,8 @@
 report_comment: >
 
-  This report has been generated by the <a href="https://github.com/nf-core/sammyseq/releases/tag/0.01" target="_blank">nf-core/sammyseq</a>
+  This report has been generated by the <a href="https://github.com/nf-core/sammyseq/tree/dev" target="_blank">nf-core/sammyseq</a>
   analysis pipeline. For information about how to interpret these results, please see the
-  <a href="https://nf-co.re/sammyseq/0.01/docs/output" target="_blank">documentation</a>.
+  <a href="https://nf-co.re/sammyseq/dev/docs/output" target="_blank">documentation</a>.
 
 report_section_order:
   "nf-core-sammyseq-methods-description":

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,3 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,fastq_1,fastq_2,experimentalID,fraction
+SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,SAMPLE_PAIRED_END_EXPID,S2
+SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,SAMPLE_SINGLE_END_EXPID,S2
diff --git a/conf/base.config b/conf/base.config
@@ -10,7 +10,7 @@
 
 process {
 
-    // TODO nf-core: Check the defaults for all processes
+    // nf-core: Check the defaults for all processes
     cpus   = { check_max( 1    * task.attempt, 'cpus'   ) }
     memory = { check_max( 6.GB * task.attempt, 'memory' ) }
     time   = { check_max( 4.h  * task.attempt, 'time'   ) }
@@ -24,7 +24,7 @@ process {
     //        These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
     //        If possible, it would be nice to keep the same label naming convention when
     //        adding in your local modules too.
-    // TODO nf-core: Customise requirements for specific processes.
+    // nf-core: Customise requirements for specific processes.
     // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
     withLabel:process_single {
         cpus   = { check_max( 1                  , 'cpus'    ) }

diff --git a/conf/modules.config b/conf/modules.config
@@ -26,6 +26,14 @@ process {
         ]
     }
 
+    withName : ".*PREPARE_GENOME:.*" {
+        publishDir = [
+            path: { "${params.outdir}/genome" },
+            mode: params.publish_dir_mode,
+            enabled: params.save_reference
+        ]
+    }
+
     withName: FASTQC {
         ext.args = '--quiet'
     }
@@ -58,20 +66,20 @@ process {
     // Alignment, Picard MarkDuplicates and Filtering options
     //
 
-    withName: '.*FASTQ_ALIGN_BWAALN:BWA_.*' {
+    withName: '.*FASTQ_ALIGN_BWAALN:.*' {
         publishDir = [
             [
-                path: { "${params.outdir}/BWA" },
+                path: { "${params.outdir}/alignment/bwa" },
                 mode: params.publish_dir_mode,
-                pattern: '*.bam*',
+                pattern: '*.{bam,bai}',
                 saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
             ]
         ]
     }
 
 
-    withName: '.*BAM_MARKDUPLICATES_PICARD:PICARD_MARKDUPLICATES' {
-        ext.args   = '--ASSUME_SORTED true --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
+    withName: '.*BAM_MARKDUPLICATES_PICARD:PICARD_MARKDUPLICATES.*' {
+        ext.args   = '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
         ext.prefix = { "${meta.id}.md" }
         publishDir = [
             [
@@ -81,7 +89,7 @@ process {
                 saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
             ],
             [
-                path: { "${params.outdir}/markduplicates/bam" },
+                path: { "${params.outdir}/alignment/markduplicates" },
                 mode: params.publish_dir_mode,
                 pattern: '*.md.{bam,bai}',
                 saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
@@ -90,10 +98,10 @@ process {
         ]
     }
 
-    withName: '.*BAM_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX' {
+    withName: '.*BAM_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX.*' {
         ext.prefix  = { "${meta.id}.markdup.sorted" }
         publishDir  = [
-            path: { "${params.outdir}/markduplicates/bam" },
+            path: { "${params.outdir}/alignment/markduplicates" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
             pattern: '*.{bai,csi}'
@@ -110,4 +118,39 @@ process {
         ]
     }
 
+    withName : ".*SAMTOOLS_FAIDX.*" {
+        publishDir = [
+            path: { "${params.outdir}/genome" },
+            mode: params.publish_dir_mode,
+        ]
+    }
+
+    withName : ".*DEEPTOOLS_BAMCOVERAGE.*" {
+
+        publishDir = [
+            path: { "${params.outdir}/single_tracks/deeptools" },
+            mode: params.publish_dir_mode,
+            pattern: '*.bigWig'
+        ]
+    }
+
+    withName : ".*RTWOSAMPLESMLE.*" {
+        publishDir = [
+            path: { "${params.outdir}/comparisons/spp_mle" },
+            mode: params.publish_dir_mode,
+        ]
+    }
+
+
+}
+
+if (params.blackListFile != null) {
+    process {
+        withName: '.*DEEPTOOLS_BAMCOVERAGE.*' {
+            ext.args = "--blackListFileName ${params.blackListFile}"
+        }
+    }
 }
+
+
+
diff --git a/conf/test.config b/conf/test.config
@@ -23,5 +23,5 @@ params {
     input  = 'https://genome.isasi.cnr.it/biocomp/test-datasets/sammyseq/testdata/chr22/samplesheet_test_github_chr22_tinier.csv'
 
     // Genome references
-    fasta  = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta'
+    fasta  = 'https://genome.isasi.cnr.it/biocomp/test-datasets/sammyseq/testdata/chr22/chr22.fa'
 }
diff --git a/docs/output.md b/docs/output.md
@@ -6,41 +6,100 @@ This document describes the output produced by the pipeline. Most of the plots a
 
 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
 
-<!-- TODO nf-core: Write this documentation describing your workflow's output -->
-
 ## Pipeline overview
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [FastQC](#fastqc) - Raw read QC
-- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
-- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+- [FastQC](#fastqc)
+- [Trim reads](#trim-reads)
+- [Alignment on Reference](#alignment-on-reference)
+- [Mark Duplicate reads](#mark-duplicate-reads)
+- [Signal track generation](#signal-track-generation)
+- [Comparisons](#comparisons)
+- [MultiQC](#multiqc)
+- [Pipeline information](#pipeline-information)
+
+### Read quality check
+
+#### FastQC
+
+[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about the sequenced reads. It provides information about the quality score distribution across reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
 
-### FastQC
+#### Trim reads
+
+[`Trimmomatic`](http://www.usadellab.org/cms/?page=trimmomatic) trims adapter sequences and low-quality bases from the ends of reads. After trimming, read quality is checked again with FastQC.
 
 <details markdown="1">
 <summary>Output files</summary>
 
 - `fastqc/`
   - `*_fastqc.html`: FastQC report containing quality metrics.
   - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+  - `*_trim_fastqc.html`: FastQC report containing quality metrics for trimmed reads.
+  - `*_trim_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for trimmed reads.
 
 </details>
 
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+:::note
+The FastQC plots displayed in the MultiQC report show both _untrimmed_ and _trimmed_ reads so they can be directly compared.
+:::
+
+### Alignment on Reference
 
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
+The alignment is performed using [BWA](https://github.com/lh3/bwa) and the aligned reads are then sorted by chromosome coordinates with [samtools](https://www.htslib.org/doc/samtools.html).
 
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+<details markdown="1">
+<summary>Output files</summary>
 
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+- `alignment/bwa/`
+  - `<sample>.bam` and `<sample>.bam.bai`
 
-:::note
-The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
-:::
+</details>
+
+### Mark Duplicate reads
+
+Read pairs that are likely to have originated from the same original DNA fragment through some artificial process are identified as duplicates. These are considered non-independent observations, so all but one read pair within each set of duplicates are marked. Marked duplicates are not removed from the BAM file.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `alignment/markduplicates/`
+  - `<sample>.md.bam` and `<sample>.md.bam.bai`
+- `reports/markduplicates/`
+  - `<sample>.md.MarkDuplicates.metrics.txt`
+
+</details>
+
+### Signal track generation
+
+[deepTools](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) is used to generate single-fraction signal tracks in [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format, an indexed binary format useful for displaying dense, continuous data in genome browsers such as [UCSC](https://genome.ucsc.edu/cgi-bin/hgTracks) and [IGV](http://software.broadinstitute.org/software/igv/). The bigWig format is also supported by various bioinformatics software for downstream processing such as meta-profile plotting.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `single_tracks/deeptools/`
+  - `<sample>.bigWig`
+
+</details>
+
+### Comparisons
+
+When `--comparisonFile` is set, the pipeline calculates the difference between the read density profiles of sample1 and sample2, each smoothed with a Gaussian kernel, and saves it in bigWig format, as described in Kharchenko PK, Tolstorukov MY, Park PJ. "Design and analysis of ChIP-seq experiments for DNA-binding proteins". Nat. Biotech. doi:10.1038/nbt.1508.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `comparisons/spp_mle/`
+  - `<sample1>.md_VS_<sample2>.md.bw`
+
+</details>
 
 ### MultiQC
 
+[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+
+Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
+
 <details markdown="1">
 <summary>Output files</summary>
 
@@ -51,12 +110,22 @@ The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They m
 
 </details>
 
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+### Reference genome files
 
-Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
+A number of genome-specific files are required by some of the analysis steps. If the `--save_reference` parameter is provided, the alignment indices generated by the pipeline will be saved in this directory.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `genome/`
+  - `bwa/`: Directory containing BWA indices.
+
+</details>
 
 ### Pipeline information
 
+[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
+
 <details markdown="1">
 <summary>Output files</summary>
 
@@ -67,5 +136,3 @@ Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQ
   - Parameters used by the pipeline run: `params.json`.
 
 </details>
-
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.