Merge lanes pe #18

Merged 24 commits on Nov 17, 2023
3 changes: 3 additions & 0 deletions .nf-core.yml
@@ -1 +1,4 @@
repository_type: pipeline
lint:
multiqc_config:
report_comment: False
6 changes: 4 additions & 2 deletions CITATIONS.md
@@ -12,7 +12,7 @@

- [BWA](https://www.ncbi.nlm.nih.gov/pubmed/19451168/)

> Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PubMed PMID: 19451168; PubMed Central PMCID: PMC2705234.
> Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PubMed PMID: 19451168; PubMed Central PMCID: PMC2705234.

- [deepTools](https://www.ncbi.nlm.nih.gov/pubmed/27079975/)

@@ -32,7 +32,9 @@

> Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
- [Trimmomatic](https://pubmed.ncbi.nlm.nih.gov/24695404/)

> Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID: 24695404; PMCID: PMC4103590.

## R packages

15 changes: 2 additions & 13 deletions README.md
@@ -46,9 +46,6 @@ to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/i
with `-profile test` before running the workflow on actual data.
:::

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate -->

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:
@@ -64,11 +61,10 @@ Each row represents a fastq file (single-end) or a pair of fastq files (paired e

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/sammyseq \
-profile <docker/singularity/.../institute> \
--fasta reference_genome.fa \
--input samplesheet.csv \
--outdir <OUTDIR>
```
@@ -78,19 +74,12 @@ or
```bash
nextflow run nf-core/sammyseq \
-profile <docker/singularity/.../institute> \
--fasta reference_genome.fa \
--input samplesheet.csv \
--outdir <OUTDIR> \
--comparisonFile comparisons.csv
```

`comparisons.csv`:

```csv
sample1,sample2
CTRL004_S2,CTRL004_S3
CTRL004_S2,CTRL004_S4
```

:::warning
Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
4 changes: 2 additions & 2 deletions assets/multiqc_config.yml
@@ -1,8 +1,8 @@
report_comment: >

This report has been generated by the <a href="https://github.com/nf-core/sammyseq/releases/tag/0.01" target="_blank">nf-core/sammyseq</a>
This report has been generated by the <a href="https://github.com/nf-core/sammyseq/tree/dev" target="_blank">nf-core/sammyseq</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://nf-co.re/sammyseq/0.01/docs/output" target="_blank">documentation</a>.
<a href="https://nf-co.re/sammyseq/dev/docs/output" target="_blank">documentation</a>.

report_section_order:
"nf-core-sammyseq-methods-description":
6 changes: 3 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,3 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
sample,fastq_1,fastq_2,experimentalID,fraction
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,SAMPLE_PAIRED_END_EXPID,S2
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,SAMPLE_SINGLE_END_EXPID,S2
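The updated header adds the `experimentalID` and `fraction` columns. A minimal Python sketch of a pre-flight check against that header is given below. The helper name `check_samplesheet` and the rule that only `fastq_2` may be left empty are assumptions inferred from the two example rows, not the pipeline's own validation logic.

```python
import csv
import io

# Column names copied from the updated assets/samplesheet.csv header.
EXPECTED_COLUMNS = ["sample", "fastq_1", "fastq_2", "experimentalID", "fraction"]


def check_samplesheet(text):
    """Return the rows as dicts, raising ValueError on a malformed sheet.

    Hypothetical helper for illustration; the pipeline's real checks may differ.
    """
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        # Assumption: only fastq_2 may be empty (single-end samples).
        for column in EXPECTED_COLUMNS:
            if column != "fastq_2" and not row[column]:
                raise ValueError(f"line {line_no}: empty '{column}'")
        rows.append(row)
    return rows


example = (
    "sample,fastq_1,fastq_2,experimentalID,fraction\n"
    "SAMPLE_PAIRED_END,/path/R1.fastq.gz,/path/R2.fastq.gz,SAMPLE_PAIRED_END_EXPID,S2\n"
    "SAMPLE_SINGLE_END,/path/R1.fastq.gz,,SAMPLE_SINGLE_END_EXPID,S2\n"
)
rows = check_samplesheet(example)
print(len(rows), "rows passed the check")
```

A sheet that still uses the old three-column header would be rejected by the header comparison before any row is read.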
4 changes: 2 additions & 2 deletions conf/base.config
@@ -10,7 +10,7 @@

process {

// TODO nf-core: Check the defaults for all processes
// nf-core: Check the defaults for all processes
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
@@ -24,7 +24,7 @@ process {
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// TODO nf-core: Customise requirements for specific processes.
// nf-core: Customise requirements for specific processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
59 changes: 51 additions & 8 deletions conf/modules.config
@@ -26,6 +26,14 @@ process {
]
}

withName : ".*PREPARE_GENOME:.*" {
publishDir = [
path: { "${params.outdir}/genome" },
mode: params.publish_dir_mode,
enabled: params.save_reference
]
}

withName: FASTQC {
ext.args = '--quiet'
}
@@ -58,20 +66,20 @@ process {
// Alignment, Picard MarkDuplicates and Filtering options
//

withName: '.*FASTQ_ALIGN_BWAALN:BWA_.*' {
withName: '.*FASTQ_ALIGN_BWAALN:.*' {
publishDir = [
[
path: { "${params.outdir}/BWA" },
path: { "${params.outdir}/alignment/bwa" },
mode: params.publish_dir_mode,
pattern: '*.bam*',
pattern: '*.{bam,bai}',
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
]
}


withName: '.*BAM_MARKDUPLICATES_PICARD:PICARD_MARKDUPLICATES' {
ext.args = '--ASSUME_SORTED true --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
withName: '.*BAM_MARKDUPLICATES_PICARD:PICARD_MARKDUPLICATES.*' {
ext.args = '--ASSUME_SORTED true --REMOVE_DUPLICATES false --VALIDATION_STRINGENCY LENIENT --TMP_DIR tmp'
ext.prefix = { "${meta.id}.md" }
publishDir = [
[
@@ -81,7 +89,7 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
],
[
path: { "${params.outdir}/markduplicates/bam" },
path: { "${params.outdir}/alignment/markduplicates" },
mode: params.publish_dir_mode,
pattern: '*.md.{bam,bai}',
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
@@ -90,10 +98,10 @@ process {
]
}

withName: '.*BAM_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX' {
withName: '.*BAM_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX.*' {
ext.prefix = { "${meta.id}.markdup.sorted" }
publishDir = [
path: { "${params.outdir}/markduplicates/bam" },
path: { "${params.outdir}/alignment/markduplicates" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
pattern: '*.{bai,csi}'
@@ -110,4 +118,39 @@ process {
]
}

withName : ".*SAMTOOLS_FAIDX.*" {
publishDir = [
path: { "${params.outdir}/genome" },
mode: params.publish_dir_mode,
]
}

withName : ".*DEEPTOOLS_BAMCOVERAGE.*" {

publishDir = [
path: { "${params.outdir}/single_tracks/deeptools" },
mode: params.publish_dir_mode,
pattern: '*.bigWig'
]
}

withName : ".*RTWOSAMPLESMLE.*" {
publishDir = [
path: { "${params.outdir}/comparisons/spp_mle" },
mode: params.publish_dir_mode,
]
}


}

if (params.blackListFile != null) {
process {
withName: '.*DEEPTOOLS_BAMCOVERAGE.*' {
ext.args = "--blackListFileName ${params.blackListFile}"
}
}
}



2 changes: 1 addition & 1 deletion conf/test.config
@@ -23,5 +23,5 @@ params {
input = 'https://genome.isasi.cnr.it/biocomp/test-datasets/sammyseq/testdata/chr22/samplesheet_test_github_chr22_tinier.csv'

// Genome references
fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta'
fasta = 'https://genome.isasi.cnr.it/biocomp/test-datasets/sammyseq/testdata/chr22/chr22.fa'
}
101 changes: 84 additions & 17 deletions docs/output.md
@@ -6,41 +6,100 @@ This document describes the output produced by the pipeline. Most of the plots a

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

<!-- TODO nf-core: Write this documentation describing your workflow's output -->

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - Raw read QC
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
- [FastQC](#fastqc)
- [Trim reads](#trim-reads)
- [Alignment on Reference](#alignment-on-reference)
- [Mark Duplicate reads](#mark-duplicate-reads)
- [Signal track generation](#signal-track-generation)
- [Comparisons](#comparisons)
- [MultiQC](#multiqc)
- [Pipeline information](#pipeline-information)

### Read quality check

#### FastQC

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about the sequenced reads. It provides information about the quality score distribution across reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).

### FastQC
#### Trim reads

[`Trimmomatic`](http://www.usadellab.org/cms/?page=trimmomatic) trims adapter sequences and low-quality bases from the ends of reads. After trimming, the quality check is run again with FastQC.

<details markdown="1">
<summary>Output files</summary>

- `fastqc/`
- `*_fastqc.html`: FastQC report containing quality metrics.
- `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
- `*_trim_fastqc.html`: FastQC report containing quality metrics for trimmed reads.
- `*_trim_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images for trimmed reads.

</details>

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
:::note
The FastQC plots displayed in the MultiQC report show both _untrimmed_ and _trimmed_ reads so that they can be compared directly.
:::

### Alignment on Reference

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
The alignment is performed using [BWA](https://github.com/lh3/bwa) and the aligned reads are then sorted by chromosome coordinates with [samtools](https://www.htslib.org/doc/samtools.html).

![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
<details markdown="1">
<summary>Output files</summary>

![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
- `alignment/bwa/`
- `<sample>.bam` and `<sample>.bam.bai`

:::note
The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
:::
</details>

### Mark Duplicate reads

Read pairs that are likely to have originated from duplicates of the same original DNA fragment, produced by artificial processes such as PCR amplification, are identified. They are treated as non-independent observations, so all but one read pair in each set of duplicates are marked as duplicates. Marked reads are flagged, not removed from the BAM file.

<details markdown="1">
<summary>Output files</summary>

- `alignment/markduplicates/`
- `<sample>.md.bam` and `<sample>.md.bam.bai`
- `reports/markduplicates/`
- `<sample>.md.MarkDuplicates.metrics.txt`

</details>
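Because duplicates are flagged rather than dropped, downstream tools can still identify them from the SAM FLAG field, where bit `0x400` means "PCR or optical duplicate" in the SAM specification. The Python sketch below counts flagged records in SAM-formatted text, for example the output of `samtools view -h <sample>.md.bam`. The helper name `count_duplicates` and the four example records are illustrative and are not pipeline output.

```python
# Bit 0x400 (decimal 1024) of the SAM FLAG field marks a PCR or optical duplicate.
DUPLICATE_FLAG = 0x400


def count_duplicates(sam_lines):
    """Return (total_records, duplicate_records) for an iterable of SAM text lines.

    Hypothetical helper for illustration only.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return total, duplicates


# Made-up records: pair r2 carries the same flags as pair r1 plus 1024.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr22\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII",
    "r2\t1123\tchr22\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII",
    "r1\t147\tchr22\t200\t60\t10M\t=\t100\t-110\tACGTACGTAC\tIIIIIIIIII",
    "r2\t1171\tchr22\t200\t60\t10M\t=\t100\t-110\tACGTACGTAC\tIIIIIIIIII",
]
total, duplicates = count_duplicates(sam_lines)
print(f"{duplicates} of {total} records are marked as duplicates")
```

Tools that should ignore duplicates can filter on the same bit, for instance by excluding flag 1024 when reading the BAM file.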

### Signal track generation

[deepTools](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) is used to generate single fraction signals in [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format, an indexed binary format useful for displaying dense, continuous data in Genome Browsers such as the [UCSC](https://genome.ucsc.edu/cgi-bin/hgTracks) and [IGV](http://software.broadinstitute.org/software/igv/). The bigWig format is also supported by various bioinformatics software for downstream processing such as meta-profile plotting.

<details markdown="1">
<summary>Output files</summary>

- `single_tracks/deeptools/`
- `<sample>.bigWig`

</details>

### Comparisons

When `--comparisonFile` is set, the difference between the read density profiles of sample1 and sample2, each smoothed with a Gaussian kernel, is calculated and saved in bigWig format, as described in Kharchenko PV, Tolstorukov MY, Park PJ. "Design and analysis of ChIP-seq experiments for DNA-binding proteins". Nat. Biotech. doi:10.1038/nbt.1508

<details markdown="1">
<summary>Output files</summary>

- `comparisons/spp_mle/`
- `<sample1>.md_VS_<sample2>.md.bw`

</details>
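The naming pattern above can be sketched in Python to predict which tracks a given comparison file should yield. The sketch assumes the two-column `sample1,sample2` layout of the `comparisons.csv` example in the README hunk of this diff. The helper name `expected_comparison_tracks` is hypothetical.

```python
import csv
import io


def expected_comparison_tracks(comparisons_csv):
    """List the bigWig file names expected under comparisons/spp_mle/.

    Hypothetical helper: it applies the documented pattern
    <sample1>.md_VS_<sample2>.md.bw to each row of the comparison file.
    """
    reader = csv.DictReader(io.StringIO(comparisons_csv))
    return [f"{row['sample1']}.md_VS_{row['sample2']}.md.bw" for row in reader]


# Same content as the comparisons.csv example in the README hunk.
comparisons = (
    "sample1,sample2\n"
    "CTRL004_S2,CTRL004_S3\n"
    "CTRL004_S2,CTRL004_S4\n"
)
tracks = expected_comparison_tracks(comparisons)
for name in tracks:
    print(name)
```

Comparing this list with the contents of `comparisons/spp_mle/` after a run is a quick way to spot a comparison that did not produce a track.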

### MultiQC

[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.

<details markdown="1">
<summary>Output files</summary>

@@ -51,12 +110,22 @@ The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They m

</details>

[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
### Reference genome files

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
A number of genome-specific files are required by some of the analysis steps. If the `--save_reference` parameter is provided, the alignment indices generated by the pipeline will be saved in this directory.

<details markdown="1">
<summary>Output files</summary>

- `genome/`
- `bwa/`: Directory containing BWA indices.

</details>

### Pipeline information

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

<details markdown="1">
<summary>Output files</summary>

@@ -67,5 +136,3 @@ Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQ
- Parameters used by the pipeline run: `params.json`.

</details>

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.