Skip to content

Commit

Permalink
Merge pull request #1005 from nf-core/dsl2-add-bedtools-coverage
Browse files Browse the repository at this point in the history
adding bedtool coverage from eager 2
  • Loading branch information
jfy133 authored Jul 21, 2023
2 parents 6d107c4 + b90d52f commit 4bf74d5
Show file tree
Hide file tree
Showing 11 changed files with 262 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CITATIONS.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,10 @@

> Bushnell B, Rood J, Singer E (2017) BBMerge – Accurate paired shotgun read merging via overlap. PLOS ONE 12(10): e0185056. [https://doi.org/10.1371/journal.pone.0185056](https://doi.org/10.1371/journal.pone.0185056)
- [BEDTools](https://doi.org/10.1093/bioinformatics/btq033)

> Quinlan, A. R., & Hall, I. M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. [https://doi.org/10.1093/bioinformatics/btq033](https://doi.org/10.1093/bioinformatics/btq033)
- [PreSeq](https://doi.org/10.1038/nmeth.2375)

> Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325–327. doi: [10.1038/nmeth.2375](https://doi.org/10.1038/nmeth.2375)
Expand Down
31 changes: 31 additions & 0 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -613,6 +613,37 @@ process {
]
}

//
// BEDTOOLS_COVERAGE
//
withName: SAMTOOLS_VIEW_GENOME {
tag = { "${meta.reference}|${meta.sample_id}_${meta.library_id}" }
publishDir = [
enabled: false
]
}

withName: BEDTOOLS_COVERAGE_DEPTH {
tag = { "${meta.reference}|${meta.sample_id}_${meta.library_id}" }
ext.args = '-mean -nonamecheck'
ext.prefix = { "${meta.sample_id}_${meta.library_id}_${meta.reference}_depth" }
publishDir = [
path: { "${params.outdir}/mapstats/bedtools" },
mode: params.publish_dir_mode
]
}

withName: BEDTOOLS_COVERAGE_BREADTH {
tag = { "${meta.reference}|${meta.sample_id}_${meta.library_id}" }
ext.args = '-nonamecheck'
ext.prefix = { "${meta.sample_id}_${meta.library_id}_${meta.reference}_breadth" }
publishDir = [
path: { "${params.outdir}/mapstats/bedtools" },
mode: params.publish_dir_mode
]
}


//
// DAMAGE MANIPULATION
//
Expand Down
4 changes: 4 additions & 0 deletions conf/test.config
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,10 @@ params {
bamfiltering_minreadlength = 30
bamfiltering_mappingquality = 37

// Map Stats
run_bedtools_coverage = true
mapstats_bedtools_featurefile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.gff3'

// Metagenomic screening
run_metagenomicscreening = false

Expand Down
20 changes: 20 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -377,6 +377,26 @@ The resulting histogram file will contain estimated deduplication statistics at

These curves will be displayed in the pipeline run's MultiQC report, however you can also use this file for plotting yourself for further exploration e.g. in R to find your sequencing target depth.

### Mapping Statistics

#### Bedtools

<details markdown="1">
<summary>Output file</summary>

- `mapstats/bedtools/`

- `*.breadth.gz`: This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature.
- `*.depth.gz`: This file will have the the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).

</details>

[bedtools](https://github.com/arq5x/bedtools2) utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. Bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

The `bedtools coverage` tool computes both the depth and breadth of coverage of features in file B (alignment file) on the features in file A (provied by `--mapstats_bedtools_featurefile` when running the eager workflow). One advantage that bedtools coverage offers is that it not only counts the number of features that overlap an interval in file A, it also computes the fraction of bases in the interval in A that were overlapped by one or more features. Thus, bedtools coverage also computes the breadth of coverage observed for each interval in A.

The output from this module can be useful for things such as checking for the presence/absence of virulence factors in ancient pathogen genomes, or getting statistics on SNP capture positions.

### Damage Manipulation

There are three different options for manipulation of ancient DNA damage.
Expand Down
5 changes: 5 additions & 0 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,11 @@
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"bedtools/coverage": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"bowtie2/align": {
"branch": "master",
"git_sha": "603ecbd9f45300c9788f197d2a15a005685b4220",
Expand Down
34 changes: 34 additions & 0 deletions modules/local/samtools_view_genome.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
process SAMTOOLS_VIEW_GENOME {
tag "$meta.id"
label 'process_low'

conda "bioconda::samtools=1.17"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/samtools:1.17--h00cdaf9_0' :
'biocontainers/samtools:1.17--h00cdaf9_0' }"

input:
tuple val(meta), path(input), path(index)

output:
path "genome.txt", emit: genome
path "versions.yml", emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
"""
samtools \\
view \\
--threads ${task.cpus-1} \\
-H \\
$input | grep '@SQ' | sed 's#@SQ\tSN:\\|LN:##g' > genome.txt
cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
39 changes: 39 additions & 0 deletions modules/nf-core/bedtools/coverage/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

58 changes: 58 additions & 0 deletions modules/nf-core/bedtools/coverage/meta.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,10 @@ params {
max_multiqc_email_size = '25.MB'
multiqc_methods_description = null

// bedtools options
run_bedtools_coverage = false
mapstats_bedtools_featurefile = null

// Boilerplate options
outdir = null
publish_dir_mode = 'copy'
Expand Down
25 changes: 25 additions & 0 deletions nextflow_schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -998,6 +998,28 @@
}
}
},
"feature_annotation_statistics": {
"title": "Feature Annotation Statistics",
"type": "object",
"description": "Options for getting reference annotation statistics (e.g. gene coverages)",
"default": "",
"properties": {
"run_bedtools_coverage": {
"type": "boolean",
"description": "Turn on ability to calculate no. reads, depth and breadth coverage of features in reference.",
"fa_icon": "fas fa-chart-area",
"help_text": "Specifies to turn on the bedtools module, producing statistics for breadth (or percent coverage), and depth (or X fold) coverages.\n\n> Modifies tool parameter(s):\n- bedtools coverage: `-mean`"
},
"mapstats_bedtools_featurefile": {
"type": "string",
"description": "Path to GFF or BED file containing positions of features in reference file (--fasta). Path should be enclosed in quotes.",
"fa_icon": "fas fa-file-signature",
"help_text": "Specify the path to a GFF/BED containing the feature coordinates (or any acceptable input for [`bedtools coverage`](https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)). Must be in quotes.\n"
}
},
"fa_icon": "fas fa-scroll",
"help_text": "If you're interested in looking at coverage stats for certain features on your\nreference such as genes, SNPs etc., you can use the following bedtools module\nfor this purpose.\n\nMore documentation on bedtools can be seen in the [bedtools\ndocumentation](https://bedtools.readthedocs.io/en/latest/)\n\nIf using TSV input, bedtools is run after library merging of same-named library\nBAMs that have the same type of UDG treatment.\n"
},
"host_removal": {
"title": "Host Removal",
"type": "object",
Expand Down Expand Up @@ -1133,6 +1155,9 @@
},
{
"$ref": "#/definitions/host_removal"
},
{
"$ref": "#/definitions/feature_annotation_statistics"
}
]
}
39 changes: 38 additions & 1 deletion workflows/eager.nf
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,11 @@ if ( params.metagenomics_complexity_tool == 'prinseq' && params.metagenomics_pri
exit 1, ("[nf-core/eager] ERROR: Metagenomics: You picked PRINSEQ++ with 'entropy' mode but provided a dust score. Please specify an entropy filter threshold using the --metagenomics_complexity_entropy flag")
}
}
if( params.run_bedtools_coverage ){
if( !params.mapstats_bedtools_featurefile ) {
exit 1, "[nf-core/eager] ERROR: you have turned on bedtools coverage, but not specified a BED or GFF file with --mapstats_bedtools_featurefile. Please validate your parameters."
}
}

// TODO What to do when params.preprocessing_excludeunmerged is provided but the data is SE?
if ( params.deduplication_tool == 'dedup' && ! params.preprocessing_excludeunmerged ) { exit 1, "[nf-core/eager] ERROR: Dedup can only be used on collapsed (i.e. merged) PE reads. For all other cases, please set --deduplication_tool to 'markduplicates'."}
Expand Down Expand Up @@ -81,6 +86,7 @@ include { CALCULATE_DAMAGE } from '../subworkflows/local/calculate_
//
// MODULE: Installed directly from nf-core/modules
//

include { FASTQC } from '../modules/nf-core/fastqc/main'
include { MULTIQC } from '../modules/nf-core/multiqc/main'
include { CUSTOM_DUMPSOFTWAREVERSIONS } from '../modules/nf-core/custom/dumpsoftwareversions/main'
Expand All @@ -92,6 +98,8 @@ include { MTNUCRATIO } from '../modules/n
include { HOST_REMOVAL } from '../modules/local/host_removal'
include { ENDORSPY } from '../modules/nf-core/endorspy/main'
include { SAMTOOLS_FLAGSTAT as SAMTOOLS_FLAGSTATS_BAM_INPUT } from '../modules/nf-core/samtools/flagstat/main'
include { BEDTOOLS_COVERAGE as BEDTOOLS_COVERAGE_DEPTH ; BEDTOOLS_COVERAGE as BEDTOOLS_COVERAGE_BREADTH } from '../modules/nf-core/bedtools/coverage/main'
include { SAMTOOLS_VIEW_GENOME } from '../modules/local/samtools_view_genome.nf'

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand Down Expand Up @@ -385,7 +393,36 @@ workflow EAGER {
ch_versions = ch_versions.mix( PRESEQ_LCEXTRAP.out.versions )
}

//

//
// MODULE: Bedtools coverage
//

if ( params.run_bedtools_coverage ) {

ch_anno_for_bedtools = Channel.fromPath(params.mapstats_bedtools_featurefile, checkIfExists: true).collect()

ch_dedupped_for_bedtools = ch_dedupped_bams.combine(ch_anno_for_bedtools)
.map{
meta, bam, bai, anno ->
[meta, anno, bam]
}

// Running samtools view to get header
SAMTOOLS_VIEW_GENOME(ch_dedupped_bams)

ch_genome_for_bedtools = SAMTOOLS_VIEW_GENOME.out.genome

BEDTOOLS_COVERAGE_BREADTH(ch_dedupped_for_bedtools, ch_genome_for_bedtools)
BEDTOOLS_COVERAGE_DEPTH(ch_dedupped_for_bedtools, ch_genome_for_bedtools)

ch_versions = ch_versions.mix( SAMTOOLS_VIEW_GENOME.out.versions )
ch_versions = ch_versions.mix( BEDTOOLS_COVERAGE_BREADTH.out.versions )
ch_versions = ch_versions.mix( BEDTOOLS_COVERAGE_DEPTH.out.versions )
}


//
// SUBWORKFLOW: Calculate Damage
//

Expand Down

0 comments on commit 4bf74d5

Please sign in to comment.