DSL2: library merge #1043

Merged: 15 commits, Feb 2, 2024
8 changes: 4 additions & 4 deletions .devcontainer/devcontainer.json
@@ -18,11 +18,11 @@
"python.linting.flake8Path": "/opt/conda/bin/flake8",
"python.linting.pycodestylePath": "/opt/conda/bin/pycodestyle",
"python.linting.pydocstylePath": "/opt/conda/bin/pydocstyle",
- "python.linting.pylintPath": "/opt/conda/bin/pylint",
+ "python.linting.pylintPath": "/opt/conda/bin/pylint"
},

// Add the IDs of extensions you want installed when the container is created.
- "extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"],
- },
- },
+ "extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]
+ }
+ }
}
2 changes: 1 addition & 1 deletion .github/workflows/download_pipeline.yml
@@ -64,4 +64,4 @@ jobs:
env:
NXF_SINGULARITY_CACHEDIR: ./
NXF_SINGULARITY_HOME_MOUNT: true
- run: nextflow run ./${{ env.REPOTITLE_LOWERCASE }}/$( sed 's/\W/_/g' <<< ${{ env.REPO_BRANCH }}) -stub -profile test,singularity --outdir ./results
+ run: nextflow run ./${{ env.REPOTITLE_LOWERCASE }}/$( sed 's/\W/_/g' <<< ${{ env.REPO_BRANCH }}) -stub -profile test,singularity --outdir ./results
1 change: 1 addition & 0 deletions .prettierignore
@@ -12,3 +12,4 @@ testing*
bin/
*.cff
test/
+ dev_docs.md
43 changes: 43 additions & 0 deletions conf/modules.config
@@ -883,4 +883,47 @@ process {
enabled: true
]
}

//
// LIBRARY MERGE
//

withName: SAMTOOLS_MERGE_LIBRARIES {
tag = { "${meta.reference}|${meta.sample_id}" }
ext.prefix = { "${meta.sample_id}_${meta.reference}_unsorted" }
publishDir = [
enabled: false
]
}

withName: SAMTOOLS_SORT_MERGED_LIBRARIES {
tag = { "${meta.reference}|${meta.sample_id}" }
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
publishDir = [
path: { "${params.outdir}/final_bams/" },
mode: params.publish_dir_mode,
pattern: '*.bam'
]
}

withName: SAMTOOLS_INDEX_MERGED_LIBRARIES {
tag = { "${meta.reference}|${meta.sample_id}" }
ext.args = { params.fasta_largeref ? "-c" : "" }
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
publishDir = [
path: { "${params.outdir}/final_bams/" },
mode: params.publish_dir_mode,
pattern: '*.{bai,csi}'
]
}

withName: SAMTOOLS_FLAGSTAT_MERGED_LIBRARIES {
tag = { "${meta.reference}|${meta.sample_id}" }
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
publishDir = [
path: { "${params.outdir}/final_bams/" },
mode: params.publish_dir_mode,
pattern: '*.flagstat'
]
}
}
38 changes: 14 additions & 24 deletions docs/development/dev_docs.md
@@ -9,36 +9,26 @@ To add new input files or options to the reference sheet, you have to complete a
1. Add your new parameter to nextflow.config.
2. Add parameter description to schema (nf-core schema build).
3. Read in new parameter (params.<NEW>) as input within the reference_indexing_single local subworkflow.
-    1. Add new line to the large `.map{}` operation starting on [line 80(https://github.com/nf-core/eager/blob/d4211582f349cc30c88202c12942218f99006041/subworkflows/local/reference_indexing_single.nf#L80)] and add check if the file exists.
-       `def <PARAM_NAME> = params.<NEW> != null ? file(params.<NEW>, checkIfExists: true ) : ""`
-    2. Add <PARAM_NAME> to the result of the map operation. Double-check the order!
-    3. With the `ch_ref_index_single.multiMap{}` below you add the reference name as a meta. You can also combine your new parameter with others if useful for the workflow step.
-       `<NEW_SUBCHANNE>: [ meta, <PARAM_NAME> ]`
-    4. Add your ch_ref_index_single.<NEW_SUBCHANNEL> to the final emit.
-       `<NEW_EMIT> = ch_ref_index_single.<NEW_SUBCHANNEL>`
+    1. Add a new line to the large `.map{}` operation starting on [line 80](https://github.com/nf-core/eager/blob/d4211582f349cc30c88202c12942218f99006041/subworkflows/local/reference_indexing_single.nf#L80) and add a check that the file exists. `def <PARAM_NAME> = params.<NEW> != null ? file(params.<NEW>, checkIfExists: true ) : ""`
+    2. Add <PARAM_NAME> to the result of the `.map{}` operation. Double-check the order!
+    3. With the `ch_ref_index_single.multiMap{}` below you add the reference name as a meta. You can also combine your new parameter with others if useful for the workflow step. `<NEW_SUBCHANNEL>: [ meta, <PARAM_NAME> ]`
+    4. Add your ch_ref_index_single.<NEW_SUBCHANNEL> to the final emit. `<NEW_EMIT> = ch_ref_index_single.<NEW_SUBCHANNEL>`
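Putting those steps together, a minimal sketch of the single-reference wiring, using a hypothetical `params.contamination_bed` parameter (all names below are illustrative placeholders, not the actual subworkflow code):

```nextflow
// Sketch only: a hypothetical params.contamination_bed wired through the
// single-reference route. Channel and parameter names are placeholders.
workflow REFERENCE_INDEXING_SINGLE {
    take:
    ch_reference // [ [ meta ], fasta ]

    main:
    ch_ref_index_single = ch_reference
        .map { meta, fasta ->
            // Check the file exists; "" marks "not provided" for a later .filter{}
            def contamination_bed = params.contamination_bed != null ? file( params.contamination_bed, checkIfExists: true ) : ""
            [ meta, fasta, contamination_bed ] // double-check the order!
        }
        .multiMap { meta, fasta, contamination_bed ->
            reference:         [ meta, fasta ]
            contamination_bed: [ meta, contamination_bed ] // the new subchannel
        }

    emit:
    reference         = ch_ref_index_single.reference
    contamination_bed = ch_ref_index_single.contamination_bed
}
```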

### Multi-reference input workflow

1. Add a new column named <SOFTWARE_FILETYPE> and test data to the test reference sheet (https://github.com/nf-core/test-datasets/blob/eager/reference/reference_sheet_multiref.csv).
2. Read in new input within the reference_indexing_multi local subworkflow.
-    1. Add new line to the large `.map{}` operation starting on [line 30](https://github.com/nf-core/eager/blob/d4211582f349cc30c88202c12942218f99006041/subworkflows/local/reference_indexing_multi.nf#L30). Add check if the file exists if appropriate.
-       `def <PARAM_NAME> = row["<SOFTWARE_FILETYPE>"] != "" ? file(row["<SOFTWARE_FILETYPE>"], checkIfExists: true) : ""`
-    2. Add <PARAM_NAME> to the result of the `.map{}` operation. Double-check the order!
-    3. With the `ch_input_from_referencesheet.multiMap{}` below you add the reference name as a meta. You can also combine your new parameter with others if useful for the workflow step.
-       `<NEW_SUBCHANNEL>: [ meta, <PARAM_NAME> ]`
-    4. Add ch_input_from_referencesheet.<NEW_SUBCHANNEL> to the final emit.
-       `<NEW_EMIT> = ch_input_from_referencesheet.<NEW_SUBCHANNEL>`
+    1. Add a new line to the large `.map{}` operation starting on [line 30](https://github.com/nf-core/eager/blob/d4211582f349cc30c88202c12942218f99006041/subworkflows/local/reference_indexing_multi.nf#L30). Add a check that the file exists, if appropriate. `def <PARAM_NAME> = row["<SOFTWARE_FILETYPE>"] != "" ? file(row["<SOFTWARE_FILETYPE>"], checkIfExists: true) : ""`
+    2. Add <PARAM_NAME> to the result of the `.map{}` operation. Double-check the order!
+    3. With the `ch_input_from_referencesheet.multiMap{}` below you add the reference name as a meta. You can also combine your new parameter with others if useful for the workflow step. `<NEW_SUBCHANNEL>: [ meta, <PARAM_NAME> ]`
+    4. Add ch_input_from_referencesheet.<NEW_SUBCHANNEL> to the final emit. `<NEW_EMIT> = ch_input_from_referencesheet.<NEW_SUBCHANNEL>`
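The analogous sketch for the reference sheet route, again with placeholder names (a hypothetical `contamination_bed` column; the upstream channel name is also invented):

```nextflow
// Sketch only: reading a hypothetical "contamination_bed" column per row of
// the reference sheet. Channel and column names are placeholders.
ch_input_from_referencesheet = ch_referencesheet_rows // one row map per reference
    .map { row ->
        def meta              = [ reference: row["reference_name"] ]
        def fasta             = file( row["fasta"], checkIfExists: true )
        // An empty cell means "not provided", so only check existence when set
        def contamination_bed = row["contamination_bed"] != "" ? file( row["contamination_bed"], checkIfExists: true ) : ""
        [ meta, fasta, contamination_bed ] // double-check the order!
    }
    .multiMap { meta, fasta, contamination_bed ->
        reference:         [ meta, fasta ]
        contamination_bed: [ meta, contamination_bed ] // the new subchannel
    }
```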

### Combining in the Reference Indexing workflow

1. Add your new parameter channel to the `if` condition selecting between the direct parameter input or the reference sheet input.
-    1. below "REFERENCE_INDEXING_MULTI" for reference sheet input
-       `<NEW_CHANNEL> = REFERENCE_INDEXING_MULTI.out.<NEW_EMIT>`
-    2. below "REFERENCE_INDEXING_SINGLE"
-       `<NEW_CHANNEL> = REFERENCE_INDEXING_SINGLE.out.<NEW_EMIT>`
-    3. Filter out options that have not been provided.
-       `<NEW_CHANNEL> = <NEW_CHANNEL>.filter{ it[1] != "" }`
-    4. Add unzipping of zipped input files with GUNZIP.
-    5. Add <NEW_CHANNEL> to the final emit.
-       `<NEW_EMIT> = <NEW_CHANNEL>`
-    6. Call new inputs within the main eager.nf with `REFERENCE_INDEXING.out.<NEW_EMIT>`.
+    1. Below "REFERENCE_INDEXING_MULTI" for reference sheet input: `<NEW_CHANNEL> = REFERENCE_INDEXING_MULTI.out.<NEW_EMIT>`
+    2. Below "REFERENCE_INDEXING_SINGLE" for direct parameter input: `<NEW_CHANNEL> = REFERENCE_INDEXING_SINGLE.out.<NEW_EMIT>`
+    3. Filter out options that have not been provided. `<NEW_CHANNEL> = <NEW_CHANNEL>.filter{ it[1] != "" }`
+    4. Add unzipping of zipped input files with GUNZIP.
+    5. Add <NEW_CHANNEL> to the final emit. `<NEW_EMIT> = <NEW_CHANNEL>`
+    6. Call new inputs within the main eager.nf with `REFERENCE_INDEXING.out.<NEW_EMIT>`.
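A sketch of this combining step, continuing the same placeholder example; `reference_sheet_input` stands in for the condition that already exists in the subworkflow:

```nextflow
// Sketch only: pick the channel from whichever route ran, drop unset values,
// and pass it on. All names are placeholders.
if ( reference_sheet_input ) {
    ch_contamination_bed = REFERENCE_INDEXING_MULTI.out.contamination_bed
} else {
    ch_contamination_bed = REFERENCE_INDEXING_SINGLE.out.contamination_bed
}

// Filter out references for which the option was not provided ("" sentinel)
ch_contamination_bed = ch_contamination_bed.filter { it[1] != "" }

emit:
contamination_bed = ch_contamination_bed
```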
57 changes: 57 additions & 0 deletions docs/development/manual_tests.md
@@ -102,6 +102,14 @@ Tool Specific combinations

- All together

- Library merge

- single reference: no damage manipulation ✅
- single reference: with damage manipulation, on raw data ✅
- single reference: with damage manipulation (trimming), on trimmed data ✅
- single reference: with damage manipulation (pmd + trimming), on pmd filtered data ✅
- multi reference: no damage manipulation ✅

### Multi-reference tests

```bash
@@ -667,3 +675,52 @@ nextflow run . -profile test,docker \
--damage_manipulation_bamutils_trim_double_stranded_half_udg_left 1 \
--damage_manipulation_bamutils_trim_double_stranded_half_udg_right 2
```

### LIBRARY_MERGE

```bash
## Library merge on single reference, no damage manipulation.
## EXPECT: 1 bam.bai/flagstat set per sample/reference combination. 6 files total.
## Check the headers of the bams to ensure that the correct number of bams are merged (1 for JK2802, 2 for JK2782).
## Also, check that the bams merged are the deduplication output.
## NOTE: JK2782 seems to have some PG tags repeated, as they apply to each input file separately.
nextflow run main.nf -profile test,docker --outdir ./results -w work/ -resume --genotyping_source 'raw' -ansi-log false -dump-channels
```

```bash
## Library merge on single reference, merge trimmed bams.
## EXPECT: 1 bam.bai/flagstat set per sample/reference combination. 6 files total.
## Check the headers of the bams to ensure that the correct number of bams are merged (1 for JK2802, 2 for JK2782).
## Also, check that the bams merged are trimmed. (JK2802 is full udg, but header confirms merged bam is "trimmed")
## NOTE: JK2782 seems to have some PG tags repeated, as they apply to each input file separately.
nextflow run main.nf -profile test,docker --outdir ./results -w work/ -resume --genotyping_source 'trimmed' -ansi-log false -dump-channels \
--run_trim_bam \
--damage_manipulation_bamutils_trim_double_stranded_none_udg_left 5 \
--damage_manipulation_bamutils_trim_double_stranded_none_udg_right 7 \
--damage_manipulation_bamutils_trim_double_stranded_half_udg_left 1 \
--damage_manipulation_bamutils_trim_double_stranded_half_udg_right 2
```

```bash
## Library merge on single reference, merge pmd bams. Trimming is run but its output is not used downstream.
## EXPECT: 1 bam.bai/flagstat set per sample/reference combination. 6 files total.
## Check the headers of the bams to ensure that the correct number of bams are merged (1 for JK2802, 2 for JK2782).
## Also, check that the bams merged are the pmd ones.
## NOTE: JK2782 seems to have some PG tags repeated, as they apply to each input file separately.
nextflow run main.nf -profile test,docker --outdir ./results -w work/ -resume --genotyping_source 'pmd' -ansi-log false -dump-channels \
--run_trim_bam \
--run_pmd_filtering \
--damage_manipulation_bamutils_trim_double_stranded_none_udg_left 5 \
--damage_manipulation_bamutils_trim_double_stranded_none_udg_right 7 \
--damage_manipulation_bamutils_trim_double_stranded_half_udg_left 1 \
--damage_manipulation_bamutils_trim_double_stranded_half_udg_right 2
```

```bash
## Library merge on multi reference. No damage manipulation.
## EXPECT: 1 bam.bai/flagstat set per sample/reference combination. 15 files total. (2 refs * 2 samples * 3 files) + BAM input only on one reference (+3)
## Check the headers of the bams to ensure that the correct number of bams are merged (1 for JK2802, 2 for JK2782).
## Also, check that the bams merged are the dedupped ones.
## NOTE: PG tags are repeated for each chromosome in the reference, times each library! Maybe there's some flag missing from samtools MERGE runs?
nextflow run main.nf -profile test_multiref,docker --outdir ./results -w work/ -resume --genotyping_source 'raw' -ansi-log false -dump-channels
```
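On the repeated @PG tag notes above: `samtools merge` has `-c` and `-p` options that combine @RG and @PG header records with colliding IDs instead of copying them once per input file. Whether that also resolves the per-chromosome repetition seen in the multi-reference test is untested; a sketch of how it could be passed via modules.config:

```nextflow
// Untested sketch: collapse colliding @RG (-c) and @PG (-p) header IDs
// when merging, rather than repeating them per input bam.
withName: SAMTOOLS_MERGE_LIBRARIES {
    ext.args = '-c -p'
}
```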
52 changes: 52 additions & 0 deletions subworkflows/local/merge_libraries.nf
@@ -0,0 +1,52 @@
//
// Merge libraries of the same sample, then sort, index, and flagstat the merged bam
//

include { SAMTOOLS_MERGE as SAMTOOLS_MERGE_LIBRARIES } from '../../modules/nf-core/samtools/merge/main'
include { SAMTOOLS_SORT as SAMTOOLS_SORT_MERGED_LIBRARIES } from '../../modules/nf-core/samtools/sort/main'
include { SAMTOOLS_INDEX as SAMTOOLS_INDEX_MERGED_LIBRARIES } from '../../modules/nf-core/samtools/index/main'
include { SAMTOOLS_FLAGSTAT as SAMTOOLS_FLAGSTAT_MERGED_LIBRARIES } from '../../modules/nf-core/samtools/flagstat/main'

workflow MERGE_LIBRARIES {
take:
ch_bam_bai // [ [ meta ], bam , bai ]

main:
ch_versions = Channel.empty()
ch_multiqc_files = Channel.empty()

ch_library_merge_input = ch_bam_bai
.map { WorkflowEager.addNewMetaFromAttributes( it, ["id", "sample_id", "strandedness", "reference"], ["id", "sample_id", "strandedness", "reference"], false ) }
.groupTuple(by: 0)
// Discard library-level metas and bais. Add single_end: true to all metas (no SE/PE distinction from here on)
.map {
meta, lib_metas, bam, bai ->
[ meta + [ 'single_end': true ], bam ]
}

SAMTOOLS_MERGE_LIBRARIES ( ch_library_merge_input, [[], []], [[], []] )
ch_versions = ch_versions.mix( SAMTOOLS_MERGE_LIBRARIES.out.versions.first() )

SAMTOOLS_SORT_MERGED_LIBRARIES ( SAMTOOLS_MERGE_LIBRARIES.out.bam )
ch_versions = ch_versions.mix( SAMTOOLS_SORT_MERGED_LIBRARIES.out.versions.first() )

SAMTOOLS_INDEX_MERGED_LIBRARIES ( SAMTOOLS_SORT_MERGED_LIBRARIES.out.bam )
ch_versions = ch_versions.mix( SAMTOOLS_INDEX_MERGED_LIBRARIES.out.versions.first() )

// Join merged sample-level bams and their bais for genotyping
ch_merged_bams = SAMTOOLS_SORT_MERGED_LIBRARIES.out.bam
.join( SAMTOOLS_INDEX_MERGED_LIBRARIES.out.bai )

// Not sure if FLAGSTAT is really needed, but added here for completeness
SAMTOOLS_FLAGSTAT_MERGED_LIBRARIES ( ch_merged_bams )
ch_versions = ch_versions.mix( SAMTOOLS_FLAGSTAT_MERGED_LIBRARIES.out.versions.first() )

ch_merged_flagstat = SAMTOOLS_FLAGSTAT_MERGED_LIBRARIES.out.flagstat
ch_multiqc_files = ch_multiqc_files.mix( SAMTOOLS_FLAGSTAT_MERGED_LIBRARIES.out.flagstat )

emit:
bam_bai = ch_merged_bams // [ [ meta ], bam , bai ]
flagstat = ch_merged_flagstat // [ [ meta ], flagstat ]
versions = ch_versions
mqc = ch_multiqc_files // Same as flagstat
}
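To make the grouping step concrete, here is a standalone toy of the `groupTuple` + `map` reshaping used above (values invented; the real code first derives the sample-level grouping meta with `WorkflowEager.addNewMetaFromAttributes`, so its tuples also carry the original library metas):

```nextflow
// Toy demonstration only: two libraries of one sample collapse to a single
// sample-level element carrying a list of bams, with bais dropped.
workflow {
    Channel.of(
        [ [ sample_id:'JK2782', reference:'Mammoth_MT' ], 'lib1.bam', 'lib1.bai' ],
        [ [ sample_id:'JK2782', reference:'Mammoth_MT' ], 'lib2.bam', 'lib2.bai' ]
    )
    .groupTuple(by: 0)                                                 // group on the meta map
    .map { meta, bams, bais -> [ meta + [ single_end:true ], bams ] } // keep bams, mark SE
    .view()
    // [[sample_id:JK2782, reference:Mammoth_MT, single_end:true], [lib1.bam, lib2.bam]]
}
```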
16 changes: 14 additions & 2 deletions workflows/eager.nf
@@ -71,6 +71,7 @@ include { MANIPULATE_DAMAGE } from '../subworkflows/local/manipulate
include { METAGENOMICS_COMPLEXITYFILTER } from '../subworkflows/local/metagenomics_complexityfilter'
include { ESTIMATE_CONTAMINATION } from '../subworkflows/local/estimate_contamination'
include { CALCULATE_DAMAGE } from '../subworkflows/local/calculate_damage'
+ include { MERGE_LIBRARIES } from '../subworkflows/local/merge_libraries'

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -530,11 +531,22 @@ workflow EAGER {
MANIPULATE_DAMAGE( ch_dedupped_bams, ch_fasta_for_deduplication.fasta, REFERENCE_INDEXING.out.pmd_masking )
ch_multiqc_files = ch_multiqc_files.mix( MANIPULATE_DAMAGE.out.flagstat.collect{it[1]}.ifEmpty([]) )
ch_versions = ch_versions.mix( MANIPULATE_DAMAGE.out.versions )
- ch_bams_for_genotyping = params.genotyping_source == 'rescaled' ? MANIPULATE_DAMAGE.out.rescaled : params.genotyping_source == 'pmd' ? MANIPULATE_DAMAGE.out.filtered : params.genotyping_source == 'trimmed' ? MANIPULATE_DAMAGE.out.trimmed : ch_dedupped_bams
+ ch_bams_for_library_merge = params.genotyping_source == 'rescaled' ? MANIPULATE_DAMAGE.out.rescaled : params.genotyping_source == 'pmd' ? MANIPULATE_DAMAGE.out.filtered : params.genotyping_source == 'trimmed' ? MANIPULATE_DAMAGE.out.trimmed : ch_dedupped_bams
} else {
- ch_bams_for_genotyping = ch_dedupped_bams
+ ch_bams_for_library_merge = ch_dedupped_bams
}

+ //
+ // SUBWORKFLOW: MERGE LIBRARIES
+ //
+
+ // The bams being merged are always the ones specified by params.genotyping_source,
+ // unless the user skipped damage manipulation, in which case it is the DEDUPLICATION output.
+ MERGE_LIBRARIES ( ch_bams_for_library_merge )
+ ch_versions = ch_versions.mix( MERGE_LIBRARIES.out.versions )
+ ch_bams_for_genotyping = MERGE_LIBRARIES.out.bam_bai
+ ch_multiqc_files = ch_multiqc_files.mix( MERGE_LIBRARIES.out.mqc.collect{it[1]}.ifEmpty([]) ) // Not sure if this is needed, or if it needs to be moved to line 564?

//
// MODULE: MultiQC
//