Rearrange processes and modules, add SV workflow for Delly2 #8

nwiltsie · 2024-07-29T18:59:07Z

Description

~~This addresses 99% of #7, although I still have one lingering issue in that the final VCF fails indexing.~~

Closes #7.

I've rearranged all of the processes to be more modular. There are now three high-level blocks: validation (common), feature extraction (SNV or SV), and stability prediction (common). There are now two NFTest cases (one each for the SNV and SV branches), but they remain smoke tests without any assertions.

I've also added a pipeline diagram. I waffled back-and-forth on this but ultimately ended up using Mermaid rather than PlantUML so that I could use those "Parameterized Input" bubbles. GitHub's UI renders Mermaid code blocks, but for the README I manually rendered an SVG. I'm intending to extend our PlantUML-rendering action to do the same thing.

%%{init: {"flowchart": {"htmlLabels": false}} }%%

flowchart TD

  classDef input fill:#ffffb3
  classDef output fill:#b3de69
  classDef gatk fill:#bebada
  classDef bcftools fill:#fdb462
  classDef R fill:#8dd3c7
  classDef linux fill:#fb8072

  subgraph legend ["`**Legend**`"]
      direction RL
    subgraph nodes ["`**Nodes**`"]
      input[["Input File"]]:::input
      input_node(["Parameterized Input"]):::input
      output[["Output file"]]:::output
    end

    subgraph processes ["`**Processes**`"]
      gatk_docker[GATK]:::gatk
      bcftools_docker[bcftools]:::bcftools
      r_docker[Rscript]:::R
      linux_docker[Generic Linux]:::linux
    end
  end

  legend
  ~~~ input_vcf[["Input VCF"]]:::input
  --> pipeval:::linux
  --> sv_vs_snv{{Variant Caller?}}

  sv_vs_snv ------> r_liftover
  header_contigs .-> r_liftover
  chain_file2 ..-> r_liftover
  gnomad_rds .-> r_extract_sv

  subgraph SV ["`**Delly2**`"]
    %% Other input files
    header_contigs([header_contigs]):::input
    chain_file2([chain_file]):::input
    gnomad_rds([gnomad_rds]):::input

    r_liftover[liftover-Delly2-vcf.R]:::R
    ---> r_extract_sv[extract-VCF-features-SV.R]:::R

  end

  chain_file .-> bcftools_liftover
  sv_vs_snv --> bcftools_liftover

  subgraph SNV ["`**Mutect2, HaplotypeCaller, Strelka2, Muse2, SomaticSniper**`"]
    funcotator_sources([funcotator_sources]):::input
    chain_file([chain_file]):::input
    repeat_bed([repeat_bed]):::input

    bcftools_liftover[bcftools +liftover]:::bcftools
    ---> gatk_func[gatk Funcotator]:::gatk
    --> bcftools_annotate["`bcftools annotate*RepeatMasker*`"]:::bcftools
    --> bcftools_annotate2["`bcftools annotate*Trinucleotide*`"]:::bcftools
    --> r_extract_snv[extract-VCF-features.R]:::R
  end

  funcotator_sources .-> gatk_func
  repeat_bed .-> bcftools_annotate

  joinpaths{ }
  r_extract_snv --> joinpaths
  r_extract_sv --> joinpaths
  joinpaths ---> r_predict_stability

  subgraph Predict Stability ["`&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Predict Stability**`"]
    r_predict_stability[predict-liftover-stability.R]:::R
    --> bcftools_annotate3["`bcftools annotate*Stability*`"]:::bcftools

    rf_model([rf_model]):::input .-> r_predict_stability
  end

  bcftools_annotate3 --> output_vcfs[["Output VCFs"]]:::output
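
As a rough illustration of the branching in the diagram above, here is a minimal Python sketch (not the pipeline's actual Nextflow code) of how a variant caller maps to the SV or SNV feature-extraction branch. The caller names come from the diagram's subgraph titles; `SV_CALLERS`, `SNV_CALLERS`, `select_branch`, and `plan_steps` are hypothetical names used only for this example.

```python
# Caller names are taken from the pipeline diagram; the constant and
# function names are hypothetical and not part of the pipeline.
SV_CALLERS = {"Delly2"}
SNV_CALLERS = {"Mutect2", "HaplotypeCaller", "Strelka2", "Muse2", "SomaticSniper"}


def select_branch(variant_caller: str) -> str:
    """Return "SV" or "SNV" for a known caller, else raise ValueError."""
    if variant_caller in SV_CALLERS:
        return "SV"
    if variant_caller in SNV_CALLERS:
        return "SNV"
    raise ValueError(f"Unsupported variant caller: {variant_caller!r}")


def plan_steps(variant_caller: str) -> list[str]:
    """List the three high-level blocks in execution order."""
    branch = select_branch(variant_caller)
    return ["validation", f"feature extraction ({branch})", "stability prediction"]
```

Validation and stability prediction are shared, so only the middle block depends on the caller.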

Testing Results

Checklist

I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)
I have tested the pipeline on at least one A-mini sample.

nwiltsie · 2024-07-29T19:24:16Z

I'll also note that R package management is a nightmare for reproducibility. I finally hit on using renv, which allowed me to version-pin all of the packages. Bioconductor came close, but it doesn't include all packages and allows for "bugfix" updates within a larger release:

> install() also nudges users to remain current within a release, by default checking for out-of-date packages and asking if the user would like to update.

nwiltsie · 2024-07-29T20:55:40Z

NFTest output (one of the tests failed): /hot/software/pipeline/pipeline-StableLift/Nextflow/development/unreleased/nwiltsie-regroup-modules/log-nftest-20240729T184802Z.log

nwiltsie · 2024-07-30T00:01:23Z

Thanks to @nkwang24, the SV test is now passing: /hot/software/pipeline/pipeline-StableLift/Nextflow/development/unreleased/nwiltsie-regroup-modules/log-nftest-20240729T233458Z.log.

yashpatel6 · 2024-07-31T22:14:33Z

I'll look over this later today or early tomorrow!

yashpatel6

A general note: we'll want to include the main tool-level directory at the end here.

I also added a couple of suggestions for the process names added in this PR.

Dockerfile

config/schema.yaml

yashpatel6 · 2024-08-01T22:24:13Z

config/template.config

+    repeat_bed = "/hot/ref/database/RepeatMasker-3.0.1/processed/GRCh38/GRCh38_RepeatMasker_intervals.bed"
+
+    // SV files
+    // FIXME Should this be bundled?


clarification: Is this a question of whether the files should be bundled into the Docker image?

nwiltsie · 2024-08-01

Either into the Docker image, with the pipeline, or to a given reference path on disk. Put another way, is a user expected to (1) provide this file for each pipeline run, (2) have a standard copy locally, or (3) have it automatically provided for them by the pipeline?

nkwang24 · 2024-08-02

I'm thinking we can divide the various input files into the 3 categories you listed:

(1) RF models (6 tools x 2 conversion directions = 12 total @ ~10 MB - 1 GB) hosted separately for the user to download
(2) Expect the user to have standard resource files such as reference FASTAs, chain files, and Funcotator sources
(3) Bundle the non-standard resource files (repeat_bed, header_contigs, gnomad_rds) into the Docker image

Is this what you had in mind?

nwiltsie · 2024-08-02

Cool - so I think that works out as:

1. We upload RF models as attachments on pipeline releases.
2. Users handle standard resource files.
3. We bundle the non-standard resource files with the pipeline (the repeat_bed file is used outside of the Docker image). That means they get checked into this repository and version-controlled.

yashpatel6 · 2024-08-06

For the non-standard resource files, it may be better to include them as release attachments rather than version-control them.

nwiltsie · 2024-08-06

@yashpatel6 so you assert that there should be two categories of files?

1. User-provided standard files
2. Everything else, distributed as release attachments

yashpatel6 · 2024-08-06

That's what I would suggest, yes. My concern with bundling the non-standard files into the Docker image is the case where a user wants to change them or provide a different file. Having the user provide the paths in the config, as with the other resources, seems more consistent and allows for that.
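
If every resource ends up as a user-supplied path in the config, as suggested above, a pre-flight check could report which inputs are still unset for a given branch. The following is a minimal Python sketch under that assumption, not code from this PR: the per-branch input names are read off the pipeline diagram in the description, while `REQUIRED_INPUTS` and `missing_inputs` are hypothetical names.

```python
# Parameterized inputs per branch, as drawn in the PR's pipeline diagram.
# The dictionary and function below are illustrative only.
REQUIRED_INPUTS = {
    "SNV": ("chain_file", "funcotator_sources", "repeat_bed", "rf_model"),
    "SV": ("chain_file", "header_contigs", "gnomad_rds", "rf_model"),
}


def missing_inputs(branch: str, params: dict) -> list[str]:
    """Return the required input names that are absent or empty in params."""
    return [key for key in REQUIRED_INPUTS[branch] if not params.get(key)]


if __name__ == "__main__":
    # Hypothetical, incomplete SV configuration.
    example = {"chain_file": "/path/to/chain", "rf_model": "/path/to/model"}
    print(missing_inputs("SV", example))
```

The check is the same whether a file was downloaded from a release attachment or is a standard local resource, which is the consistency argument made in the thread.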

docs/pipeline.mmd

main.nf

module/predict_stability.nf

module/scripts/predict-liftover-stability.R

module/sv_workflow.nf

yashpatel6

Generally looks good! One remaining comment for the output directory:

pipeline-StableLift/config/methods.config

Line 12 in eb8983a

    
params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}"

To include StableLift-<manifest.version> at the end:

params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}/StableLift-<manifest.version>"

It seems a bit redundant in this case, since the main tool is the pipeline itself, but it brings this pipeline in line with the output structure the rest of our pipelines follow.
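
To make the resulting layout concrete, here is a small Python sketch that mirrors the suggested string interpolation. It illustrates the path shape only and is not the Nextflow config itself; the function name and the example values (output directory, manifest name, version, sample ID) are assumptions.

```python
def output_dir_base(output_dir: str, name: str, version: str, sample_id: str) -> str:
    """Compose the output path: pipeline level, sample level, then tool level."""
    # Spaces in the sample ID become underscores, as in the config expression.
    sample = sample_id.replace(" ", "_")
    return f"{output_dir}/{name}-{version}/{sample}/StableLift-{version}"


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the redundancy noted above.
    print(output_dir_base("/tmp/out", "StableLift", "1.0.0", "sample A"))
```

With a manifest name of StableLift, the pipeline-level and tool-level directories carry the same label, which is the redundancy mentioned above.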

nwiltsie · 2024-08-07T15:39:28Z

Okay @yashpatel6, I've added StableLift-${manifest.version} to the end of the output path (5f27a63) and figured out how to slightly reduce the version pin specificity (c9fde8c). Assuming that's all good can you give a final approval?

yashpatel6

Great! Looks good!

nwiltsie added 30 commits July 24, 2024 10:17

Rename workflow to snv_annotations, absorb Funcotator

3d1f16d

s/RepeatMasker-v3.0.1/RepeatMasker-3.0.1/

7b773ab

Use stablelift image from main

ad98a45

Add original copy of extract-vcf-features-SV.R

dae4905

Add --output-rds argument

5df8c97

Add workflow for SV

4716ac5

Refactor, support SV and SNV

8bbcfe2

Add stubs to all processes

0e499b3

Bugfix, need leading params

1f18dfe

Bugfix, remove module/ from relative path

83cd1e7

Remove redundant process

966788b

Bugfix, clean up an undefined stub variable

645f154

Bugfix, clean up more undefined stub variables

88ae361

Get rid of variables in utils module

04e7f26

Clean up variables in sv_workflow.nf

dfcb703

Clean up variables in snv_workflow.nf

257eb74

Clean up variables in snv_annotations.nf

01d3df7

Replace colons with slashes

3359f08

Combine intermediate files

162f22e

Rename NFTest case as SNV-specific

92a73a5

Add SV-specific NFTest, bugfix for parameters

68c7085

Bundle rtracklayer into Docker

4ae07bf

Group arguments in Dockerfile

8b87584

Small bugfixes

5a2745a

Pre-copy folder to standard path

7a48523

Remove quotes

2c4953f

Try a different mechanism to get library paths

f03a5af

Use branch version of image

98750e9

Bugfixes, test cleanup for SV case

642aa56

Add mermaid flow diagram

6df89f0

nwiltsie added 4 commits July 29, 2024 08:55

Add output at end of pipeline

971a713

Pull in latest changes to predict-liftover-stability.R

9291ff0

Bugfix, channel mis-match

729a970

Update CHANGELOG

5977e7e

nwiltsie requested a review from a team as a code owner July 29, 2024 18:59

Fix lints

8e10b3b

nwiltsie assigned nkwang24 Jul 29, 2024

yashpatel6 self-assigned this Jul 29, 2024

Sort VCF after liftover in SV branch

22ccbc2

yashpatel6 reviewed Aug 1, 2024

nwiltsie added 4 commits August 2, 2024 10:37

Reword 'Variant Caller' to 'Variant Type'

4701b01

Remove unused R function

a5570f4

s/run_sv_liftover/liftover_SV_StableLift/

b12a428

s/run_intersect_gnomad/annotate_gnomAD_StableLift/

eb8983a

yashpatel6 reviewed Aug 6, 2024

nwiltsie added 2 commits August 6, 2024 16:29

Add 'StableLift-${manifest.version}' to output_dir_base

5f27a63

Use wildcards for aptitude package build versions

c9fde8c

nwiltsie requested a review from yashpatel6 August 7, 2024 15:39

yashpatel6 approved these changes Aug 7, 2024

nwiltsie merged commit ca2b0ec into main Aug 7, 2024
8 checks passed

nwiltsie deleted the nwiltsie-regroup-modules branch August 7, 2024 16:57

nwiltsie mentioned this pull request Aug 20, 2024

Update GATK version #12

Merged


