Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rearrange processes and modules, add SV workflow for Delly2 #8

Merged
merged 42 commits into from
Aug 7, 2024

Conversation

nwiltsie
Copy link
Member

@nwiltsie nwiltsie commented Jul 29, 2024

Description

This addresses 99% of #7, although I still have one lingering issue in that the final VCF fails indexing.

Closes #7.

I've rearranged all of the processes to be more modular - there are now three high-level blocks of validation (common), feature extraction (SNV or SV), and stability prediction (common). There are now two NFTest cases (one each for the SNV and SV branches), but they remain smoke tests without any assertions.

I've also added a pipeline diagram. I waffled back-and-forth on this but ultimately ended up using Mermaid rather than PlantUML so that I could use those "Parameterized Input" bubbles. GitHub's UI renders Mermaid code blocks, but for the README I manually rendered an SVG. I'm intending to extend our PlantUML-rendering action to do the same thing.

%%{init: {"flowchart": {"htmlLabels": false}} }%%

flowchart TD

  classDef input fill:#ffffb3
  classDef output fill:#b3de69
  classDef gatk fill:#bebada
  classDef bcftools fill:#fdb462
  classDef R fill:#8dd3c7
  classDef linux fill:#fb8072

  subgraph legend ["`**Legend**`"]
      direction RL
    subgraph nodes ["`**Nodes**`"]
      input[["Input File"]]:::input
      input_node(["Parameterized Input"]):::input
      output[["Output file"]]:::output
    end

    subgraph processes ["`**Processes**`"]
      gatk_docker[GATK]:::gatk
      bcftools_docker[bcftools]:::bcftools
      r_docker[Rscript]:::R
      linux_docker[Generic Linux]:::linux
    end
  end

  legend
  ~~~ input_vcf[["Input VCF"]]:::input
  --> pipeval:::linux
  --> sv_vs_snv{{Variant Caller?}}

  sv_vs_snv ------> r_liftover
  header_contigs .-> r_liftover
  chain_file2 ..-> r_liftover
  gnomad_rds .-> r_extract_sv

  subgraph SV ["`**Delly2**`"]
    %% Other input files
    header_contigs([header_contigs]):::input
    chain_file2([chain_file]):::input
    gnomad_rds([gnomad_rds]):::input

    r_liftover[liftover-Delly2-vcf.R]:::R
    ---> r_extract_sv[extract-VCF-features-SV.R]:::R

  end

  chain_file .-> bcftools_liftover
  sv_vs_snv --> bcftools_liftover

  subgraph SNV ["`**Mutect2, HaplotypeCaller, Strelka2, Muse2, SomaticSniper**`"]
    funcotator_sources([funcotator_sources]):::input
    chain_file([chain_file]):::input
    repeat_bed([repeat_bed]):::input

    bcftools_liftover[bcftools +liftover]:::bcftools
    ---> gatk_func[gatk Funcotator]:::gatk
    --> bcftools_annotate["`bcftools annotate*RepeatMasker*`"]:::bcftools
    --> bcftools_annotate2["`bcftools annotate*Trinucleotide*`"]:::bcftools
    --> r_extract_snv[extract-VCF-features.R]:::R
  end

  funcotator_sources .-> gatk_func
  repeat_bed .-> bcftools_annotate

  joinpaths{ }
  r_extract_snv --> joinpaths
  r_extract_sv --> joinpaths
  joinpaths ---> r_predict_stability

  subgraph Predict Stability ["`        **Predict Stability**`"]
    r_predict_stability[predict-liftover-stability.R]:::R
    --> bcftools_annotate3["`bcftools annotate*Stability*`"]:::bcftools

    rf_model([rf_model]):::input .-> r_predict_stability
  end

  bcftools_annotate3 --> output_vcfs[["Output VCFs"]]:::output

Loading

Testing Results

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.

@nwiltsie nwiltsie requested a review from a team as a code owner July 29, 2024 18:59
@nwiltsie
Copy link
Member Author

I'll also note that R package management is a nightmare for reproducibility. I finally hit on using renv, which allowed me to version-pin all of the packages. Bioconductor came close, but it doesn't include all packages and allows for "bugfix" updates within a larger release:

install() also nudges users to remain current within a release, by default checking for out-of-date packages and asking if the user would like to update.

@nwiltsie
Copy link
Member Author

NFTest output (one of the tests failed): /hot/software/pipeline/pipeline-StableLift/Nextflow/development/unreleased/nwiltsie-regroup-modules/log-nftest-20240729T184802Z.log

@yashpatel6 yashpatel6 self-assigned this Jul 29, 2024
@nwiltsie
Copy link
Member Author

Thanks to @nkwang24, the SV test is now passing: /hot/software/pipeline/pipeline-StableLift/Nextflow/development/unreleased/nwiltsie-regroup-modules/log-nftest-20240729T233458Z.log.

@yashpatel6
Copy link

I'll look over this later today or early tomorrow!

Copy link

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general note, we'll want to include the main tool level directory at the end here

Also added a couple of suggestions for process names added in this PR

Dockerfile Outdated Show resolved Hide resolved
config/schema.yaml Show resolved Hide resolved
repeat_bed = "/hot/ref/database/RepeatMasker-3.0.1/processed/GRCh38/GRCh38_RepeatMasker_intervals.bed"

// SV files
// FIXME Should this be bundled?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarification: Is this a question of whether the files should be bundled into the Docker?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either into the Docker image, with the pipeline, or to a given reference path on disk. Put another way, is a user expected to (1) provide this file for each pipeline run, (2) have a standard copy locally, or (3) have it automatically provided for them by the pipeline?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we can divide the various input files into the 3 categories you listed:

(1) RF models (6 tools x 2 conversion directions = 12 total @ ~10Mb - 1Gb) hosted separately for user to download
(2) Expect user to have standard resource files such as reference fastas, chain files, funcotator sources
(3) Bundle the non-standard resource files (repeat_bed, header_contigs, gnomad_rds) into the Docker

Is this what you had in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - so I think that works out as:

  1. We upload RF models as attachments on pipeline releases.
  2. Users handle standard resource files.
  3. We bundle the non-standard resource files with the pipeline (the repeat_bed file is used outside of the docker image). That means they get checked into this repository and version-controlled.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the non-standard resource files, it may be better to include as release attachments rather than version-control them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashpatel6 so you assert that there should be two categories of files?

  1. User-provided standard files
  2. Everything else, distributed as release attachments

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I would suggest yes; the concern I would have about bundling the non-standard files into the Docker is the case where a user may want to make changes or provide a different file for those and having it bundled and then the user providing the paths in the config like other resources seems more consistent and allowing of that behavior

docs/pipeline.mmd Outdated Show resolved Hide resolved
main.nf Show resolved Hide resolved
module/predict_stability.nf Show resolved Hide resolved
module/scripts/predict-liftover-stability.R Outdated Show resolved Hide resolved
module/sv_workflow.nf Outdated Show resolved Hide resolved
module/sv_workflow.nf Outdated Show resolved Hide resolved
Copy link

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! One remaining comment for the output directory:

params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}"

To include StableLift- at the end:

params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}/StableLift-<manifest.version>"

It seems a bit redundant in this case since the main tool is the pipeline itself but this just brings it in line with the rest of the pipelines' output structure we follow

@nwiltsie
Copy link
Member Author

nwiltsie commented Aug 7, 2024

Okay @yashpatel6, I've added StableLift-${manifest.version} to the end of the output path (5f27a63) and figured out how to slightly reduce the version pin specificity (c9fde8c). Assuming that's all good can you give a final approval?

Copy link

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Looks good!

@nwiltsie nwiltsie merged commit ca2b0ec into main Aug 7, 2024
8 checks passed
@nwiltsie nwiltsie deleted the nwiltsie-regroup-modules branch August 7, 2024 16:57
@nwiltsie nwiltsie mentioned this pull request Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StableLift-SV
3 participants