From d63d0826a2d76b7582ca415068f314e6bd27e576 Mon Sep 17 00:00:00 2001 From: Mark Walker Date: Thu, 27 Jun 2024 15:40:03 -0400 Subject: [PATCH 1/4] Add filtering documentation Typo Typo 2 Add to overview Round out readme Typo 3 Add detail Minor changes Fix workspace table Update inputs Fix templates --- README.md | 134 ++++++++++++++++-- .../cohort_mode_workspace_dashboard.md.tmpl | 15 +- .../FilterGenotypes.json.tmpl | 42 ++++++ .../JoinRawCalls.json.tmpl | 28 ++++ .../SVConcordance.json.tmpl | 12 ++ .../cohort_mode/workspace.tsv.tmpl | 2 + .../FilterGenotypes.fixed_cutoffs.json.tmpl | 8 +- inputs/values/resources_hg38.json | 25 ++-- scripts/test/terra_validation.py | 2 +- 9 files changed, 226 insertions(+), 42 deletions(-) create mode 100644 inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl create mode 100644 inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl create mode 100644 inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl diff --git a/README.md b/README.md index 96d5f8356..0423490ae 100644 --- a/README.md +++ b/README.md @@ -24,8 +24,10 @@ A structural variation discovery pipeline for Illumina short-read whole-genome s * [GenotypeBatch](#genotype-batch) - Genotyping * [RegenotypeCNVs](#regenotype-cnvs) - Genotype refinement (optional) * [MakeCohortVcf](#make-cohort-vcf) - Cross-batch integration, complex event resolution, and VCF cleanup - * [Module 07](#module07) - Downstream Filtering - * [AnnotateVcf](#annotate-vcf) - Annotation + * [JoinRawCalls](#join-raw-calls) - Merges unfiltered calls across batches + * [SVConcordance](#svconcordance) - Calculates genotype concordance with raw calls + * [FilterGenotypes](#filter-genotypes) - Performs genotype filtering + * [AnnotateVcf](#annotate-vcf) - Functional and allele frequency annotation * [Module 09](#module09) - QC and Visualization * Additional modules - Mosaic and de novo * [CI/CD](#cicd) @@ -159,7 +161,9 @@ The pipeline consists of a series of modules that perform the following: * [FilterBatch](#filter-batch): Variant filtering; outlier exclusion * [GenotypeBatch](#genotype-batch): Genotyping * [MakeCohortVcf](#make-cohort-vcf): Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup -* [Module 07](#module07): Downstream filtering, including minGQ, batch effect check, outlier samples removal and final recalibration; +* [JoinRawCalls](#join-raw-calls): Merges unfiltered calls across batches +* [SVConcordance](#svconcordance): Calculates genotype concordance with raw calls +* [FilterGenotypes](#filter-genotypes): Performs genotype filtering * [AnnotateVcf](#annotate-vcf): Annotations, including functional annotation, allele frequency (AF) annotation and AF annotation with external population callsets; * [Module 09](#module09): Visualization, including scripts that generates IGV screenshots and rd plots. * Additional modules to be added: de novo and mosaic scripts @@ -471,23 +475,123 @@ Combines variants across multiple batches, resolves complex variants, re-genotyp #### Outputs: * Finalized "cleaned" VCF and QC plots -## Module 07 (in development) -Apply downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional and users should decide based on the specific purpose of their projects. +## JoinRawCalls -Filtering methods include: -* minGQ - remove variants based on the genotype quality across populations. -Note: Trio families are required to build the minGQ filtering model in this step. We provide tables pre-trained with the 1000 genomes samples at different FDR thresholds for projects that lack family structures, and they can be found at the paths below. These tables assume that GQ has a scale of [0,999], so they will not work with newer VCFs where GQ has a scale of [0,99]. +Merges raw unfiltered calls across batches. Concordance between these genotypes and the joint call set usually can be indicative of variant quality and is used downstream for genotype filtering. + +#### Prerequisites: +* [ClusterBatch](#cluster-batch) + +#### Inputs: +* Clustered Manta, Wham, Scramble, Melt, and/or depth VCF URIs ([ClusterBatch](#cluster-batch)) +* Ped file +* Reference sequence + +#### Outputs: +* VCF of clustered raw calls +* Ploidy table + +## SVConcordance + +Computes genotype concordance metrics between all variants in the joint call set and raw calls. + +#### Prerequisites: +* [MakeCohortVcf](#make-cohort-vcf) +* [JoinRawCalls](#join-raw-calls) + +#### Inputs: +* Cleaned ("eval") VCF URI ([MakeCohortVcf](#make-cohort-vcf)) +* Joined raw call ("truth") VCF URI ([JoinRawCalls](#join-raw-calls)) +* Reference dictionary URI + +#### Outputs: +* VCF with concordance annotations + +## FilterGenotypes + +Performs genotype quality recalibration using a machine learning model based on [xgboost](https://github.com/dmlc/xgboost), and filters genotypes. + +The ML model uses the following features: + +* Genotype properties: + * Allele frequency (AF), no-call counts + * Genotype quality (GQ) + * Supporting evidence types (EV) and respective genotype qualities (PE_GQ, SR_GQ, RD_GQ) + * Raw call concordance (CONC_ST) +* Variant properties: + * Variant type (SVTYPE) and size (SVLEN) + * FILTER status + * Calling algorithms (ALGORITHMS) + * Supporting evidence types (EVIDENCE) + * Two-sided SR support flag (BOTHSIDES_SUPPORT) + * Evidence overdispersion flag (PESR_GT_OVERDISPERSION) + * SR noise flag (HIGH_SR_BACKGROUND) + * Raw call concordance (STATUS, NON_REF_GENOTYPE_CONCORDANCE, VAR_PPV, VAR_SENSITIVITY, TRUTH_AF) +* Reference context with respect to UCSC Genome Browser tracks: + * RepeatMasker + * Segmental duplications + * Simple repeats + * K-mer mappability (umap_s100 and umap_s24) + +For ease of use, we provide a model pre-trained on high-quality data with truth data derived from long-read calls: ``` -gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt -gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt -gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt +gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model ``` +See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training. -* BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches -* FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs -* FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation +All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQs on [1, 99]. Note that the rescaled GQs should *not* be interpreted as probabilities. Original genotype qualities are retained in the OGQ field. + +A more positive SL score indicates higher probability of correctness of the given genotype. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices). + +This workflow can be run in one of two modes: + +1. (Recommended) The user explicitly provides a set of SL cutoffs through the `sl_filter_args` parameter, e.g. + ``` + "--small-del-threshold 93 --medium-del-threshold 150 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13 --inv-threshold -19" + ``` + Genotypes with SL scores less than the cutoffs are set to no-call (`./.`). The above values were taken directly from Appendix N of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7 ](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7). Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in [this script](https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/apply_sl_filter.py) for all available options. + +2. (Advanced) The user provides truth labels for a subset of non-reference calls, and SL cutoffs are automatically optimized. These truth labels should be provided as a json file in the following format: + ``` + { + "sample_1": { + "good_variant_ids": ["variant_1", "variant_3"], + "bad_variant_ids": ["variant_5", "variant_10"] + }, + "sample_2": { + "good_variant_ids": ["variant_2", "variant_13"], + "bad_variant_ids": ["variant_8", "variant_11"] + } + } + ``` + where "good_variant_ids" and "bad_variant_ids" are lists of variant IDs corresponding to non-reference (i.e. het or hom-var) sample genotypes that are true positives and false positives, respectively. SL cutoffs are optimized by maximizing the [F-score](https://en.wikipedia.org/wiki/F-score) with "beta" parameter `fmax_beta`, which modulates the weight given to precision over recall (lower values give higher precision). + +In both modes, the workflow additionally filters variants based on the "no-call rate", the proportion of genotypes that were filtered in a given variant. Variants exceeding the `no_call_rate_cutoff` are assigned a `HIGH_NCR` filter status. + +We recommend users observe the following basic criteria to assess the overall quality of the filtered call set: + +* Number of PASS variants (excluding BND) between 7,000 and 11,000. +* At least 75% of variants in Hardy-Weinberg equilibrium (HWE). Note that this could be lower, depending on how how closely the cohort adheres to the assumptions of the Hardy-Weinberg model. However, HWE is expected to at least improve after filtering. +* Low *de novo* inheritance rate (if applicable), typically 5-10%. + +These criteria can be assessed from the plots in the `main_vcf_qc_tarball` output, which is generated by default. + +#### Prerequisites: +* [SVConcordance](#svconcordance) + +#### Inputs: +* VCF with genotype concordance annotations URI ([SVConcordance](#svconcordance)) +* Ploidy table URI ([JoinRawCalls](#join-raw-calls)) +* GQRecalibrator model URI +* Either a set of SL cutoffs or truth labels + +#### Outputs: +* The filtered VCF +* Call set QC plots (optional) +* Optimized SL cutoffs with filtering QC plots and data tables (if running mode [2] with truth labels) +* A copy of the VCF with only SL annotation and GQ recalibration (before filtering) -## AnnotateVcf (in development) +## AnnotateVcf *Formerly Module08Annotation* Add annotations, such as the inferred function and allele frequencies of variants, to final VCF. diff --git a/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl b/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl index b210e9b53..549d0b1c3 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl @@ -61,13 +61,16 @@ The following workflows are included in this workspace, to be executed in this o 13. `13-ResolveComplexVariants`: Complex variant resolution 14. `14-GenotypeComplexVariants`: Complex variant re-genotyping 15. `15-CleanVcf`: VCF cleanup -16. `16-MainVcfQc`: Generates VCF QC reports -17. `17-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets +16. `16-JoinRawCalls`: Combines unfiltered calls (from step 5) across batches +17. `17-SVConcordance`: Annotates variants with genotype concordance against raw calls +18. `18-FilterGenotypes`: Performs genotype filtering to improve precision and generates QC plots +19. `19-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets -Additional downstream modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information. +Additional downstream modules, such as those for visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information. Extra workflows (Not part of canonical pipeline, but included for your convenience. May require manual configuration): -* `PlotSVCountsPerSample: Plot SV counts per sample per SV type +* `MainVcfQc`: Generates VCF QC reports (is run during 18-FilterGenotypes by default) +* `PlotSVCountsPerSample`: Plot SV counts per sample per SV type * `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. Recommended to run `PlotSVCountsPerSample` beforehand (configured with the single VCF you want to filter) to enable IQR cutoff choice. For detailed instructions on running the pipeline in Terra, see **Step-by-step instructions** below. @@ -202,11 +205,11 @@ Read the full MergeBatchSites documentation [here](https://github.com/broadinsti Read the full GenotypeBatch documentation [here](https://github.com/broadinstitute/gatk-sv#genotype-batch). * Use the same `sample_set` definitions you used for `03-TrainGCNV` through `08-FilterBatchSamples`. -#### 11-RegenotypeCNVs, 12-CombineBatches, 13-ResolveComplexVariants, 14-GenotypeComplexVariants, 15-CleanVcf, 16-MainVcfQc, and 17-AnnotateVcf +#### 11-RegenotypeCNVs, 12-CombineBatches, 13-ResolveComplexVariants, 14-GenotypeComplexVariants, 15-CleanVcf, 16-JoinRawCalls, 17-SVConcordance, 18-FilterGenotypes, and 19-AnnotateVcf Read the full documentation for [RegenotypeCNVs](https://github.com/broadinstitute/gatk-sv#regenotype-cnvs), [MakeCohortVcf](https://github.com/broadinstitute/gatk-sv#make-cohort-vcf) (which includes `CombineBatches`, `ResolveComplexVariants`, `GenotypeComplexVariants`, `CleanVcf`, `MainVcfQc`), and [AnnotateVcf](https://github.com/broadinstitute/gatk-sv#annotate-vcf) on the README. * Use the same cohort `sample_set_set` you created and used for `09-MergeBatchSites`. #### Downstream steps -Additional downstream steps are under development. Read about some of them on the README [here](https://github.com/broadinstitute/gatk-sv#module07). Please note that the VCF produced by `15-CleanVcf` (and annotated by `17-AnnotateVcf`) prioritizes sensitivity, but additional downstream filtration is recommended to improve specificity. Filtration methods are under active development by the GATK-SV team; stay tuned for updates. +Additional downstream steps are under development. Read about some of them on the README [here](https://github.com/broadinstitute/gatk-sv#module07). diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl new file mode 100644 index 000000000..819477ad0 --- /dev/null +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl @@ -0,0 +1,42 @@ +{ + "FilterGenotypes.vcf": "${this.concordance_vcf}", + "FilterGenotypes.output_prefix": "${this.sample_set_id}", + "FilterGenotypes.ploidy_table": "${this.ploidy_table}", + "FilterGenotypes.gq_recalibrator_model_file": "${workspace.recalibrate_gq_model_file}", + "FilterGenotypes.sl_filter_args": "--small-del-threshold 93 --medium-del-threshold 150 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13 --inv-threshold -19", + + "FilterGenotypes.genome_tracks": "${workspace.recalibrate_gq_genome_tracks}", + "FilterGenotypes.genome_tracks": [ + {{ reference_resources.recalibrate_gq_genome_track_repeatmasker | tojson }}, + {{ reference_resources.recalibrate_gq_genome_track_segdup | tojson }}, + {{ reference_resources.recalibrate_gq_genome_track_simple_repeats | tojson }}, + {{ reference_resources.recalibrate_gq_genome_track_umap100 | tojson }}, + {{ reference_resources.recalibrate_gq_genome_track_umap24 | tojson }} + ], + "FilterGenotypes.recalibrate_gq_args": [ + "--keep-homvar false", + "--keep-homref true", + "--keep-multiallelic true", + "--skip-genotype-filtering true", + "--min-samples-to-estimate-allele-frequency -1" + ], + + "FilterGenotypes.ped_file": "${workspace.cohort_ped_file}", + "FilterGenotypes.primary_contigs_fai": "${workspace.primary_contigs_fai}", + "FilterGenotypes.site_level_comparison_datasets": [ + {{ reference_resources.ccdg_abel_site_level_benchmarking_dataset | tojson }}, + {{ reference_resources.gnomad_v2_collins_site_level_benchmarking_dataset | tojson }}, + {{ reference_resources.hgsv_byrska_bishop_site_level_benchmarking_dataset | tojson }}, + {{ reference_resources.thousand_genomes_site_level_benchmarking_dataset | tojson }} + ], + "FilterGenotypes.runtime_override_plot_qc_per_family": { + "mem_gb": 15, + "disk_gb": 100 + }, + + "FilterGenotypes.linux_docker": "${workspace.linux_docker}", + "FilterGenotypes.gatk_docker": "${workspace.gq_recalibrator_docker}", + "FilterGenotypes.gq_recalibrator_docker": "${workspace.gq_recalibrator_docker}", + "FilterGenotypes.sv_base_mini_docker": "${workspace.sv_base_mini_docker}", + "FilterGenotypes.sv_pipeline_docker": "${workspace.sv_pipeline_docker}" +} diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl new file mode 100644 index 000000000..1d6723722 --- /dev/null +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl @@ -0,0 +1,28 @@ +{ + "JoinRawCalls.gatk_docker": "${workspace.gatk_docker}", + "JoinRawCalls.sv_base_mini_docker": "${workspace.sv_base_mini_docker}", + "JoinRawCalls.sv_pipeline_docker": "${workspace.sv_pipeline_docker}", + + "JoinRawCalls.clustered_depth_vcfs" : "${this.clustered_depth_vcf}", + "JoinRawCalls.clustered_depth_vcf_indexes" : "${this.clustered_depth_vcf_index}", + + "JoinRawCalls.clustered_manta_vcfs" : "${this.clustered_manta_vcf}", + "JoinRawCalls.clustered_manta_vcf_indexes" : "${this.clustered_manta_vcf_index}", + + "JoinRawCalls.clustered_wham_vcfs" : "${this.clustered_wham_vcf}", + "JoinRawCalls.clustered_wham_vcf_indexes" : "${this.clustered_wham_vcf_index}", + + "JoinRawCalls.clustered_melt_vcfs" : "${this.clustered_melt_vcf}", + "JoinRawCalls.clustered_melt_vcf_indexes" : "${this.clustered_melt_vcf_index}", + + "JoinRawCalls.FormatVcfForGatk.formatter_args": "--fix-end", + + "JoinRawCalls.ped_file": "${workspace.cohort_ped_file}", + + "JoinRawCalls.contig_list": "${workspace.primary_contigs_list}", + "JoinRawCalls.reference_fasta": "${workspace.reference_fasta}", + "JoinRawCalls.reference_fasta_fai": "${workspace.reference_index}", + "JoinRawCalls.reference_dict": "${workspace.reference_dict}", + + "JoinRawCalls.prefix": "${this.sample_set_id}" +} diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl new file mode 100644 index 000000000..2b7b98f72 --- /dev/null +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl @@ -0,0 +1,12 @@ +{ + "SVConcordance.gatk_docker": "${workspace.gatk_docker}", + "SVConcordance.sv_base_mini_docker": "${workspace.sv_base_mini_docker}", + + "SVConcordance.eval_vcf" : "${this.cleaned_vcf}", + "SVConcordance.truth_vcf" : "${this.joined_raw_calls_vcf}", + + "SVConcordance.output_prefix": "${this.sample_set_id}", + + "SVConcordance.contig_list": "${workspace.primary_contigs_list}", + "SVConcordance.reference_dict": "${workspace.reference_dict}" +} diff --git a/inputs/templates/terra_workspaces/cohort_mode/workspace.tsv.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workspace.tsv.tmpl index ed622c665..a9359d4d6 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/workspace.tsv.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/workspace.tsv.tmpl @@ -5,6 +5,7 @@ gatk_docker {{ dockers.gatk_docker }} gatk_docker_pesr_override {{ dockers.gatk_docker_pesr_override }} gcnv_gatk_docker {{ dockers.gatk_docker }} genomes_in_the_cloud_docker {{ dockers.genomes_in_the_cloud_docker }} +gq_recalibrator_docker {{ dockers.gq_recalibrator_docker }} linux_docker {{ dockers.linux_docker }} manta_docker {{ dockers.manta_docker }} samtools_cloud_docker {{ dockers.samtools_cloud_docker }} @@ -39,6 +40,7 @@ preprocessed_intervals {{ reference_resources.preprocessed_intervals }} primary_contigs_fai {{ reference_resources.primary_contigs_fai }} primary_contigs_list {{ reference_resources.primary_contigs_list }} protein_coding_gtf {{ reference_resources.protein_coding_gtf }} +recalibrate_gq_model_file {{ reference_resources.aou_recalibrate_gq_model_file }} reference_build {{ reference_resources.reference_build }} reference_dict {{ reference_resources.reference_dict }} reference_fasta {{ reference_resources.reference_fasta }} diff --git a/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl b/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl index 19bce81e9..3f90718bd 100644 --- a/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl +++ b/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl @@ -14,18 +14,14 @@ "--min-samples-to-estimate-allele-frequency -1" ], - "FilterGenotypes.ped_file": {{ test_batch.ped_file | tojson }}, - "FilterGenotypes.primary_contigs_fai": {{ reference_resources.primary_contigs_fai | tojson }}, + "FilterGenotypes.ped_file": "${workspace.cohort_ped_file}", + "FilterGenotypes.primary_contigs_fai": "${workspace.primary_contigs_fai}", "FilterGenotypes.site_level_comparison_datasets": [ {{ reference_resources.ccdg_abel_site_level_benchmarking_dataset | tojson }}, {{ reference_resources.gnomad_v2_collins_site_level_benchmarking_dataset | tojson }}, {{ reference_resources.hgsv_byrska_bishop_site_level_benchmarking_dataset | tojson }}, {{ reference_resources.thousand_genomes_site_level_benchmarking_dataset | tojson }} ], - "FilterGenotypes.sample_level_comparison_datasets": [ - {{ reference_resources.hgsv_byrska_bishop_sample_level_benchmarking_dataset | tojson }} - ], - "FilterGenotypes.sample_renaming_tsv": {{ reference_resources.hgsv_byrska_bishop_sample_renaming_tsv | tojson }}, "FilterGenotypes.runtime_override_per_sample_benchmark_plot": { "mem_gb": 30, "disk_gb": 50 diff --git a/inputs/values/resources_hg38.json b/inputs/values/resources_hg38.json index c587cfc85..5bcb3dc45 100644 --- a/inputs/values/resources_hg38.json +++ b/inputs/values/resources_hg38.json @@ -50,20 +50,17 @@ "aou_recalibrate_gq_model_file": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model", "hgdp_recalibrate_gq_model_file": "gs://gatk-sv-hgdp/mw-sv-concordance-update/hgdp.gq_recalibrator.model", - "recalibrate_gq_genome_tracks": [ - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-RepeatMasker.bed.gz", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Segmental-Dups.bed.gz", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Simple-Repeats.bed.gz", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s100.bed.gz", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s24.bed.gz" - ], - "recalibrate_gq_genome_track_indexes": [ - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-RepeatMasker.bed.gz.tbi", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Segmental-Dups.bed.gz.tbi", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Simple-Repeats.bed.gz.tbi", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s100.bed.gz.tbi", - "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s24.bed.gz.tbi" - ], + + "recalibrate_gq_genome_track_repeatmasker": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-RepeatMasker.bed.gz", + "recalibrate_gq_genome_track_repeatmasker_index": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-RepeatMasker.bed.gz.tbi", + "recalibrate_gq_genome_track_segdup": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Segmental-Dups.bed.gz", + "recalibrate_gq_genome_track_segdup_index": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Segmental-Dups.bed.gz.tbi", + "recalibrate_gq_genome_track_simple_repeats": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Simple-Repeats.bed.gz", + "recalibrate_gq_genome_track_simple_repeats_index": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38-Simple-Repeats.bed.gz.tbi", + "recalibrate_gq_genome_track_umap100": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s100.bed.gz", + "recalibrate_gq_genome_track_umap100_index": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s100.bed.gz.tbi", + "recalibrate_gq_genome_track_umap24": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s24.bed.gz", + "recalibrate_gq_genome_track_umap24_index": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/ucsc-genome-tracks/hg38_umap_s24.bed.gz.tbi", "ccdg_abel_site_level_benchmarking_dataset" : ["CCDG_abel", "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/CCDG_Abel"], "gnomad_v2_collins_sample_level_benchmarking_dataset": ["gnomAD_v2_Collins", "gs://gatk-sv-resources-secure/resources/hg38_benchmarking/gnomAD_v2_Collins_perSample.tar.gz"], diff --git a/scripts/test/terra_validation.py b/scripts/test/terra_validation.py index 84d782890..2e9a64756 100644 --- a/scripts/test/terra_validation.py +++ b/scripts/test/terra_validation.py @@ -113,7 +113,7 @@ def main(): parser.add_argument("-j", "--womtool-jar", help="Path to womtool jar", required=True) parser.add_argument("-n", "--num-input-jsons", help="Number of Terra input JSONs expected", - required=False, default=21, type=int) + required=False, default=24, type=int) parser.add_argument("--log-level", help="Specify level of logging information, ie. info, warning, error (not case-sensitive)", required=False, default="INFO") From ad9360a65fd8e25b26302626cdf30281558898ba Mon Sep 17 00:00:00 2001 From: Mark Walker Date: Tue, 2 Jul 2024 10:07:15 -0400 Subject: [PATCH 2/4] Fix FilterGenotypes config template --- .../workflow_configurations/FilterGenotypes.json.tmpl | 1 - 1 file changed, 1 deletion(-) diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl index 819477ad0..6f3ff2955 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl @@ -36,7 +36,6 @@ "FilterGenotypes.linux_docker": "${workspace.linux_docker}", "FilterGenotypes.gatk_docker": "${workspace.gq_recalibrator_docker}", - "FilterGenotypes.gq_recalibrator_docker": "${workspace.gq_recalibrator_docker}", "FilterGenotypes.sv_base_mini_docker": "${workspace.sv_base_mini_docker}", "FilterGenotypes.sv_pipeline_docker": "${workspace.sv_pipeline_docker}" } From 05b802583a6b608d4e6e6cfce489affdf0e5bd95 Mon Sep 17 00:00:00 2001 From: Mark Walker Date: Mon, 30 Sep 2024 11:03:45 -0400 Subject: [PATCH 3/4] Reviewer comments --- README.md | 14 +++++++------- .../cohort_mode_workspace_dashboard.md.tmpl | 4 ++-- .../FilterGenotypes.json.tmpl | 2 +- .../JoinRawCalls.json.tmpl | 18 +++++++++--------- .../SVConcordance.json.tmpl | 2 +- .../FilterGenotypes.fixed_cutoffs.json.tmpl | 4 ++-- 6 files changed, 22 insertions(+), 22 deletions(-) diff --git a/README.md b/README.md index 0423490ae..41803eb83 100644 --- a/README.md +++ b/README.md @@ -164,7 +164,7 @@ The pipeline consists of a series of modules that perform the following: * [JoinRawCalls](#join-raw-calls): Merges unfiltered calls across batches * [SVConcordance](#svconcordance): Calculates genotype concordance with raw calls * [FilterGenotypes](#filter-genotypes): Performs genotype filtering -* [AnnotateVcf](#annotate-vcf): Annotations, including functional annotation, allele frequency (AF) annotation and AF annotation with external population callsets; +* [AnnotateVcf](#annotate-vcf): Annotations, including functional annotation, allele frequency (AF) annotation and AF annotation with external population callsets * [Module 09](#module09): Visualization, including scripts that generates IGV screenshots and rd plots. * Additional modules to be added: de novo and mosaic scripts @@ -483,8 +483,8 @@ Merges raw unfiltered calls across batches. Concordance between these genotypes * [ClusterBatch](#cluster-batch) #### Inputs: -* Clustered Manta, Wham, Scramble, Melt, and/or depth VCF URIs ([ClusterBatch](#cluster-batch)) -* Ped file +* Clustered Manta, Wham, depth, Scramble, and/or MELT VCF URIs ([ClusterBatch](#cluster-batch)) +* PED file * Reference sequence #### Outputs: @@ -514,7 +514,7 @@ Performs genotype quality recalibration using a machine learning model based on The ML model uses the following features: * Genotype properties: - * Allele frequency (AF), no-call counts + * Non-reference and no-call allele counts * Genotype quality (GQ) * Supporting evidence types (EV) and respective genotype qualities (PE_GQ, SR_GQ, RD_GQ) * Raw call concordance (CONC_ST) @@ -541,7 +541,7 @@ See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Qualit All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQs on [1, 99]. Note that the rescaled GQs should *not* be interpreted as probabilities. Original genotype qualities are retained in the OGQ field. -A more positive SL score indicates higher probability of correctness of the given genotype. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices). +A more positive SL score indicates higher probability that the give genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices). This workflow can be run in one of two modes: @@ -586,10 +586,10 @@ These criteria can be assessed from the plots in the `main_vcf_qc_tarball` outpu * Either a set of SL cutoffs or truth labels #### Outputs: -* The filtered VCF +* Filtered VCF * Call set QC plots (optional) * Optimized SL cutoffs with filtering QC plots and data tables (if running mode [2] with truth labels) -* A copy of the VCF with only SL annotation and GQ recalibration (before filtering) +* VCF with only SL annotation and GQ recalibration (before filtering) ## AnnotateVcf *Formerly Module08Annotation* diff --git a/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl b/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl index 549d0b1c3..224f83af9 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl @@ -207,9 +207,9 @@ Read the full GenotypeBatch documentation [here](https://github.com/broadinstitu #### 11-RegenotypeCNVs, 12-CombineBatches, 13-ResolveComplexVariants, 14-GenotypeComplexVariants, 15-CleanVcf, 16-JoinRawCalls, 17-SVConcordance, 18-FilterGenotypes, and 19-AnnotateVcf -Read the full documentation for [RegenotypeCNVs](https://github.com/broadinstitute/gatk-sv#regenotype-cnvs), [MakeCohortVcf](https://github.com/broadinstitute/gatk-sv#make-cohort-vcf) (which includes `CombineBatches`, `ResolveComplexVariants`, `GenotypeComplexVariants`, `CleanVcf`, `MainVcfQc`), and [AnnotateVcf](https://github.com/broadinstitute/gatk-sv#annotate-vcf) on the README. +Read the full documentation for [RegenotypeCNVs](https://github.com/broadinstitute/gatk-sv#regenotype-cnvs), [MakeCohortVcf](https://github.com/broadinstitute/gatk-sv#make-cohort-vcf) (which includes `CombineBatches`, `ResolveComplexVariants`, `GenotypeComplexVariants`, `CleanVcf`), [`JoinRawCalls`](https://github.com/broadinstitute/gatk-sv#join-raw-calls), [`SVConcordance`](https://github.com/broadinstitute/gatk-sv#svconcordance), [`FilterGenotypes`](https://github.com/broadinstitute/gatk-sv#filter-genotypes), and [AnnotateVcf](https://github.com/broadinstitute/gatk-sv#annotate-vcf) on the README. * Use the same cohort `sample_set_set` you created and used for `09-MergeBatchSites`. #### Downstream steps -Additional downstream steps are under development. Read about some of them on the README [here](https://github.com/broadinstitute/gatk-sv#module07). +Additional downstream steps are under development. diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl index 6f3ff2955..1ef362b49 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/FilterGenotypes.json.tmpl @@ -1,6 +1,6 @@ { "FilterGenotypes.vcf": "${this.concordance_vcf}", - "FilterGenotypes.output_prefix": "${this.sample_set_id}", + "FilterGenotypes.output_prefix": "${this.sample_set_set_id}", "FilterGenotypes.ploidy_table": "${this.ploidy_table}", "FilterGenotypes.gq_recalibrator_model_file": "${workspace.recalibrate_gq_model_file}", "FilterGenotypes.sl_filter_args": "--small-del-threshold 93 --medium-del-threshold 150 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13 --inv-threshold -19", diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl index 1d6723722..0103480b1 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl @@ -3,17 +3,17 @@ "JoinRawCalls.sv_base_mini_docker": "${workspace.sv_base_mini_docker}", "JoinRawCalls.sv_pipeline_docker": "${workspace.sv_pipeline_docker}", - "JoinRawCalls.clustered_depth_vcfs" : "${this.clustered_depth_vcf}", - "JoinRawCalls.clustered_depth_vcf_indexes" : "${this.clustered_depth_vcf_index}", + "JoinRawCalls.clustered_depth_vcfs" : "${this.sample_sets.clustered_depth_vcf}", + "JoinRawCalls.clustered_depth_vcf_indexes" : "${this.sample_sets.clustered_depth_vcf_index}", - "JoinRawCalls.clustered_manta_vcfs" : "${this.clustered_manta_vcf}", - "JoinRawCalls.clustered_manta_vcf_indexes" : "${this.clustered_manta_vcf_index}", + "JoinRawCalls.clustered_manta_vcfs" : "${this.sample_sets.clustered_manta_vcf}", + "JoinRawCalls.clustered_manta_vcf_indexes" : "${this.sample_sets.clustered_manta_vcf_index}", - "JoinRawCalls.clustered_wham_vcfs" : "${this.clustered_wham_vcf}", - "JoinRawCalls.clustered_wham_vcf_indexes" : "${this.clustered_wham_vcf_index}", + "JoinRawCalls.clustered_wham_vcfs" : "${this.sample_sets.clustered_wham_vcf}", + "JoinRawCalls.clustered_wham_vcf_indexes" : "${this.sample_sets.clustered_wham_vcf_index}", - "JoinRawCalls.clustered_melt_vcfs" : "${this.clustered_melt_vcf}", - "JoinRawCalls.clustered_melt_vcf_indexes" : "${this.clustered_melt_vcf_index}", + "JoinRawCalls.clustered_scramble_vcfs" : "${this.sample_sets.clustered_scramble_vcf}", + "JoinRawCalls.clustered_scramble_vcf_indexes" : "${this.sample_sets.clustered_scramble_vcf_index}", "JoinRawCalls.FormatVcfForGatk.formatter_args": "--fix-end", @@ -24,5 +24,5 @@ "JoinRawCalls.reference_fasta_fai": "${workspace.reference_index}", "JoinRawCalls.reference_dict": "${workspace.reference_dict}", - "JoinRawCalls.prefix": "${this.sample_set_id}" + "JoinRawCalls.prefix": "${this.sample_set_set_id}" } diff --git a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl index 2b7b98f72..ff48489ac 100644 --- a/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl +++ b/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/SVConcordance.json.tmpl @@ -5,7 +5,7 @@ "SVConcordance.eval_vcf" : "${this.cleaned_vcf}", "SVConcordance.truth_vcf" : "${this.joined_raw_calls_vcf}", - "SVConcordance.output_prefix": "${this.sample_set_id}", + "SVConcordance.output_prefix": "${this.sample_set_set_id}", "SVConcordance.contig_list": "${workspace.primary_contigs_list}", "SVConcordance.reference_dict": "${workspace.reference_dict}" diff --git a/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl b/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl index 3f90718bd..5035f0d0b 100644 --- a/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl +++ b/inputs/templates/test/FilterGenotypes/FilterGenotypes.fixed_cutoffs.json.tmpl @@ -14,8 +14,8 @@ "--min-samples-to-estimate-allele-frequency -1" ], - "FilterGenotypes.ped_file": "${workspace.cohort_ped_file}", - "FilterGenotypes.primary_contigs_fai": "${workspace.primary_contigs_fai}", + "FilterGenotypes.ped_file": {{ test_batch.ped_file | tojson }}, + "FilterGenotypes.primary_contigs_fai": {{ reference_resources.primary_contigs_fai | tojson }}, "FilterGenotypes.site_level_comparison_datasets": [ {{ reference_resources.ccdg_abel_site_level_benchmarking_dataset | tojson }}, {{ reference_resources.gnomad_v2_collins_site_level_benchmarking_dataset | tojson }}, From 9482cb368365986b42228ba13b30272b79e6f15c Mon Sep 17 00:00:00 2001 From: Mark Walker Date: Mon, 21 Oct 2024 11:25:04 -0400 Subject: [PATCH 4/4] Typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 41803eb83..1a6408e83 100644 --- a/README.md +++ b/README.md @@ -541,7 +541,7 @@ See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Qualit All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQs on [1, 99]. Note that the rescaled GQs should *not* be interpreted as probabilities. Original genotype qualities are retained in the OGQ field. -A more positive SL score indicates higher probability that the give genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices). +A more positive SL score indicates higher probability that the given genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices). This workflow can be run in one of two modes: