
Update CombineBatches workflow #732

Merged
merged 10 commits into main from mw_gatk_combine_batches
Jan 14, 2025
Conversation

mwalker174
Collaborator

@mwalker174 mwalker174 commented Oct 3, 2024

Replaces most methods in the CombineBatches workflow with a greatly simplified set of tasks that utilize GATK SVCluster and the new GroupedSVCluster tool (see PR). SVCluster replaces most of the current functionality including VCF joining and clustering, while GroupedSVCluster introduces refined clustering (a.k.a. "reclustering") that has become a best practice for larger call sets.

Clustering refinement is critical for consolidating redundant variants in repetitive sequence contexts such as simple repeats and segmental duplications. This also addresses an issue with duplicate insertions that share coordinates but have slightly different split read signatures (i.e. different END positions).

Genotype merging was improved slightly in this PR as well.

In addition, this PR makes some minor improvements to VCF formatting and parsing:

  • In GenotypeBatch, CNVs are now formatted the same way as in CleanVcf, i.e. with no genotypes, a <CNV> ALT allele, and SVTYPE=CNV, rather than <CN0>,<CN1>,… ALT alleles with SVTYPE=DUP. If a user needs to run on a VCF in the old format, the legacy_vcfs flag in CombineBatches will convert it to the new format prior to processing.
  • Within CombineBatches, records are annotated with HIGH_SR_BACKGROUND and BOTHSIDE_PASS INFO field flags rather than passing around separate lists, which is cumbersome.
  • Minor improvements to some downstream scripts to use .get() for accessing FORMAT fields rather than brackets. This was required in some cases because GATK omits a FORMAT field if it is null for all samples in a given record. Pysam then throws an error since the requested key does not exist, whereas .get() returns None.
  • The VCF is converted back to the "old" format at the end of CombineBatches to minimize risk of bugs in downstream workflows.
  • Minor change to the breakpoint overlap filter: variants are prioritized on BOTHSIDE_PASS status (binary) rather than fraction of supporting batches.
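The .get() change described above can be illustrated with a plain dict standing in for a FORMAT mapping (hypothetical field names and values; the real scripts operate on pysam sample objects, which support the same mapping interface):

```python
# Illustration (not pipeline code) of why .get() is safer than bracket access
# for FORMAT fields: GATK omits a FORMAT key entirely when it is null for all
# samples in a record, so the key may simply be absent.
sample_format = {"GT": (0, 1), "GQ": 99}  # hypothetical record with no "EV" key

# Bracket access raises KeyError when the field was omitted
try:
    ev = sample_format["EV"]
except KeyError:
    ev = None

# .get() returns None instead, with no exception handling needed
assert sample_format.get("EV") is None
assert sample_format.get("GQ") == 99
```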

@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from b5de9f0 to 3a170b6 Compare November 1, 2024 15:40
@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from 3a170b6 to 6a6d00b Compare November 18, 2024 16:47
@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from 2a39fcb to 9b729de Compare December 2, 2024 16:05
Add reformatting to GenotypeBatch

Expose reformat_script

Start ripping stuff out

Finish rewriting wdl and template

Add TODO and delete unused task

Don't assign genotypes to CNVs in add_genotypes.py

Set mixed_breakend_window to 1mb

Fixes

Start implementing ContextAwareClustering

Fix wdl

Update wdl and gatk docker

Comment

Set context overlap parameters

Update docker

Fix size in ReformatGenotypedVcf

Update to reformat_genotyped_vcf.py

Update legacy reformatter, fix GenotypeBatchMetrics wdl

Remove reformat step from GenotypeBatch

Use RD_CN if CN is unavailable for mCNVs in svtk vcf2bed

Add additional_args to svcluster and groupedsvcluster

Add join step

Update runtime attr

Filter legacy records with invalid coords (needs testing)

Fix record dropping; add --fix-end to wdl call

Representative breakpoint summary strategy

Update gatk docker

Integrate SR flags into VCF

Update dockers

Parse last column in SR flags lists

Gatk to svtk formatting

Fix CNV strands and overlap breakpoint filter bothside pass file parsing

Breakpoint overlap filter now sorts by BOTHSIDES_SUPPORT status rather than fraction

Set empty FILTER statuses to PASS

Use safer get() methods instead of brackets for accessing FORMAT fields

Delete unused VcfClusterSingleChromsome.wdl

Remove other unused wdls

Do not require CN or RD_CN to be defined for all samples for CNVs in get_called_samples()

Fix multi-allelic ALTs and genotype parsing

Fix multi-allelic formatting in cleanvcf5

Clean vcf 5 script override

Add SR1POS and SR2POS to gatk format to recover INS END coordinate

Reset dockers to main

Fix mCNV alts again

Update gatk docker

context to track

Integrate into top-level WDLs and update json templates

Update terra config

Fix MakeCohortVcf inputs

Remove MakeCohortVcf json templates

Remove duplicate runtime_override_clean_background_fail input in GATKSVPipelineSingleSample

Update yaml for testing

Remove duplicate runtime_override_breakpoint_overlap_filter input
@mwalker174 mwalker174 force-pushed the mw_gatk_combine_batches branch from 54a0f97 to e3adbd0 Compare December 3, 2024 16:08
@mwalker174 mwalker174 marked this pull request as ready for review December 9, 2024 20:54
Collaborator
@epiercehoffman epiercehoffman left a comment

Looks very good! Just some very minor comments and a few questions for my own understanding

@@ -30,6 +30,7 @@

"GATKSVPipelineSingleSample.linux_docker" : "${workspace.linux_docker}",
"GATKSVPipelineSingleSample.manta_docker": "${workspace.manta_docker}",
"GATKSVPipelineSingleSample.scramble_docker": "${workspace.scramble_docker}",
Collaborator

I noticed the cromshell test JSON has both scramble and melt dockers - may want to drop melt

Collaborator Author

Yeah that's true. I've deleted it, although it's overridden by the "use_melt" parameter.

continue
else:
new_header_lines.append(line)
# Incorporate SR evidence flags
Collaborator

Is this comment outdated? I don't see how it relates to the code

Collaborator Author

Looks like a remnant from copy+paste

new_record.id = record.id
# copy INFO fields
for key in record.info:
if key not in remove_infos:
new_record.info[key] = record.info[key]
if svtype == 'CNV':
# Prior to CleanVcf, all mCNVs are DUP type
new_record.info['SVTYPE'] = 'DUP'
Collaborator

Do you envision this script ever getting used after CleanVcf? If no, might be worth documenting that it shouldn't be since the name is pretty generic. I see the help text says it's for GenerateBatchMetrics but it appears that's out of date.

Collaborator Author

Good catch, I've updated the program description to note that it shouldn't be used after CleanVcf.

@@ -174,6 +212,10 @@ def __parse_arguments(argv: List[Text]) -> argparse.Namespace:
help="Comma-delimited list of FORMAT fields to remove")
parser.add_argument("--remove-infos", type=str,
help="Comma-delimited list of INFO fields to remove")
parser.add_argument("--intervaled-ins", default=False, action='store_true',
Collaborator

Unused argument? Not referred to in script or in WDLs

Collaborator Author

Deleted

@@ -148,6 +172,11 @@ def is_null(val):
val = record.info[k]
if is_null(val) or (isinstance(val, tuple) and len(val) == 1 and is_null(val[0])):
del record.info[k]
# Add SR flag to INFO field
if record.id in bothside_pass_vid_set:
Collaborator

This functionality does seem a little bit out of place in this script, but I suppose it's more efficient to add it to an existing task than create a new one

Collaborator Author

Technically we do now expect these INFO fields in GATK per broadinstitute/gatk#8990

}

call HarmonizeHeaders.HarmonizeHeaders {
# Third round of clustering
Collaborator

Is clustering in several stages needed or could it be accomplished in one step?

Collaborator Author

It's required. The first round permits clustering across track boundaries - e.g. clustering two very similar variants where one might fall just outside a track interval and the other just inside. The second and third steps cluster groups that aren't mutually exclusive - e.g. INS in simple repeats in the second and INS in any context in the third will contain some of the same variants but have different criteria applied.

@@ -105,7 +105,7 @@ workflow ClusterDepth {
algorithm=clustering_algorithm,
depth_sample_overlap=0,
depth_interval_overlap=depth_interval_overlap,
depth_breakend_window=0,
depth_breakend_window=10000000,
Collaborator

This looks like a big change...?

Collaborator Author

Just reflects an inversion of some logic in GATK. We now require both breakend window and reciprocal overlap instead of either/or. broadinstitute/gatk#8962

@@ -20,6 +20,8 @@ workflow GenotypeDepthPart2 {
File coveragefile
File? coveragefile_index

File? reformat_script
Collaborator

Looks like this input was added but is not passed to any task - here and in GenotypePESRPart2 and GenotypeBatch

Collaborator Author

Oops, I had those for debugging but forgot to delete these

else:
alleles = ('N', alleles[1])
contig = record.contig
new_record = vcf_out.new_record(contig=contig, start=record.start, stop=record.stop, alleles=alleles)
# Change filter to PASS if requested
if set_pass and len(record.filter) == 0:
Collaborator

This is done in FilterGenotypes as of #736 - fine if we want to do it earlier but not necessary to do twice (though obviously lightweight enough that it doesn't really matter if we do)

Collaborator Author

I can't recall if having . instead of PASS breaks any specific task between CombineBatches and filtering, but this matches the current formatting. It's probably not good practice to be doing this, but I wouldn't be surprised if leaving them as . would break something.
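The behavior in the diff above can be sketched as a standalone function (a simplified stand-in with a hypothetical filter name; the actual script mutates a pysam record's filter field):

```python
def normalize_filter(filters, set_pass=True):
    """Return ['PASS'] when no FILTER statuses are set, mirroring the
    'Set empty FILTER statuses to PASS' change; otherwise keep them as-is."""
    if set_pass and len(filters) == 0:
        return ["PASS"]
    return list(filters)

# An empty FILTER (rendered as '.' in VCF text) becomes an explicit PASS
assert normalize_filter([]) == ["PASS"]
# Existing statuses are left untouched ("LOW_QUAL" is a hypothetical example)
assert normalize_filter(["LOW_QUAL"]) == ["LOW_QUAL"]
```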

@@ -7,6 +7,14 @@
"autosome_file" : "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai",
"bin_exclude" : "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/bin_exclude.hg38.gatkcov.bed.gz",
"bin_exclude_index" : "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/bin_exclude.hg38.gatkcov.bed.gz.tbi",
"clustering_config_part1" : "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/clustering_config.part_one.tsv",
Collaborator
@epiercehoffman epiercehoffman Jan 13, 2025

To what extent are these parameters optimized?

Collaborator Author

They are based on the first round of re-clustering from AoU. The only new parameter is size similarity which is set to 0.5.

@mwalker174 mwalker174 merged commit 88dbd05 into main Jan 14, 2025
9 checks passed
@mwalker174 mwalker174 deleted the mw_gatk_combine_batches branch January 14, 2025 15:41