-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prioritize het calls when merging clustered SVs #9058
Conversation
Add mCNV tiebreaker and unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for the update yesterday. Left a couple of minor questions but resolve as you see fit
GenotypeBuilder.create( | ||
"sample", | ||
Lists.newArrayList(Allele.NO_CALL, Allele.NO_CALL), | ||
createGenotypeTestAttributesWithCNQ(2, 2, 30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we prefer a hom ref CN for MCNVs but non-ref for biallelic sites? That seems a bit inconsistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True is is inconsistent. I think one could argue for preferring the copy states farther from neutral, which would make the copy state distribution wider. Here I've chosen to, after considering RD quality, prefer copy states closer to neutral. This is just a more conservative prior. I don't think it makes sense, however, to have both het>hom-ref and het>hom-var for mCNVs since they are common by definition such that losing variants is not as problematic as for rare variants (which constitute most of the call set).
There are other technical reasons for wanting to prefer het over hom-ref with non-depth-only calls as well (e.g. breakpoint shifting on SR signatures), whereas mCNVs are usually depth-only.
@@ -644,25 +644,7 @@ public Object[][] collapseSampleGenotypesTestData() { | |||
createGenotypeTestAttributes(2) | |||
) | |||
}, | |||
// het preferred over hom ref even with lower gq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should still be true right? I wasn't sure why this case got dropped, unless I didn't see one it's redundant with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I added this back in
Makes some small adjustments to tiebreaker criteria for merging genotypes with the
CanonicalSVCollapser
, which is used in theSVCluster
andGroupedSVCluster
tools. We've empirically found, after considering ref vs non-ref and GQ, that prioritizing hets over hom-var genotypes improves variant quality in terms of hardy-weinberg equilibrium. This is also the preferred behavior for GATK-SV, as false hom-var genotypes (when truly a het) is considered one of the more egregious error modes, particularly in association studies.A similar adjustment is made to multi-allelic CNV merging, in which copy states closer to reference are now preferred.
In addition, a tiebreaker has been added in the case of conflicting copy states that are equally far from reference, e.g. merging a variant with copy states 1 and 3 on a diploid chromosome. In this case when copy state qualities are also equal (which should be rare), deletion copy states are prioritized over duplication. That is, copy state 1 would be chosen in the example. This is justified by the higher accuracy observed in large depth-only deletions compared to duplications. Since there are no other criteria on which to assess multi-allelic CNVs, this extra comparison is needed for stability.