-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to indel calling. #1679
Conversation
Some performance figures on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam (approx 30x coverage) data:
There is a slight shift in FN vs FP figures here at Q30 cutoff, but the new code at a slightly reduced Q25 shows it beats develop on all 3 of FP, GT and FN figures.
In both develop and this PR, it's still very much favouring recall over precision for the CCS data. We get a much more balanced starting point with |
I see my tweak to filter bcf_cgp_find_types candidates by the -m and -F parameters was incorrect and it was passing if -m or -F were accepted, whereas the actual filtering applied post removal is -m and -F conditions must be met. It worked initially, but at some point I incorrectly changed it. Given the defaults of Tweaking this (still to be pushed) gives these plots (tweak in "PR-b"): [EDIT: fixed scaling as I had the wrong total number of indels in my gnuplot script.] The effect of this filtering is we lose some recall, pretty much as expected. I am guessing it's the case where only a single alignment has an indel according to the CIGAR pileup, but post realignment the number of candidates reported matching that indel type is >= 2. Or possibly it's for GT cases 1/2 where -m and -F are the combined AD values for all types combined and not applied per type? Regardless, given these are command line parameters that can be reduced if needed, the default behaviour looks very strong as a false positive removal mechanism. I'll do a bit more testing and then push an update (plus rebase to bring in the CI testing fix). |
The fixed +/- indel_win_size is still used in construction of the type[] array, but then we reduce that size down before doing the alignments and evaluating them. This means, where appropriate (nice unique data without lots of STRs) we can assess over a smaller window and avoid the negative impact of other nearby indels. It cures the example covid19 problem, but also reduces recall elsewhere as if we *do* still get other nearby indels (eg due to low complexity data between our candidate indel and a neighbouring one) then we're now paying a much larger normalised per-base hit as the length is smaller.
First 30MB of SynDip chr1: SNP Q>0 / Q>=30 / Filtered InDel TP 4915 / 4817 / 4796 InDel FP 334 / 266 / 226 InDel GT 265 / 249 / 245 InDel FN 1528 / 1626 / 1647 vs devel: InDel TP 4910 / 4828 / 4807 InDel FP 266 / 192 / 169 InDel GT 240 / 230 / 226 InDel FN 1533 / 1615 / 1636 So not a win apparently. However it's a starting point, and many of the FP/FN are related to differing calls at correct locations.
Multiple del types need to use ref_cons based on the biggest del not the current one, so we're not incorporating other bits of deletion in the current "run". Bug fix to seq-pos ("y") when doing cons_base vs ref_base on INS.
The define INS_PLUS_BASE is enabled, which is how the old code worked so no change. This includes the next base matching call in the last base of the insertion, so it's *either* cons_base *or* cons_ins. However we instead have an easier to understand alternative which is an easier to understand with cons_ins being the inserted bases and cons_base always being used for the matching bases. (This corrects some defects for reads than end on an insertion with no flanking M, although that is rare.) Unfortunately, for reasons unknown, it is very slightly higher on FP/GT. (Hence we still use the old method)
We have the most common consensus plus a sub-consensus for when neighbouring heterozygous indels are present. This is acting as an alternative to using the old ref_sample method.
Also tweak ins vs N merge threshold.
- We previously started our consensus with 1 depth copy of ref_sample. This brings in data from the overall consensus for where one specific indel 'type' doesn't have data spanning the entire region. We now ensure a minimum of 1-fold depth is copied from the ref_base data instead. It's still not quite as good results, but good enough and paves the way for removal of a lot of existing code. - The bcf_call_glfgen code had a heuristic to change type calls to 0 (REF) where the read doesn't have an indel at that location, under the assumption our assessment was bad. This may have been helpful in the past, but with better consensus calculation our type calls are now more robust. Instead we change such data to be unclassified types instead so they don't count to AD and PL. This is still a fudge. It may be preferable to leave them as-is, and/or filter by some other means such as whether they span STRs. We get a 7% rise in FP value, but a 2% drop in FN and 16% drop in GT assignment errors. TODO: study those new FPs in more detail.
Also a small correction to consensus rfract so it's a minimum rather than additive amendment.
Ie cnum==1 is always opposite of cnum==0 for hets.
The INS_PLUS_BASE define is complicated code, and validation now shows the differences are very marginal. So disabled for now and will cull soon so we only have the simpler variant left.
- local_band is computed only over left..right part of alignment instead of from the entire CIGAR string. - Fix for p->indel == type matching. This isn't sufficient to filter out when we're in the deletion being studied if it starts eg 1 base earlier, where p->indel is now zero, but p->is_del is true. - Use band size computed from all observed indels and STRs, not "type". - Replace bca->indel_win_size/2 for long reads with an indel/STR aware limit. - Replace the tbeg/tend calculations with a indel/STR aware setting.
- Fixed (?) out by one error in candidate del caller: Was "i >= pos-left" and now "i > pos-left". This was incorrectly not marking the last base in a het del as used when immediately followed by an insertion (or maybe it was vice versa). - The same line as above decided to add to either cons_base vs ref_base using "biggest_del", but this fails when we have nested deletions present. Eg -10 at pos 105 and -1 at pos 107. We now track containing deletions and set a skip_to coordinate instead of relying on "biggest_del". - Alter the fraction at which we label a heterozygous indel as heterozygous from .2 to .3. (Shift10h7j in notes)
Also bug fix "len==type" check in deletion. It should be "len==-type".
Also fixed a minor memory leak (approx 36 bytes per 1MB of genome).
Similarly for insertion. These aren't exact duplicates, as it's overall and per-sample, but close enough and the doubling up of work was accidental.
If the sequence being inserted has differing base-calls to the flanking sequence, then est_seqQ's analysis of homopolymer runs and how they impact the score is irrelevant as the insertion isn't ambiguous anyway. This has a small reduction to FN on PacBio CCS with minimal change to FP.
- We erroneously had the -m and -F pre-filter options applies as needing either to pass instead of both to pass in bcf_cgp_find_types. Fixing this is a significant reduction to FP rates. - The computation of which STRs spanned the indel being assessed was sometimes a little bit out, due to using "qpos" instead of "pos". This removes a few false negatives. - Improve the calculation of indel cost ("iscore") by counting the number of additional repeat units beyond the end of the sequence. The old penalty for STRs at the end of reads has been removed, meaning we regain some FNs while still penalising (more) the cases that lead to FPs. Overall there is some small FP / FN adjustments, but the biggest improvement here is a significant drop in GT errors (~10% fewer).
dfc1861
to
16834c7
Compare
Updated with some tweaks to improve things further (as mentioned above):
PacBioCCS plot is being generated atm. |
Some raw numbers, as percentages of the truth set. Columns are all (qual>0), QUAL>=30 and QUAL>=30 plus extra filters (IDV, IMF, DP).
The percentage change vs the develop branch is significant for FP and FN at all depths. GT assignment errors are up for shallow data, although in absolute numbers it's still fewer new GT errors than we've recovered in FN, so this is largely just an issue of low coverage true variants having a significant portion with the wrong genotype assignment. At higher depths the GT error really plummets. |
We already had this elsewhere, but forgot this case. It trips up on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam at chr1:112149167 which has a 13KB deletion.
PacBio CCS stats: This broke at 112MB into chr1 as it found a 12KB deletion (now fixed). Rerunning, but the shape of the graph is unlikely to change. There are very high false positive and false negative rates on this data. We can use e.g. SynDip chr1 stats: At 15x the recent additional commit has removed some true variants, so elevated FN rate, but it's the only change which is consistently better on FN vs FP tradeoff. Frankly all of them have terrible FN miss rates. At 45x the new PR diverges and starts to become poorer than current develop due to higher FP rates. However this only occurs at QUAL >=40 or so and the FN rates are still so enormous as to not be practically usable. I think this is likely just the nature of the truth set, and I'm being overly pedantic in my accuracy assessment. If I recall there is a caveat in SynDip about not trusting small indels due to the nature of the underlying PacBio assembly and homopolymers, so it may simply be plain wrong on them. |
Finally, assuming my revised pacbio test runs to completion, I think I'm done with this PR. What I haven't tested is multi-sample calling, simply because I don't know of appropriate truth sets to use (maybe a trio though?) and we don't have a written best practice guide for it so I don't know the normal method of doing this. Do we just do it in one command, or produce gVCFs and merge? We really need to document this pipeline, but that's a job for someone who knows what they're doing to do. :-) So I'll leave such validation up to you. |
This fixes a read buffer-overrun, although I'm baffled how we didn't hit this with the original code as it's never apparently known where the reference ends.
This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.
The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller.
This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.
The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller.
This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.
The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller. Also tidied up the usage statement and added an explicit "-X list" to print the profile parameters. - Add extra debugging defines. GLF_DEBUG reports GLF calculation in bam2bcf.c. ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment. NB to use this you need to link against edlib itself rather than the cutdown version in this repository. Also fix the edlib heuristics used in bcf_call_glfgen. We don't want to change the call (b=5) as this affects AD. Instead we change the quality so poor calls get filtered by QUAL rather than simply being removed. - Tweak edlib tuning for SeqQ/qual. Add quality value assessment into soft-clip recovery. Use /500 instead of /111 in indelQ assignment, and skew indel-bias accordingly. This gives better separation of FP/GT/FN generally. - Added --seqq-offset parameter so we can use it in tunables per profile. This is used as a limit on the seqQ reduction in the "VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ scores when depth is sufficient. Experimentation showed no single value that worked for all platforms, but the default is in the middle. - Tidy up to cull ifdefed and commented out code. - Add test for indels-cns. It's minimal, but the whole indel calling has minimal testing. I think we have under 10 indels in total with develop (all short read and mostly duplications of each other), and no testing of indels-2.0. This tests 4 indels with indels-cns. - Added documentation for the new --indels-2.0 options - Cull more unused parts of edlib.c. This avoids clang warnings (which become errors with -Werror). We're only including the bits we need here for mpileup. If you want the whole thing, link against the upstream source library instead.
This is a combination of PR #1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.
The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR #1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller. Also tidied up the usage statement and added an explicit "-X list" to print the profile parameters. - Add extra debugging defines. GLF_DEBUG reports GLF calculation in bam2bcf.c. ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment. NB to use this you need to link against edlib itself rather than the cutdown version in this repository. Also fix the edlib heuristics used in bcf_call_glfgen. We don't want to change the call (b=5) as this affects AD. Instead we change the quality so poor calls get filtered by QUAL rather than simply being removed. - Tweak edlib tuning for SeqQ/qual. Add quality value assessment into soft-clip recovery. Use /500 instead of /111 in indelQ assignment, and skew indel-bias accordingly. This gives better separation of FP/GT/FN generally. - Added --seqq-offset parameter so we can use it in tunables per profile. This is used as a limit on the seqQ reduction in the "VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ scores when depth is sufficient. Experimentation showed no single value that worked for all platforms, but the default is in the middle. - Tidy up to cull ifdefed and commented out code. - Add test for indels-cns. It's minimal, but the whole indel calling has minimal testing. I think we have under 10 indels in total with develop (all short read and mostly duplications of each other), and no testing of indels-2.0. This tests 4 indels with indels-cns. - Added documentation for the new --indels-2.0 options - Cull more unused parts of edlib.c. This avoids clang warnings (which become errors with -Werror). We're only including the bits we need here for mpileup. If you want the whole thing, link against the upstream source library instead.
Closing as this was superceded by #2045 which has was merged in as |
The bcf_call_glfgen code had a heuristic to change indel type calls to 0 (REF) where the read doesn't have an indel at that location, under the assumption our assessment was bad.
This may have been helpful in the past, but with better consensus calculation our type calls are now more robust. Instead we change such data to be unclassified types instead so they don't count to AD and PL.
This is still a fudge, but overall it's beneficial. We get a 7% rise in FP value, but a 2% drop in FN and huge 16% drop in GT assignment errors.
Improved bcf_cgp_find_types, which determines the number of candidate indel sizes present, to consider per_sample_flt and min_fract parameters. Previously these were only filtered out after every read has been realigned, but filtering earlier reduces the number of incorrect indel assignments and improves accuracy.
It likely also the cause of PacBio calling speeding up, as we have a low-level of indel errors in the data.
Replaced the old bcf_cgp_ref_sample code with a new bcf_cgp_consensus algorithm.
The original logic starting with the real reference and by walking through read CIGAR strings amending any SNPs. It did this once per sample. Then per hypothetical indel type, this "sample reference" was amended at the indel site itself by inserting or deleting bases. Each read was then aligned against this modified ref to evaluate how well it matched.
Unfortunately this is simply incorrect when there are neighbouring indels present, as only the central under-investigation indel ever made it into the modified reference.
The new code produces up to two consensus sequences per sample, based on all of the SEQ+CIGAR elements (including indels). This copes with evaluating e.g. a homozygous insertion flanked by a heterozygous deletion (hence two consensuses). A small dose of real reference is applied to, to cope with low coverage artefacts.
This is the heart of these indel calling changes and the primary reason for fixing some obvious looking indel calls that were
previously missed. However it's still not a replacement for a full denovo reassembly and alignment errors can still lead to incorrect results, but by and large we now fail on messy data instead of occasionally on good.
bcf_cgp_align_score can now be given two references, and we return the alignment score that's best. The band sizing is also a little different, now operating on a band related to all observed indel types and not just the specific indel type being assessed and also taking into account the total insertions and deletions in nearby positions gathered during consensus construction.
Improved est_seqQ. This is a naive sequence quality assignment based on the size of indel vs the size of homopolymer at that position. This is reasonable for a deletion, but for an insertion it didn't check that the inserted sequence actually matched the homopolymer base type.
TODO: homopolymer is too naive. It should be looking at indel in STR repeat unit terms.
Improved the indel window-size adjustments for long reads. This worked because long reads rarely have alignment end effects where the aligner has produced reference bias by soft-clipping a few trailing bases. However it's now resized more accurately based on the total proportion of bases covered by STRs within the window. On average this is smaller than the old "indel_win_size/2" method so further speeds up alignment of long read data.
Lots of additional commenting. Some explaining how things work, and a few with "TODO" ideas for potential improvements to explore.
The various bits of commented out debugging code are now more cleanly wrapped up in a CONS_DEBUG define, to ease enabling and disabling.
Fixed a few minor memory leaks on failure. These likely never occurred, or if they did were immediately followed by program exit anyway.
Reduced the default indel_win_size from 110 to 80. With better consensus construction it was found that a smaller value worked better. Also there is no longer a hard lower limit of 110. It's now possible to override this down to 20.
Improved the find_STR function to be able to spot repeat units up to 14 bases. It was previously 8.
Please advise on whether you wish it squashed. It'll be a huge change then, but equally so it's not so useful to look at all the commits as they contain ideas which were subsequently removed again. Squashing just some of it, and in particular reordering to group things together, will be problematic as it'll lead to a myriad of conflicts due to adding and removing exploratory code.
Replaces #1648
Some stats on GIAB HG002 chr1 truth set. Showing percentages for all data, QUAL>=30 and QUAL>=30 with additional DP filtering.
30x ROC curve (FP/FN percentage rates)
Edit: culling 15x plot as it erroneously was for all of chr1 despite missing a chunk in my run of develop branch. Reproducing this VCF now.