Error: bcftools consensus fail with overlapping variants #132

arianna-smith · 2024-02-16T23:47:59Z

When running artic minion --medaka, we are occasionally running into an issue where there are overlapping pass variants from the different read groups that is causing bcftools consensus to fail and an empty consensus file is generated. I did see similar issues 1, 2, but those seem to be caused by an overlap of a pass and fail variant.

We have had this issue on two medaka models, r1041_e82_400bps_hac_v4.2.0 and r941_min_hac_g507, but did not encounter this issue with the model r941_min_high_g360. The models have been updated alongside our basecaller and have been confirmed with
medaka tools resolve_model --auto_model.

Log:
Running:
bcftools consensus -f sample.preconsensus.fasta sample.pass.vcf.gz -m sample.coverage_mask.txt -o sample.consensus.fasta
The fasta sequence does not match the REF allele at MN908947.3:669:
REF .vcf: [GT]
ALT .vcf: [G]
REF .fa : [GN]TAC.........
Command failed:
bcftools consensus -f sample.preconsensus.fasta sample.pass.vcf.gz -m sample.coverage_mask.txt -o sample.fasta

merged.vcf:

POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
669	.	GT	G	500	PASS	DP=1126;AC=2,44;AM=1080;MC=0;MF=0.000;MB=0.000;AQ=4.42;GM=1;PH=6.02,6.02,6.02,6.02;SC=None;	GT:GQ:DP:PS:UG:UQ	1/1:103:1126:.:1/1:102.88
670	.	T	G	403	PASS	DP=1126;AC=25,37;AM=1064;MC=0;MF=0.000;MB=0.000;AQ=6.71;GM=1;PH=6.02,6.02,6.02,6.02;SC=None;	GT:GQ:DP:PS:UG:UQ	0/1:128:1126:.:0/1:127.13

sample.1.vcf:

CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
MN908947.3	670	.	T	G	59.694	PASS	.	GT:GQ	1:60

sample.2.vcf:

CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
MN908947.3	669	.	GT	G	7.539	PASS	.	GT:GQ	1:8
MN908947.3	673	.	CG	C	5.769	PASS	.	GT:GQ	1:6

sample.primertrimmed.rg.sorted.bam:

The text was updated successfully, but these errors were encountered:

BioWilko · 2024-02-21T14:29:45Z

This is caused by a quirk of applying variants iteratively to the reference, I'll have to think about how to handle this case properly, medaka introducing nonsense frameshifts like this is a little concerning though...

There's definitely some indications that using medaka for r10.4.1 data might actually be counterproductive for hac/sup basecalling https://rrwick.github.io/2023/05/05/ont-only-accuracy-with-r10.4.1.html.

Any thoughts on the medaka side @cjw85 @SamStudio8?

cjw85 · 2024-02-21T15:21:17Z

medaka introducing nonsense frameshifts like this is a little concerning though

Medaka (like many variant callers) contains no explicit knowledge of biology so the attribution of a variant it proposes leading to a frameshift is not unsurprising. My recollection is that fieldbioinformatics explicitely excludes indels that aren't a multiple of 3 bases long?

This is caused by a quirk of applying variants iteratively to the reference

Medaka is fundamentally a consensus caller, not a variant caller in the modern sense that people typically understand. To produce a VCF file, medaka:

takes the consensus that it has calculated
compares this to the reference sequence
identifies runs of contiguous matches
outputs VCF records at breakpoints in the matches

(There's some subtleties in 4. to do with specifics of the VCF spec, and ensuring left alignment and variant normalization).

It should be the case that bcftools consensus can always be run on the output of medaka variant; I would be interested to examine a data from a case where that appears to not be true.

BioWilko · 2024-02-21T15:35:01Z

That's a good point re filtering out of frame indels, the variants in sample.2.vcf shouldn't have made it into sample.pass.vcf.gz due to the variants having been removed by vcf_filter, are you able to share the pass variants file rather than the merged one (merged VCF is produced pre vcf_filter)?

arianna-smith · 2024-02-21T20:09:16Z

@BioWilko @cjw85 Here is the only one of those three mutations from the pass.vcf file

CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
MN908947.3	669	.	GT	G	500	PASS	DP=1126;AC=2,44;AM=1080;MC=0;MF=0.0;MB=0.0;AQ=4.42;GM=1;PH=6.02,6.02,6.02,6.02;SC=None;	GT:GQ:DP:PS:UG:UQ 1/1:103:1126:.:1/1:102.88

arianna-smith · 2024-02-21T20:16:01Z

There's definitely some indications that using medaka for r10.4.1 data might actually be counterproductive for hac/sup basecalling https://rrwick.github.io/2023/05/05/ont-only-accuracy-with-r10.4.1.html.

@BioWilko Ah, I don't think we were aware of this. Thanks for linking this paper.

arianna-smith mentioned this issue Feb 17, 2024

Medaka updates for new 10.4.1 grid chemistry CDPHE-bioinformatics/CDPHE-SARS-CoV-2#12

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error: bcftools consensus fail with overlapping variants #132

Error: bcftools consensus fail with overlapping variants #132

arianna-smith commented Feb 16, 2024

BioWilko commented Feb 21, 2024 •

edited

Loading

cjw85 commented Feb 21, 2024

BioWilko commented Feb 21, 2024 •

edited

Loading

arianna-smith commented Feb 21, 2024

arianna-smith commented Feb 21, 2024 •

edited

Loading

Error: bcftools consensus fail with overlapping variants #132

Error: bcftools consensus fail with overlapping variants #132

Comments

arianna-smith commented Feb 16, 2024

BioWilko commented Feb 21, 2024 • edited Loading

cjw85 commented Feb 21, 2024

BioWilko commented Feb 21, 2024 • edited Loading

arianna-smith commented Feb 21, 2024

arianna-smith commented Feb 21, 2024 • edited Loading

BioWilko commented Feb 21, 2024 •

edited

Loading

BioWilko commented Feb 21, 2024 •

edited

Loading

arianna-smith commented Feb 21, 2024 •

edited

Loading