Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VariantContext has inconsistent behavior on getting allele type with spanning deletions #1686

Open
rickymagner opened this issue Oct 2, 2023 · 0 comments

Comments

@rickymagner
Copy link
Contributor

Description of the issue:

For "multiallelic" sites containing a spanning deletion alleles (represented by *), the getType method in VariantContext behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.

If the * allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.

Your environment:

  • version of htsjdk: 3.0.1
  • version of java: Temurin-17.0.7+7
  • which OS: MacOS Ventura 13.6

Steps to reproduce

Set up the following files for a complete example.

example.fasta:

>chr1
AAAAAAAAAA

Run samtools faidx example.fasta and samtools dict example.fasta > example.dict. Then make the following vcf as example.vcf:

##fileformat=VCFv4.3
##contig=<ID=chr1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample
chr1	3	variant1	A	AAA,*	30	.	.	GT	1/2
chr1	4	variant2	A	C,*	30	.	.	GT	1/2
chr1	5	variant3	AAA	A,*	30	.	.	GT	1/2

Run bcftools view example.vcf -o example.vcf.gz to compress it and bcftools index -t example.vcf.gz. Now it's ready to run in GATK to see what happens. Run
gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()'
and
gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'

As expected, gatk-snp.vcf.gz looks like (with bcftools view -H):

chr1	4	variant2	A	C,*	30	.	.	GT	1/2

but gatk-indel.vcf.gz looks like:

chr1	5	variant3	AAA	A,*	30	.	.	GT	1/2

In other words, the site with <INS>,* is excluded while <DEL>,* is included when looking for INDEL sites.

Expected behaviour

I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the <INS>,* type should equally apply to <DEL>,*. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."

Actual behaviour

See above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant