Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated variants in a pangenome VCF #1493

Open
Han-Cao opened this issue Oct 2, 2024 · 5 comments
Open

Duplicated variants in a pangenome VCF #1493

Han-Cao opened this issue Oct 2, 2024 · 5 comments

Comments

@Han-Cao
Copy link

Han-Cao commented Oct 2, 2024

Hi,

In the output VCF of MC pipeline, there are a few duplicated variants after left-algin. Below is one example from the HPRC VCF:

Before left-align (bcftools view -r chr22:15409352-15409435 hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz):

chr22   15409352        >45360363>45360365      C       CCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAAAGTGTTTGTCC        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<45360364>45360365;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
chr22   15409435        >45360367>45360369      A       AAAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCA        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360367>45360369,>45360367>45360368>45360369;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0

After bcftools norm -f ref.fa, these 2 insertions are exactly the same:

chr22   15409341        >45360363>45360365      G       GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<4536
0364>45360365;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|0     0|0     
0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
chr22   15409341        >45360367>45360369      G       GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360367>45360369,>45360367>4536
0368>45360369;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|1     0|0     
0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0

Does is mean these 2 are the same insertion?

As the genotypes of these 2 insertions are different, to remove duplicated records from the VCF, should I keep one of them (like bcftools norm --rm-dup exact) or merge their genotypes (i.e., 2 AC=1 records merge into 1 AC=2 record)?

Thanks a lot!

@glennhickey
Copy link
Collaborator

You can try merging them with vcfcreatemulti. This tools is part of vcftools and should be available in the cactus docker image.

@Han-Cao
Copy link
Author

Han-Cao commented Oct 3, 2024

I have tried to use vcfcreatemulti to merge them. The output is:

chr22   15409341        >45360363>45360365      G       GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA,GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA  60      .       AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<45360364>45360365;NS=45;LV=0;combined=15409341-15409341 GT      0       0|0     0|0     0|0     0|0     0|0     0|0    0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0    0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0

But there are 2 problems:

  1. it create a multi-allelic record with the same ALT alleles
  2. the genotype of the second variant is not merged (maybe a bug for duplicated alleles?)

I was planning to do something like:

chr1    100    var1    G    GAA    0|0    0|1    1|0
chr1    100    var1    G    GAA    0|0    1|0    0|0

merge into

chr1    100    var1    G    GAA    0|0    1|1    1|0

If you think this strategy is OK to handle duplicated variants in MC output VCF, I can write some script to do it.

Many thanks.

@glennhickey
Copy link
Collaborator

That's disappointing about vcfcreatemulti. I was hoping it was to do something like what you said you were planning to do, which seems fine. If you do come up with a general script for doing this, please share and I will look at incorporating it into cactus...

@Han-Cao
Copy link
Author

Han-Cao commented Oct 4, 2024

I just wrote a prototype python script. You can find it here

It is now designed for phased VCF only:

  • check duplicate by CHROM, POS, REF, ALT
  • merge phased haplotypes: 1|0 + 0|1 -> 1|1, save merged variant ID in INFO/DUP_ID
  • don't merge 2 variants if genotype conflict on any haplotype: 1|0 + 1|0 -> no merge
  • raise warning when missing genotype merged with non-missing genotype: 1|0 + 0|. -> 1|0 with warning in log

For below VCF:

#CHROM	POS	ID	REF	ALT	INFO	FORMAT	CHM13	Sample1	Sample2	Sample3
chr1	5	var1	A	TTT	AN=7;AF=0.285714;AC=2	GT	0	0|0	1|0	0|1
chr1	5	var2	A	TTT	AN=7;AF=0.142857;AC=1	GT	1	0|0	0|0	0|0
chr1	5	var3	A	TTT	AN=6;AF=0;AC=0	GT	.	0|0	0|0	0|0
chr1	5	var4	A	TTT	AN=7;AF=0.142857;AC=1	GT	0	0|0	1|0	0|0
chr1	5	var5	A	TTT	AN=7;AF=0.285714;AC=2	GT	0	0|0	0|1	0|1
chr1	5	var6	A	C	AN=6;AF=0.333333;AC=2	GT	0	0|0	1|0	.|1
chr1	6	var7	A	TTT	AN=6;AF=0.333333;AC=2	GT	0	0|0	1|0	.|1

It output

#CHROM	POS	ID	REF	ALT	INFO	FORMAT	CHM13	Sample1	Sample2	Sample3
chr1	5	var1	A	TTT	AN=7;AF=0.428571;AC=3;DUP_ID=var2:var3	GT	1	0|0	1|0	0|1
chr1	5	var4	A	TTT	AN=7;AF=0.428571;AC=3;DUP_ID=var5	GT	0	0|0	1|1	0|1
chr1	5	var6	A	C	AN=6;AF=0.333333;AC=2	GT	0	0|0	1|0	.|1
chr1	6	var7	A	TTT	AN=6;AF=0.333333;AC=2	GT	0	0|0	1|0	.|1

I also tested the current script on HPRCv1.1 chr22:

> bcftools norm -r chr22 -f ref.fa hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz -Ou | bcftools sort -Oz -o hprc.chr22.vcf.gz
> python merge_duplicates.py -i hprc.chr22.vcf.gz -o hprc.chr22.out.vcf.gz

[2024-10-04 11:35:01] - [INFO]: Merge duplicated variants from: hprc.chr22.vcf.gz
[2024-10-04 11:35:01] - [WARNING]: missing genotype conflict at chr22:15226565: >45350772>45350778_5 and >45350778>45350782_3
...
[2024-10-04 11:35:18] - [INFO]: Read 439622 variants
[2024-10-04 11:35:18] - [INFO]: 2108 variants are merged
[2024-10-04 11:35:18] - [WARNING]: 28 variants have non-missing genotypes merged with missing genotypes
[2024-10-04 11:35:18] - [INFO]: 440 variants are not merged due to haplotype conflict
[2024-10-04 11:35:18] - [INFO]: Write 437514 variants to hprc.chr22.out.vcf.gz

I am still considering how to handle duplicate variants with conflict genotypes:

  1. 1|0 vs 1|0 :
  • don't merge, write both to output (now)
  • don't merge, write the first one (like bcftools norm, make sure no duplicate in output)
  • merge anyway, output 1|0
  1. 0|1 vs .|0 :
  • merge to 0|1 with warning (now, is it safe?)
  • don't merge, write both / first

For duplicate variants with 1|0 vs 1|0, do you think these are really redundantly called by cactus or different variants with the same VCF representation due to the limitation of VCF?

@glennhickey
Copy link
Collaborator

Wow, this looks really promising, thanks for sharing! I will take a closer look when I have a free minute (which unfortunately won't be for a few days), but am very interested in incorporating it if possible into cactus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants