Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with parsing a VCF file #13

Open
mashu opened this issue Sep 8, 2023 · 2 comments
Open

Problem with parsing a VCF file #13

mashu opened this issue Sep 8, 2023 · 2 comments

Comments

@mashu
Copy link

mashu commented Sep 8, 2023

Hi,

I used samtools and bcftools as well as bwa to generate VCF file.
These tools are pretty much well established and it's strange to see that they generate VCF file that I can't read with VariantCallFormat. I made a check with pyvcf 0.6.8 and I could read the file without any problems.

It appears problem was with CHROM field which contained this NC_000014.8:106641561-106641856 while POS was 1 instead of absolute position.
After writing small function that fixes first column to be NC_000014.8 and second to be 106641561+1 I could read the file.

However, accessing record.pos gives me some range which is not even correct. While it should give the position.
I checked other VCF file which didn't have any issue and is available online, and again pos returns wrong value

v[1]
VariantCallFormat.Record:
   chromosome: NC_000001.10
     position: 10001
   identifier: rs1570391677
    reference: T
    alternate: A C
      quality: <missing>
       filter: <missing>
  information: RS=1570391677 dbSNPBuildID=154 SSR=0 PSEUDOGENEINFO=DDX11L1:100287102 VC=SNV R5 GNO FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0 COMMON 
       format: <missing>

julia> v[1].pos
14:18

whereas in Python using PyVCF record.pos gives me rightfully 10001 value.

In [2]: import vcf

In [3]: vcf_reader = vcf.Reader(open('test.vcf','r'))

In [4]: r = [r for r in vcf_reader][1]

In [5]: r.POS
Out[5]: 10002

Again this is public dbSNP dataset, something very established.
This all suggests that VariantCallFormat (VariantCallFormat v0.5.5 according to Pkg.status) is broken, so as suggested I am filing a bug report.

Hope it helps to solve the problem? Do you have any tips for temporal walk-around? Or is writing custom parser the quickest ?

@rasmushenningsson
Copy link
Owner

Use VCF.pos(r) to access the position. r.POS is an internal field related to the parsing.

I'm not at a computer right now and I cannot investigate the other issue right now, but if the file is incorrectly formatted (according to the VCF specification) I think that the best option is to write a small function to fix the problem.

Files that are not formatted according to the specification is a big problem in bioinformatics. But I much prefer a strict parser where I get an error and fix the problem I'm the file manually, than one that continues silently and maybe parsed it incorrectly.

@rasmushenningsson
Copy link
Owner

After reading the specification, I've come to the conclusion that colon should be allowed in the CHROM field. I don't have much bandwidth at the moment but I'll see if I can get a fix in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants