Problem with parsing a VCF file #13

mashu · 2023-09-08T12:56:24Z

Hi,

I used samtools and bcftools as well as bwa to generate VCF file.
These tools are pretty much well established and it's strange to see that they generate VCF file that I can't read with VariantCallFormat. I made a check with pyvcf 0.6.8 and I could read the file without any problems.

It appears problem was with CHROM field which contained this NC_000014.8:106641561-106641856 while POS was 1 instead of absolute position.
After writing small function that fixes first column to be NC_000014.8 and second to be 106641561+1 I could read the file.

However, accessing record.pos gives me some range which is not even correct. While it should give the position.
I checked other VCF file which didn't have any issue and is available online, and again pos returns wrong value

v[1]
VariantCallFormat.Record:
   chromosome: NC_000001.10
     position: 10001
   identifier: rs1570391677
    reference: T
    alternate: A C
      quality: <missing>
       filter: <missing>
  information: RS=1570391677 dbSNPBuildID=154 SSR=0 PSEUDOGENEINFO=DDX11L1:100287102 VC=SNV R5 GNO FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0 COMMON 
       format: <missing>

julia> v[1].pos
14:18

whereas in Python using PyVCF record.pos gives me rightfully 10001 value.

In [2]: import vcf

In [3]: vcf_reader = vcf.Reader(open('test.vcf','r'))

In [4]: r = [r for r in vcf_reader][1]

In [5]: r.POS
Out[5]: 10002

Again this is public dbSNP dataset, something very established.
This all suggests that VariantCallFormat (VariantCallFormat v0.5.5 according to Pkg.status) is broken, so as suggested I am filing a bug report.

Hope it helps to solve the problem? Do you have any tips for temporal walk-around? Or is writing custom parser the quickest ?

The text was updated successfully, but these errors were encountered:

rasmushenningsson · 2023-09-08T17:05:15Z

Use VCF.pos(r) to access the position. r.POS is an internal field related to the parsing.

I'm not at a computer right now and I cannot investigate the other issue right now, but if the file is incorrectly formatted (according to the VCF specification) I think that the best option is to write a small function to fix the problem.

Files that are not formatted according to the specification is a big problem in bioinformatics. But I much prefer a strict parser where I get an error and fix the problem I'm the file manually, than one that continues silently and maybe parsed it incorrectly.

rasmushenningsson · 2023-09-12T07:15:43Z

After reading the specification, I've come to the conclusion that colon should be allowed in the CHROM field. I don't have much bandwidth at the moment but I'll see if I can get a fix in.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem with parsing a VCF file #13

Problem with parsing a VCF file #13

mashu commented Sep 8, 2023

rasmushenningsson commented Sep 8, 2023

rasmushenningsson commented Sep 12, 2023

Problem with parsing a VCF file #13

Problem with parsing a VCF file #13

Comments

mashu commented Sep 8, 2023

rasmushenningsson commented Sep 8, 2023

rasmushenningsson commented Sep 12, 2023