VCF header fields missing #189

Open
Krannich479 opened this issue May 17, 2023 · 3 comments


@Krannich479

Hello Fritz & SURVIVOR dev team,

What.

I attempted to use the VCF file generated by SURVIVOR with bcftools, as recommended for instance in issue #173. However, bcftools fails with an error that I think originates from SURVIVOR's output.

Error.

When using bcftools (sort+index) on SURVIVOR's truthset VCF, I get warnings and an error regarding the VCF header not matching the FORMAT fields.

bcftools sort -o simulated.sorted.vcf.gz -O z simulated.vcf; bcftools index -t simulated.sorted.vcf.gz

Writing to /tmp/bcftools-sort.9eyTIT
[W::vcf_parse_format] FORMAT 'GL' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GQ' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'FT' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RC' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DV' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RV' is not defined in the header, assuming Type=String
[E::bcf_write] Unchecked error (2), exiting
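The mismatch behind these warnings can also be checked without bcftools. Below is a minimal sketch, assuming a tab-separated VCF with the FORMAT column in position 9; `undeclared_format_keys` is a hypothetical helper name, and the demo file is made up (only the FORMAT string is taken from this issue).

```shell
#!/bin/sh
# Sketch: list FORMAT keys used in records but never declared in the header.
# undeclared_format_keys is a hypothetical helper name, not a bcftools feature.
undeclared_format_keys() {
    awk -F'\t' '
        /^##FORMAT=<ID=/ {
            id = $0
            sub(/^##FORMAT=<ID=/, "", id)   # drop the prefix
            sub(/[,>].*$/, "", id)          # drop everything after the ID
            declared[id] = 1
            next
        }
        /^#/ { next }                       # skip all other header lines
        NF >= 9 {                           # column 9 is the FORMAT column
            n = split($9, keys, ":")
            for (i = 1; i <= n; i++)
                if (!(keys[i] in declared) && !(keys[i] in seen)) {
                    seen[keys[i]] = 1
                    print keys[i]
                }
        }
    ' "$1"
}

# Made-up demo VCF; only the FORMAT string is taken from the issue.
demo=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
    printf 'chr1\t100\tsnp1\tA\tC\t.\tPASS\tPRECISE\tGT:GL:GQ:FT:RC:DR:DV:RR:RV\t1/1\n'
} > "$demo"

undeclared_format_keys "$demo"
```

On the demo file this prints the same eight keys that bcftools warns about, one per line. On a real file, an empty result would suggest that every FORMAT key in use has a header declaration.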

I looked into the code at

convert << "\t.\tPASS\tPRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1\tGT:GL:GQ:FT:RC:DR:DV:RR:RV\t";
where the FORMAT field for SNP variant records is written. If the print_vcf_header2 function above it is the one that writes the corresponding header, then the FORMAT fields indeed do not match.

Solution.

I propose two trivial solutions here:

  1. Removing the FORMAT fields from the SNP variant records.
  2. Adding the FORMAT fields to the function that generates the header.

I am voting for solution 1 here because:
a) I think FT, RC, DR, DV, RR, RV are ancient relics of Lumpy unrelated to SNPs. Also these FORMAT fields have been commented out throughout most of SURVIVOR.
b) Fields such as RC, DR, DV, RR and RV are not part of the VCF 4.2 standard, and no FORMAT field should be present if it is not defined in the header and used.
c) I tested that the VCF file generated by SURVIVOR works flawlessly with bcftools if the fields are removed from SNP records. (see hotfix below)

Hotfix.

sed -i 's/GT:GL:GQ:FT:RC:DR:DV:RR:RV/GT/g' simulated.vcf

where simulated.vcf is the VCF generated by SURVIVOR.
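The sed command replaces the string anywhere in the file and edits it in place. A more conservative variant is sketched below: it touches only the FORMAT column (column 9) of record lines and writes to a new file. `strip_format_to_gt` is a hypothetical helper name. Trimming the sample columns to their first subfield is an assumption for the case where they carry colon-separated values; this issue does not show what SURVIVOR writes there, so the demo record is made up.

```shell
#!/bin/sh
# Sketch: rewrite only the FORMAT column (column 9) of record lines.
# strip_format_to_gt is a hypothetical helper name.
strip_format_to_gt() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }                # pass header lines through unchanged
        $9 == "GT:GL:GQ:FT:RC:DR:DV:RR:RV" {
            $9 = "GT"
            # Assumption: keep only the first (GT) subfield of each sample column.
            for (i = 10; i <= NF; i++) sub(/:.*/, "", $i)
        }
        { print }
    ' "$1"
}

# Made-up demo VCF; only the FORMAT string is taken from the issue.
demo=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
    printf 'chr1\t100\tsnp1\tA\tC\t.\tPASS\tPRECISE\tGT:GL:GQ:FT:RC:DR:DV:RR:RV\t1/1:.:.:.:.:.:.:.:.\n'
} > "$demo"

strip_format_to_gt "$demo" > "$demo.gt_only"
cat "$demo.gt_only"
```

Writing to a separate file keeps the original SURVIVOR output available for comparison.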

@rl4940

rl4940 commented Jun 14, 2024

I am running into exactly the same problem. However, I did not simulate SNPs, because I do not intend to study them; only SVs are of interest to me.
So does that mean I can use the following command to fix it?

sed -i 's/GT:GL:GQ:FT:RC:DR:DV:RR:RV/GT/g' simulated.vcf

Thanks

@rl4940

rl4940 commented Jun 14, 2024

OK, I figured it out. Besides that sed command, we also need to add header lines like these manually:

##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise variant">
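Adding these lines by hand can be scripted. The sketch below inserts a header line directly before the #CHROM line unless an identical line is already present, so running it twice does no harm. `add_header_line` is a hypothetical helper name and the demo file is made up; the two header lines are the ones quoted above.

```shell
#!/bin/sh
# Sketch: insert one header line before #CHROM unless it is already present.
# add_header_line is a hypothetical helper name.
add_header_line() {
    # $1 = VCF path, $2 = complete "##..." header line
    if grep -qxF -e "$2" "$1"; then
        cat "$1"                            # already declared: emit unchanged
    else
        awk -v line="$2" '/^#CHROM/ { print line } { print }' "$1"
    fi
}

# Made-up demo VCF without the two declarations.
demo=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\tsnp1\tA\tC\t.\tPASS\tPRECISE\n'
} > "$demo"

add_header_line "$demo" '##FILTER=<ID=LowQual,Description="Low quality">' > "$demo.step1"
add_header_line "$demo.step1" '##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise variant">' > "$demo.fixed"
cat "$demo.fixed"
```

Each call writes to a new file rather than editing in place, which makes it easy to diff the result against the original.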

@Krannich479
Author

Hi @rl4940,
the issue I had, the hotfix I proposed in this issue, and PR #190 all aim to correct the format of SNV records only. However, I suspect that if your callset does not use the additional fields (GL:GQ:FT:RC:DR:DV:RR:RV), it is probably safe to use the sed command from above.


(disclaimer: I am not a maintainer of SURVIVOR, I just tried to fix a bug here)
