decomposing and subsetting vcfs
slivar expects VCFs to be decomposed so that multi-allelic variants are split into separate variants. This should result in a single variant (and a single alternate allele) per line.
In order to do this correctly, the VCF may need to be adjusted so that the AD field, which indicates the (A)llelic (D)epths, will be decomposed properly. In older versions of GATK, the AD header will appear as:
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
This leaves no way for tools to know how to decompose the variant. Instead, it should be changed to:
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
Number=R was only added to the VCF spec in version 4.2, but both bcftools and vt will correctly decompose the AD field once the header is adjusted to the line above.
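To make the meaning of Number=R concrete, here is a minimal sketch of how an AD value would be split at a site with two alternate alleles. The AD value is made up for illustration, and the awk one-liner only mimics the per-allele bookkeeping; it is not how bcftools or vt are implemented.

```shell
# Hypothetical AD value at a site with two alternate alleles: ref, alt1, alt2.
ad="0,10,12"

# With Number=R the first value belongs to the reference allele and each
# following value to one alternate allele, so each split record keeps the
# reference depth plus the depth of its own alternate allele.
split=$(printf '%s\n' "$ad" | awk -F, '{ for (i = 2; i <= NF; i++) print $1 "," $i }')

printf '%s\n' "$split"
# prints:
# 0,10
# 0,12
```

With Number=. a tool has no such mapping between values and alleles, which is why the header needs to be changed first.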
The decomposition can be accomplished in a stream of commands with:
zcat $gatk_vcf \
| sed -e 's/ID=AD,Number=\./ID=AD,Number=R/' \
| bcftools norm -m - -w 10000 -f $fasta -O b -o $clean_bcf
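As a quick sanity check, the sed substitution above can be tried on the header line by itself before running the full pipeline. This sketch only uses printf and sed; the sample record is hypothetical and is there to show that data lines pass through unchanged.

```shell
# Old-style GATK header line, copied from the example above.
old='##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">'

# Apply the same substitution used in the pipeline.
new=$(printf '%s\n' "$old" | sed -e 's/ID=AD,Number=\./ID=AD,Number=R/')
printf '%s\n' "$new"

# A data line never contains "ID=AD,Number=.", so it is left untouched.
# (hypothetical record, for illustration only)
record=$(printf 'chr1\t100\t.\tA\tC,T\t50\tPASS\t.\tGT:AD\t1/2:0,10,12\n')
same=$(printf '%s\n' "$record" | sed -e 's/ID=AD,Number=\./ID=AD,Number=R/')
```

After the substitution, the header line should read Number=R, matching the corrected header shown earlier.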
It is also important to change the header before subsetting by sample (e.g. with bcftools view -s).
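One way to keep that ordering is to do the header fix, the decomposition, and the sample subsetting in a single stream. The sketch below is an assumption about how the steps might be chained, not a command taken from the slivar documentation; fix_then_subset is a hypothetical helper name and its three arguments are placeholders.

```shell
# Hypothetical helper: fix the AD header, decompose, then subset by sample.
# $1 = gzipped VCF, $2 = reference fasta, $3 = comma-separated sample names.
# Writes BCF to stdout.
fix_then_subset() {
    gzip -dc "$1" \
      | sed -e 's/ID=AD,Number=\./ID=AD,Number=R/' \
      | bcftools norm -m - -w 10000 -f "$2" -O u - \
      | bcftools view -s "$3" -O b -
}
```

It could then be called as, for example, fix_then_subset "$gatk_vcf" "$fasta" "kid,mom,dad" > "$clean_bcf", where the sample names are placeholders.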
The default JavaScript functions in the slivar repo rely heavily on having the correct values in the AD field in order to calculate allele balance, so this is a critical step.
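To show why a correct AD matters, here is a minimal sketch of an allele balance calculation for one sample at a decomposed site. It assumes allele balance is the alternate depth divided by the sum of reference and alternate depths; the AD value is made up, and the actual expression lives in the slivar JavaScript functions rather than in this snippet.

```shell
# AD for one sample at a decomposed (single alternate allele) site: ref,alt.
# The numbers are made up for illustration.
ad="12,8"

# Assumed definition: allele balance = alt / (ref + alt); guard against zero depth.
ab=$(printf '%s\n' "$ad" | awk -F, '{ tot = $1 + $2; if (tot > 0) printf "%.2f", $2 / tot; else printf "NA" }')

printf 'AD=%s AB=%s\n' "$ad" "$ab"
# prints: AD=12,8 AB=0.40
```

If AD were left undecomposed (for example 12,8,5 on a split record), a two-value calculation like this would silently use the wrong depths.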