
Variant Normalization

Jacobo Coll Moragón edited this page Nov 9, 2015 · 8 revisions

Overview

A genomic variant is represented by a locus (chromosome + position), a reference sequence and a list of alternates.
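As a rough illustration of that representation, the fields can be held in a small immutable record. This is only a sketch: the class name `Variant`, its field names, and the 1-based position convention are assumptions made for this example, not the data model used by the library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical container for the fields described above."""
    chromosome: str              # e.g. "chr1"
    position: int                # assumed: 1-based position of the first reference base
    reference: str               # reference sequence at the locus
    alternates: Tuple[str, ...]  # one or more alternate sequences

    def __str__(self) -> str:
        # Render as chromosome:position:reference:alternates
        alts = ",".join(self.alternates)
        return f"{self.chromosome}:{self.position}:{self.reference}:{alts}"


v = Variant("chr1", 100, "AC", ("AT",))
print(v)  # chr1:100:AC:AT
```

Making the record frozen means two variants with the same fields compare equal and can be used as dictionary keys, which is convenient once representations have been normalized.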

It is common, because of the VCF specification, for the reference and alternate fields to contain extra bases that are not needed to represent the variant. It is completely valid to specify a variation like chr1:100:AC:AT, which is exactly the same variant as chr1:101:C:T.
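One way to convince yourself that the two notations describe the same change is to apply each of them to a stretch of reference sequence and compare the results. The sketch below does that; the helper `apply_variant` and the reference bases in `region` are invented for illustration and do not come from any real genome or library.

```python
def apply_variant(reference_seq: str, start: int, position: int, ref: str, alt: str) -> str:
    """Return reference_seq with `ref` at `position` replaced by `alt`.

    `start` is the 1-based genomic coordinate of reference_seq[0].
    """
    offset = position - start
    # Refuse to apply a variant whose reference allele does not match the sequence.
    if reference_seq[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the reference sequence")
    return reference_seq[:offset] + alt + reference_seq[offset + len(ref):]


# Hypothetical reference bases for chr1:98-103 (made up for this example).
region = "GGACTT"
region_start = 98

a = apply_variant(region, region_start, 100, "AC", "AT")  # chr1:100:AC:AT
b = apply_variant(region, region_start, 101, "C", "T")    # chr1:101:C:T
print(a, b, a == b)  # GGATTT GGATTT True
```

Both notations produce the same altered sequence, so they describe the same variant even though their position, reference and alternate fields all differ.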

The same genomic variant can be written in a very large number of equivalent ways, so the representation must be normalized before two representations can be compared to determine whether they are the same variant or different variants.

Steps

Variant normalization performs several steps on each variant to produce a fully normalized representation.

Chromosome naming:
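Trimming: bases shared by the reference and the alternate are removed and the position is adjusted accordingly. In each example below, the first line is the input and the second line is the normalized result.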

Simple trimming

chr1   .   100     CTC     CCC
chr1   .   101     T       C

Deletions

chr1   .   100     AT      A
chr1   .   101     T       -

Insertions

chr1   .   100     A       AT
chr1   .   101     -       T

Ambiguous trimming

chr1   .   100     AAA     A
chr1   .   100     AA      -

Complex trimming

chr1   .   100     ATC     ACCC
chr1   .   101     T       CC
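
The trimming examples can be reproduced with a short routine. This is a minimal sketch, not the library's implementation: the function name `trim` is hypothetical, and the order of operations (shared trailing bases first, then shared leading bases) is inferred from the ambiguous example, where the position stays at 100. An empty allele is written as `-`, as in the examples.

```python
from typing import Tuple


def trim(position: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove bases shared by ref and alt; return (position, ref, alt)."""
    # Drop shared trailing bases first; this does not move the position.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, advancing the position for each one.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        position += 1
    # An allele left empty after trimming is written as "-".
    return position, ref or "-", alt or "-"


print(trim(100, "CTC", "CCC"))  # (101, 'T', 'C')
print(trim(100, "AT", "A"))     # (101, 'T', '-')
print(trim(100, "A", "AT"))     # (101, '-', 'T')
print(trim(100, "AAA", "A"))    # (100, 'AA', '-')
```

Trimming the suffix before the prefix is what keeps the ambiguous case anchored at position 100. Trimming the prefix first would give position 101 with the same alleles, so a fixed order is needed for equivalent inputs to end up identical.
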
Multi-allelic split: