Manu edited this page Jan 9, 2020 · 6 revisions

Description

HaploTypo is a pipeline designed to perform variant calling and to assign alleles to a given haplotype when a phased reference genome of a closely related organism is available. With HaploTypo you can obtain genome heterozygosity and variant phasing information without going through the demanding traditional de novo genome phasing process.

The pipeline is divided into four independent modules, each consisting of a separate Python script. The modules perform the following steps:

(1) Independent read mapping to each of the haplotypes using BWA-MEM (Li, 2013)
(2) Independent variant calling using GATK (McKenna et al. 2010), bcftools (Li et al. 2011) or freebayes (Garrison and Marth, 2012)
(3) Inference of the alternative variants for each haplotype
(4) Reconstruction of the final alternative haplotypes

First, HaploTypo performs read mapping and variant calling of a given sample independently on each phased reference haplotype (modules 1 and 2). Then, HaploTypo module 3 compares the phased reference haplotypes with the resulting VCF files and phases the variants based on the called alleles and genotypes. Table 1 lists all possible cases that HaploTypo may encounter when comparing variants called in each of the two haplotypes, and indicates how they are handled and encoded in the output files from module 3. Finally, HaploTypo module 4 takes the information provided by module 3 and generates the reconstructed haplotypes in FASTA format. Detailed information on each of these modules is available in the HaploTypo manual.
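The reconstruction step performed by module 4 can be pictured with a minimal sketch. This is not the code of haplomaker.py: the function names, the dictionary of SNPs and the record name are assumptions made for illustration, and only single-base substitutions are handled.

```python
def apply_snps(reference, snps):
    """Return a copy of `reference` with SNP alleles substituted.

    `snps` maps 1-based positions (as used in VCF) to a single-base allele.
    Hypothetical helper, not part of HaploTypo.
    """
    seq = list(reference)
    for pos, allele in snps.items():
        if not 1 <= pos <= len(seq):
            raise ValueError("position %d is outside the sequence" % pos)
        seq[pos - 1] = allele
    return "".join(seq)


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))


# Toy example: two inferred alternative alleles on one reference haplotype.
reconstructed = apply_snps("ACGTACGT", {2: "T", 8: "A"})
print(to_fasta("sample_hapA", reconstructed).rstrip())
```

Because only substitutions are applied, the reconstructed sequence keeps the coordinates of the reference haplotype it was built from.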

Scenarios

Table 1. White and grey background colors in the table represent the two haplotypes in each case scenario. Ref: reference allele provided by the 'sample.pass.snp.vcf' files, or by the 'haplotype.fasta' file when no SNP was called in one of the haplotypes; Alt: alternative allele provided by the 'sample.pass.snp.vcf' files; GT: genotype provided by the 'sample.pass.snp.vcf' files; iAlt: alternative allele inferred by HaploTypo module 3 for each haplotype of an individual different from the reference; Category: "phased" - HaploTypo can assign an allele to each haplotype, "unphased" - HaploTypo cannot assign an allele to each haplotype, or "unsolved" - HaploTypo found an incompatibility in the alleles or genotypes reported for each haplotype; Reported: "no" - position and alternative allele for this haplotype are not reported, 'sample.corrected_amb1/2.vcf' - file where alternative alleles are reported using an ambiguity code or random assignment (-amb 1 and -amb 2 options respectively, depending on the user's choice), 'sample.unphasedVariants_amb0/1/2.bed' - file where unphased positions are reported, 'sample.unsolved_amb0/1/2.bed' - file where unsolved positions are reported.
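The two ways of reporting an unphased position mentioned in the caption, an ambiguity code or a random assignment, can be illustrated with a short sketch. The IUPAC two-base codes are standard; the function names and the way the alleles are passed in are assumptions, and the exact behaviour of the -amb 1 and -amb 2 options is described in the HaploTypo manual.

```python
import random

# Standard IUPAC codes for pairs of nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def ambiguity_code(allele_a, allele_b):
    """Encode two alleles as one IUPAC symbol (hypothetical helper)."""
    if allele_a == allele_b:
        return allele_a
    return IUPAC[frozenset((allele_a, allele_b))]


def random_assignment(allele_a, allele_b, rng=random):
    """Assign the two alleles to the two haplotypes in random order."""
    pair = [allele_a, allele_b]
    rng.shuffle(pair)
    return pair[0], pair[1]


print(ambiguity_code("A", "G"))  # R
print(random_assignment("A", "G", random.Random(0)))
```

An ambiguity code keeps the uncertainty visible in the output, whereas a random assignment yields plain nucleotides at the cost of an arbitrary phase at that position.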

FAQs

Why should I use HaploTypo?

HaploTypo is a tool to phase variants based on previously obtained phasing information for a closely related individual (i.e. from the same species). HaploTypo provides a fast and accurate assessment of heterozygosity levels and reconstruction of haplotypes.

When should I use HaploTypo?

If you have short-read sequencing data for a sample and a phased reference genome is available for your species, you can benefit from running HaploTypo. This pipeline performs read mapping and variant calling, assesses haplotype correspondence for each variant, and reconstructs phased haplotypes. You can also use HaploTypo as a wrapper for variant calling tools on a non-phased reference (see below “Can I run HaploTypo if I do not have a phased reference genome?”).

What is the difference between running HaploTypo and a standard read mapping and variant calling pipeline on my phased reference?

Currently, if you want to perform variant calling and know which haplotype a given allele belongs to, and a phased genome is available for your species, the only option is to align the reads to the two haplotypes at the same time and call variants assuming a haploid model (as far as we know, no variant caller currently assumes that both copies of a chromosome are present in the reference). This strategy is error-prone because, besides the usual inter-strain variability, the chances of cross-mapping are high when divergence between the two reference haplotypes is low. The alternative is to map the genomic reads and perform variant calling independently on each of the phased haplotypes, as HaploTypo does. With this strategy, the efficiency of read mapping and variant calling is still influenced by the distance between the analyzed individual and the one used to phase the genome, but the “haplotype divergence” factor is removed. Although this approach is more accurate, phasing information can be lost. HaploTypo was designed to combine the best of both worlds by retrieving phasing information after independent read mapping on the two haplotypes.

What type of data do I need to run the whole HaploTypo pipeline?

HaploTypo requires phased reference haplotypes in separate FASTA files, and filtered genomic reads in FASTQ format for the sample under analysis. As an alternative to the FASTQ files, HaploTypo accepts the corresponding BAM files with the independent read alignments to each of the haplotypes. In this case, the pipeline starts at module 2.

Can I run HaploTypo if I do not have a phased reference genome?

Yes, if you do not want to obtain phasing information, a single haploid reference can be provided. In this case, you can only run modules 1 and 2.

Do my phased haplotypes have to correspond to the sample I am analyzing?

No. The phased haplotypes only have to be close enough to your sample for most of your reads to be mapped.

Why is HaploTypo organized into separate modules?

HaploTypo is designed so that each module can easily be run separately: read mapping (mapping.py), variant calling (var_calling.py), phasing (VCFcorr_alleles.py) and reconstruction of the phased haplotypes in FASTA format (haplomaker.py). This way, if you are interested only in a specific module, there is no need to run the whole pipeline.

How were the filters applied for variant calling chosen?

By default, the variant filtration process follows the best practices described by the authors of each of the variant callers. If you would like to change the parameters, you can edit the command line present in module 2.

What is the coordinates table and when do I need it?

The coordinates table is a tab-separated file indicating the position correspondence between the two haplotypes. It is required as input when the two haplotypes do not have a one-to-one correspondence due to the presence of INDELs.
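To see why INDELs break the one-to-one correspondence, consider a minimal sketch that derives matching positions from a gapped pairwise alignment of the two haplotypes. The function name, the toy alignment and the three-column output are assumptions for illustration only; the column layout that HaploTypo actually expects is defined in its manual.

```python
def position_pairs(aligned_a, aligned_b, gap="-"):
    """List 1-based (pos_a, pos_b) pairs for the aligned, ungapped columns.

    Hypothetical helper: both inputs are rows of the same pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    pos_a = pos_b = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a != gap:
            pos_a += 1
        if base_b != gap:
            pos_b += 1
        if base_a != gap and base_b != gap:
            pairs.append((pos_a, pos_b))
    return pairs


# Haplotype B lacks two bases that are present in haplotype A.
for pos_a, pos_b in position_pairs("ACGGTA", "AC--TA"):
    # Assumed layout: sequence name, position in A, position in B.
    print("chr1\t%d\t%d" % (pos_a, pos_b))
```

In this toy case position 5 of haplotype A corresponds to position 3 of haplotype B, so comparing the two VCF files by raw position would pair unrelated sites.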

Does HaploTypo deal with inversions and translocations?

Yes, just make sure that the position correspondence between chromosomes for complex variants is included in the coordinates table.

Why does HaploTypo not generate the coordinates table for me?

Obtaining the coordinate correspondence between two haplotypes requires a genome alignment step. This alignment is highly dependent on genomic features such as haplotype divergence and genomic rearrangements, so we believe that an automated script would be risky. We therefore strongly suggest that you carefully adapt your method for obtaining such a table to the features of your phased genome.

Why is the running time of HaploTypo variable?

Running time depends mostly on the level of sequence divergence between the haplotypes, on strain variability, and on the software used for variant calling. As expected, the higher the divergence between the two haplotypes (i.e. the higher the number of SNPs), the longer the computation time. Among the variant callers, HaplotypeCaller (GATK) is the most time-consuming according to our tests.

How does HaploTypo deal with complex genomic regions which might not be correctly phased?

As with any other tool, HaploTypo's performance depends on the quality of the input, which in this case is a previously phased genome. Because HaploTypo uses the allele present in each reference haplotype to attribute a variant to a given haplotype, an incorrectly phased heterozygous position in the reference can naturally affect the result. However, it is important to note that HaploTypo checks whether the variants or genotypes called for each haplotype at a given position are compatible. If the information retrieved from each haplotype is not compatible (see Table 1 above for details), it likely results from a mapping or variant calling error, and the position is reported in a separate output file.
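One way to picture such a compatibility check is to compare the sample alleles implied by the call on each haplotype. The sketch below is a simplified illustration under stated assumptions (biallelic SNPs, diploid genotypes, hypothetical function names); it does not reproduce the full set of rules in Table 1 or the code of VCFcorr_alleles.py.

```python
def sample_alleles(ref, alt, gt):
    """Alleles implied by one call: GT index 0 is ref, index 1 is alt.

    A position with no called SNP can be passed as (ref, None, "0/0"),
    with `ref` taken from the haplotype FASTA.
    """
    by_index = {"0": ref, "1": alt}
    return sorted(by_index[i] for i in gt.replace("|", "/").split("/"))


def calls_compatible(call_a, call_b):
    """True if both haplotype-specific calls imply the same pair of alleles."""
    return sample_alleles(*call_a) == sample_alleles(*call_b)


# The reference haplotypes carry A and G; the sample is heterozygous A/G.
print(calls_compatible(("A", "G", "0/1"), ("G", "A", "0/1")))  # True
# The two mappings disagree on the sample's homozygous allele.
print(calls_compatible(("A", "T", "1/1"), ("G", "C", "1/1")))  # False
```

Under these assumptions, a disagreement means the two mappings describe different genotypes for the same sample, which is the kind of position the caption of Table 1 labels "unsolved".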

What happens if the genomic reads represent other haplotypes which are very different from my reference haplotypes?

HaploTypo accepts genomic data from any individual as input, as long as it is closely related to the reference genome (i.e. from the same species). Naturally, the pipeline can only assign variants to the haplotypes that exist in the reference. However, it allows for variants not present in the reference, and can even leave variants unassigned if the genomic context of a SNP does not allow it to be placed in one of the haplotypes. Therefore, the existence of another haplotype very different from those in the reference may result in i) difficulty mapping to the reference, suggesting that we are in the presence of a different species rather than a different haplotype, or ii) many unphased variants in the final result, which already suggests that the phased reference is not representative of the genome under analysis.

My sample is triploid. Can I run HaploTypo?

The HaploTypo pipeline was designed to deal with diploid samples. For samples with other ploidies, only modules 1 and 2 should be run.

In which version of Python is HaploTypo implemented?

HaploTypo is implemented in Python 2.7 and 3.5. You just need to choose the version you prefer.

I do not have an operating system compatible with HaploTypo. How can I run it?

HaploTypo is available in Docker. Install Docker on your computer and you will be able to obtain the proper environment to run HaploTypo. Check the HaploTypo manual for more details.
