Harnessing population-specific protein-truncating variants to improve the annotation of loss-of-function alleles

This repository contains all data and code pertinent to the analysis presented in the paper.

Skitchenko R.K., Kornienko J.S., Maksiutenko E.M., Glotov A.S., Predeus A.V. and Barbitoff Y.A. (2020) Harnessing population-specific protein-truncating variants to improve the annotation of loss-of-function alleles. bioRxiv

This project is dedicated to identifying population-specific PTVs (psPTVs) and building a predictive model to enhance the annotation of low-confidence loss-of-function (LoF) alleles.

psPTVs are strongly enriched for low-confidence putative LoF (pLoF) variants. A notable example of a psPTV that is a misannotated low-confidence pLoF variant is rs139297920 in the PAX3 gene (see pax3.png).

The psPTV_project.Rmd R Markdown file forms the body of the analysis and contains a complete explanation of all data analysis steps. All files included in this repository are used in this R Markdown file.

The LoFfeR folder contains the source code and reference materials of the LoFfeR toolkit, which was developed to enhance the annotation of low-confidence pLoF alleles. See below for usage details.

The shet_estimation directory contains scripts used for s-het inference and for calculating the PTV count distribution likelihood and the observation likelihood score. IG_params.py fits the inverse Gaussian distribution to the data; estimate_S.py estimates the best-fit s-het value; and CPL.py calculates the likelihood of a PTV count distribution and conducts sampling.

cassa_table.csv is the main supporting table of the Cassa et al., 2017 paper.

clv*.tsv are tab-separated data files containing the validation dataset of benign and confident pathogenic ClinVar variants. These data files are used during model training and evaluation.

covered_genes.tsv lists the coordinates of genes that have at least 30x mean coverage of coding regions in gnomAD.

full_data_constr_pext.tsv is the main tab-separated data file containing the final annotated set of gnomAD PTVs used throughout the analysis.

full_aggregated.tsv, full_corr_wshet.tsv, full_out_boot_stats.csv, and comprehensive_gene_annotation.tsv are delimited text files with the aggregated gene-level PTV counts. comprehensive_gene_annotation.tsv is the final file containing all necessary annotations.

merged_counts.tsv is the GENCODE v19-derived tab-separated file containing per-gene isoform and exon counts.

genes_inheritance.txt is a simplified OMIM-derived data file listing the disease associations of the genes.

LoFfeR usage

Note. The tool is under active development. Please refer to bioRxiv for updates.

LoFfeR is a tool to predict low-confidence pLoF variants in a VCF file. To run it, you will need R v.3.6+ with the randomForest package and Python 3.6+ with the pandas and numpy packages installed.
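The Python and R requirements can be checked before the first run. The snippet below is a minimal sketch that is not part of LoFfeR; the function names are hypothetical. It only verifies that the Python version is recent enough, that pandas and numpy can be located, and that an Rscript executable is on the PATH. It does not check the R version or the randomForest package, which would need to be verified from within R.

```python
import importlib.util
import shutil
import sys


def missing_python_packages(names=("pandas", "numpy")):
    """Return the package names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(minimum=(3, 6)):
    """Return True if the running interpreter is at least version `minimum`."""
    return sys.version_info[:2] >= minimum


def rscript_available():
    """Return True if an Rscript executable is found on the PATH."""
    return shutil.which("Rscript") is not None


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Missing Python packages:", missing_python_packages() or "none")
    print("Rscript found on PATH:", rscript_available())
```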

Installation

To install LoFfeR, simply clone this repository and grant execution permissions to the main executable file:

git clone https://github.com/mrbarbitoff/ptv_project_bi
cd ptv_project_bi
chmod +x ./LoFfeR/LoFfeR

Running LoFfeR

To annotate low-confidence pLoF variants, use the following command:

./LoFfeR/LoFfeR <VCF>

By default, LoFfeR expects a VEP-annotated VCF file to be passed as the first and only argument. The file should be annotated using the LOFTEE plugin (LoF), as only variants passing the LOFTEE filter will be considered. LOFTEE fields are expected to be the trailing ones in each annotation.
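One way to confirm that an input file meets these expectations is to inspect the CSQ declaration in the VCF header, where VEP lists the annotation sub-fields after "Format:". The following is a rough sketch and not part of LoFfeR. It assumes the standard LOFTEE field names (LoF, LoF_filter, LoF_flags, LoF_info) and interprets "trailing" as meaning that these four are the last four sub-fields; the helper names are hypothetical.

```python
import gzip
import re

# Field names added by the LOFTEE plugin (assumed to be the standard set).
LOFTEE_FIELDS = ("LoF", "LoF_filter", "LoF_flags", "LoF_info")


def read_header(path):
    """Return the '##' meta-information lines of a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    lines = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            lines.append(line.rstrip("\n"))
    return lines


def csq_fields(header_lines):
    """Return the CSQ sub-field names declared in the header, or None if absent."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                return match.group(1).strip().split("|")
    return None


def loftee_is_trailing(fields):
    """Return True if the last four CSQ sub-fields are exactly the LOFTEE fields."""
    if fields is None or len(fields) < len(LOFTEE_FIELDS):
        return False
    return set(fields[-len(LOFTEE_FIELDS):]) == set(LOFTEE_FIELDS)
```

For example, `loftee_is_trailing(csq_fields(read_header("input.vcf")))` returns False when the file has no CSQ declaration or when other plugin fields follow the LOFTEE ones; `input.vcf` is a placeholder file name.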

Currently, LoFfeR supports only GRCh37 (b37) VCF files. We are working to add support for the GRCh38 reference as soon as possible.

LoFfeR results are written to the results.tsv tab-separated file.
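The output can be loaded with any tab-separated reader. The sketch below uses Python's standard csv module and assumes that the first line of results.tsv is a header row; the column names are not documented here, so the code reports whatever names it finds rather than relying on specific ones. The function name is hypothetical.

```python
import csv


def load_results(path="results.tsv"):
    """Read a tab-separated results file and return (column_names, rows).

    Assumes the first line is a header row; each row is returned as a dict
    keyed by the column names found in that header.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return reader.fieldnames, rows


if __name__ == "__main__":
    columns, rows = load_results()
    print("Columns:", columns)
    print("Number of variants:", len(rows))
```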
