Automatic Mitochondrial DNA Copy Number (AutoMitoC) Pipeline to infer mtDNA-CN from genotyping arrays
- Background
- Key Features
- Method Overview
- Pre-Processing
- Necessary Inputs
- Quick Start
- Areas of Future Development
- Contact information
- Citation
- Data Availability
Mitochondria have their own unique DNA due to their origins as ancient bacteria. Due to this property, the ratio of mitochondrial to nuclear DNA signals, dubbed as the mitochondrial DNA copy number (mtDNA-CN) has emerged as a popular disease biomarker of mitochondrial etiology. One way to quantify mtDNA-CN is through genotyping arrays which enables not only the estimation of mtDNA-CN but also the capability to simultaneously interrogate its common genetic determinants. Previous array-based methods involve steps that are subjective, that require manual curation or extra lab testing, and that have not been extensively validated in different ethnicities. Accordingly, this served as the motivation to develop the AutoMitoC pipeline in order to enable fast, automated, and robust estimation of mtDNA-CN from genotyping arrays in all populations.
AutoMitoC builds off of the MitoPipeline (Lane et al, 2014) with three key distinguishing features:
- Nuclear signal is approximated by globally rare variants in place of common variants
- Cross-hybridizing probes are identified by empirically assessing evidence for cross-hybridization
- The primary estimate of mitochondrial (MT) signal is ascertained using principal component analysis (PCA)
For further details on how AutoMitoC was developed or the rationale behind these features, please see the main manuscript.
This figure was created using lucid charts
- We recommend applying some basic PLINK QC for autosomal variants:
plink --bfile your_study \
--max-maf 0.01 \
--geno 0.05 \
--out your_study_qc
*Note that we have seen comparable performance at even lower max-maf thresholds (e.g. --max-maf 0.001 with ~ 10000 autosomal probes) but this is also dependent on your sample size.
-
Follow step 1 of the PennCNV pipeline to generate signal intensity files containing L2R and BAF values.
-
Correct signal intensities (MT and Auto L2R) for GC-waves using the PennCNV implementation of the Diskin et al (2008) adjustment.
-
Remove "wavy" samples with a post GC-corrected autosomal L2R standard deviation of > 0.35.
The core AutoMitoC Rscript assumes all pre-processing steps have been completed and requires the following inputs to be specified:
- Path to the Autosomal L2R matrix ("EXAMPLE.AUTO_MATRIX.csv.gz"; comma separated; rows as samples; columns as autosomal probe L2R; no header or sample IDs)
- Path to the MT L2R matrix ("EXAMPLE.MITO_MATRIX.csv.gz" comma separated; rows as samples; columns as mitochondrial probe L2R; no header or sample IDs)
- Path to the Sample descriptor file ("EXAMPLE.SAMPLE_FILE.csv.gz"; comma separated; rows as samples; header line) containing the following columns:
- Column 1: "SAMPLE" --> unique sample IDs in the order that they appear in the MT and Autosomal L2R matrices
- Column 2: "SEX" --> self-reported gender (male=1;female=2)
- Column 3: "AGE" --> age in years
- Column 4 (Optional): "MITO_BENCHMARK" --> secondary mtDNA-CN measurement for comparison (e.g. qPCR, WGS, digital PCR). Missing values are perimssible and should be coded as 'NA'.
- Number of cores to parallelize the PCA steps (10)
- Output Path / Output File Name Prefix ("TOY_EXAMPLE")
For file formatting examples, please download the toy dataset in the subsequent "Quick Start" section.
- Download the R package and decompress:
git clone https://github.com/GMELab/AutoMitoC
cd AutoMitoC/scripts
- Install requisite R libraries:
R
install.packages(c("data.table","ggplot2","parallel"))
q()
n
- Run the R script as follows:
Rscript AUTOMITOC_V1_RSCRIPTS_CLEAN.r EXAMPLE.AUTO_MATRIX.csv.gz EXAMPLE.MITO_MATRIX.csv.gz EXAMPLE.SAMPLE_FILE.csv.gz 10 EXAMPLE_TOY
- If AutoMitoC is run successfully on the toy dataset, you should see something similar to the following output:
[1] "AutoMitoC Script v1.0 Initialized @ 2021-11-26 02:26:47"
Loading required package: data.table
Loading required package: ggplot2
Loading required package: parallel
[1] "Reading in Input Arguments..."
[1] "Step 1. Preliminary PCA Correction of Autosomal Probe Intensities ... "
[1] "The Top 74 Autosomal Principal Components Explain ~70% variance in Probe Intensities and will be corrected for "
null device
1
[1] "... Running Probe Correction with 10 core(s)"
[1] "Step 2. Probe Cross-hybridization Check ... "
[1] "Removing 127 Autosomal probes with potential cross-hybridization to the sex (R > 0.05) or mitochondrial (R > 0.05) genomes "
[1] "Removing 3 Mitochondrial probes with potential cross-hybridization to the sex (R > 0.20) or autosomal (R > 0.05) genomes "
[1] "*** Note that due to the robust epidemiological association between mtDNA-CN and sex, there is an expectation that mitochondrial probes will be associated with sex and therefore the correlation coefficient threshold for removing mitochondrial probes with evidence of cross-hybridization to sex chromosomes is more stringent"
[1] "Step 3. Final PCA of Clean Autosomal Probe Intensities ..."
[1] "The Top 72 Autosomal Principal Components Explain ~70% variance in Probe Intensities and will be corrected for "
null device
1
[1] "... AutoMitoC mtDNA-CN estimates are finished!"
[1] "... Benchmarking Complete!"
[1] "AutoMitoC vs. user-supplied mtDNA-CN measurement performance: R = 0.647; Association P-value = 1.67e-164"
[1] "AutoMitoC Script Finished @ 2021-11-26 02:27:02"
[1] "Script Duration: 0 hrs, 0 mins, 16 secs ..."
AutoMitoC remains under active development and we plan to make improvements in the following areas in the near future:
- The adjustment for autosomal principal components selects k top PCs accounting for 70% of the variance in autosomal signal intensities, which approximated the inflection point on the scree plot and worked well for us in practice on two independent datasets and different arrays; however, in the future, we plan to make this threshold less arbitrary and empirically determine the k PCs corresponding to the inflection point.
- AutoMitoC has been benchmarked using multiple Affymetrix arrays and we plan to also test using Illumina arrays.
- We will allow for greater flexibility to enable benchmarking of AutoMitoC estimates to other phenotypic correlates of mtDNA-CN (e.g. blood cell composition)
Any queries pertaining to the AutoMitoC scripts can be addressed to: Michael Chong ([email protected]) and Guillaume Paré ([email protected])
Created with BioRender.com
We applied AutoMitoC to the UK Biobank and performed GWAS for buffy coat mtDNA-CN including up to 395,781 participants. Full genome-wide summary statistics (even the sub-bonferroni associations!) are available for download from the GWAS catalogue