author: Alexander Frieden date:
-
Often we would like to infer ancestry by applying Principal Component Analysis (PCA) to genome data.
-
Principal Component Analysis is a dimensionality-reduction method; here it is used to reveal clusters in the data.
-
However, whole-genome data is often unavailable, while exome data is. This method combines the off-target reads from exome sequencing with the exome data itself and makes ancestry calls from that combined data set.
-
Principal components are defined as the product of a weight vector and a genotype vector, with weights reflecting the marginal information about ancestry.
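As a small illustration of the definition above, the sketch below computes one principal component score as the dot product of a weight vector and a genotype vector. The genotype and weight values are made up for illustration; in practice the genotypes would usually be centered and scaled first.

```python
import numpy as np

# Hypothetical genotype vector for one individual:
# minor-allele counts (0, 1, or 2) at five SNPs.
genotypes = np.array([0.0, 1.0, 2.0, 1.0, 0.0])

# Hypothetical weight (loading) vector for one principal component.
weights = np.array([0.5, -0.5, 0.5, 0.0, 0.5])

# The PC score is the product (dot product) of the two vectors.
pc_score = float(weights @ genotypes)
print(pc_score)
```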
-
Targeted sequencing tends to be bad at giving ancestry information.
- We compare each sequenced sample to reference samples whose ancestry is known and whose whole-genome SNP information is known.
-
Simulate low-coverage data on worldwide and European data.
-
Take targeted sequencing data from the 1000 Genomes data set.
-
Results show continental ancestry and sometimes particular areas of European ancestry.
-
To do this they built the method LASER (Locating Ancestry from SEquence Reads).
-
Using data from the Human Genome Diversity Panel (HGDP), consisting of 938 individuals from 53 populations.
-
700 of the samples were used to construct the PCA. The first four PCs were used to identify continental groups.
-
Accuracy was assessed by comparing the ancestry estimates obtained from LASER to the PCA coordinates of test individuals computed from the original SNP genotypes, using Pearson correlation coefficients and the Procrustes similarity score.
-
Although final results were fuzzy, continental groups were well separated. Pearson correlation $r^2$ scores ranged from 0.7396 for PC4 to 0.9506 for PC1.
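A minimal sketch of how a squared Pearson correlation between two sets of PC coordinates can be computed. The coordinate values and the helper name `pearson_r2` are hypothetical; this is not LASER's own code.

```python
import numpy as np

def pearson_r2(a, b):
    """Squared Pearson correlation between two 1-D coordinate arrays."""
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)

# Hypothetical PC1 coordinates for five test individuals:
# one set from the original SNP genotypes, one estimated from sequence reads.
reference_pc1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
estimated_pc1 = np.array([-1.8, -1.1, 0.2, 0.9, 2.1])

r2 = pearson_r2(reference_pc1, estimated_pc1)
print(round(r2, 4))
```

The same function would be applied once per principal component (PC1 to PC4 in the notes above).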
-
Among the samples tested were 1 European (CEU) and 1 Yoruba (YRI) nuclear family, selected from the HapMap Project samples (each nuclear family included a mother, a father, and a child).
-
When using HGDP as the reference, both families were placed in the correct positions.
-
Took 941 Finnish samples from an exome data set. Used 470 individuals, at 8.4 million SNPs with $MAF \geq 0.01$, to build a reference map.
-
Placed the remaining 471 individuals on this reference map, using ancestry estimates derived from whole-genome sequencing data as a gold standard. How did they get whole-genome data?
-
Results from this approach were better than those from exome data alone.
-
$t_0 = 0.9763$, $r^2 = 0.9778$ for PC1, and $r^2 = 0.9259$ for PC2 for LASER
-
$t_0 = 0.8263$, $r^2 = 0.9411$ for PC1, and $r^2 = 0.4373$ for PC2 for exome
-
With off-target coverage as low as 0.001x, we can still reconstruct worldwide continental ancestry.
-
If samples are genotyped at higher density or whole-genome sequenced, the authors expect better results.
-
Simulations show that using estimated ancestry addresses imperfect matching of cases and controls.
-
They also show that when population structure is much more stratified than expected, different methods must be applied.
Extra method slides
Usually the construction of principal axes follows from the classical approach to PCA, which is applied to the scaled matrix (individuals by SNPs) of observed genotypes (AA, AB, BB; say B is the minor allele in all cases).
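The classical approach described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the procedure used by LASER: the function name `genotype_pca` is hypothetical, genotypes are coded as minor-allele counts (AA = 0, AB = 1, BB = 2), and each SNP is scaled by its sample standard deviation (other scalings are common in genetics).

```python
import numpy as np

def genotype_pca(G, k=2):
    """Classical PCA on an (individuals x SNPs) matrix of minor-allele counts."""
    G = np.asarray(G, dtype=float)
    mean = G.mean(axis=0)
    std = G.std(axis=0)
    keep = std > 0                      # drop SNPs with no variation
    X = (G[:, keep] - mean[keep]) / std[keep]   # centered, scaled matrix
    # SVD of the scaled matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = U[:, :k] * S[:k]           # PC coordinates of each individual
    loadings = Vt[:k].T                 # SNP weights for each PC
    return coords, loadings

# Hypothetical genotypes: 4 individuals at 5 SNPs.
G = np.array([[0, 1, 2, 0, 1],
              [1, 1, 2, 0, 0],
              [2, 0, 0, 1, 1],
              [2, 0, 1, 2, 2]])
coords, loadings = genotype_pca(G, k=2)
print(coords.shape, loadings.shape)
```

Each column of `coords` is one principal component; multiplying the scaled matrix by `loadings` gives the same coordinates.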
In this case we are doing: Let
We then define
To do this, let
Transpose of a matrix is indicated as
$$
\begin{gathered}
\left[
\begin{array}{ c c }
1 & 2
\end{array}
\right]^T = \left[
\begin{array}{ c }
1 \\
2
\end{array}
\right] \\
\left[
\begin{array}{ c c }
1 & 2 \\
3 & 4
\end{array}
\right]^T = \left[
\begin{array}{ c c }
1 & 3 \\
2 & 4
\end{array}
\right] \\
\left[
\begin{array}{ c c }
1 & 2 \\
3 & 4 \\
5 & 6
\end{array}
\right]^T = \left[
\begin{array}{ c c c }
1 & 3 & 5 \\
2 & 4 & 6
\end{array}
\right]
\end{gathered}
$$
It has a number of properties that you can learn in any linear algebra book. A number of these properties have parallels to group theory.
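Two of the standard transpose properties can be checked numerically. The sketch below uses NumPy; the matrix `B` is a made-up example.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # 3 x 2
B = np.array([[1, 0, 2],
              [0, 1, 3]])       # 2 x 3, hypothetical second matrix

At = A.T                        # transpose swaps rows and columns: 2 x 3

# Transposing twice returns the original matrix.
double_transpose_ok = np.array_equal(At.T, A)
# The transpose of a product reverses the order of the factors.
product_rule_ok = np.array_equal((A @ B).T, B.T @ A.T)
print(At.tolist(), double_transpose_ok, product_rule_ok)
```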
For our matrix
This is step two in our process. This step is unnecessary if you have the aligned data for the reference individuals. Why is this data not usually available?
For sample
We want to place the study sample in PCA reference space.
To do this we need to do Procrustes Analysis.
Many terms that mean the same thing:
Procrustes superimposition
Procrustes fitting
Generalized Procrustes Analysis (GPA)
Generalized least squares (GLS)
Least squares fitting
- Centers all shapes at the origin (0,0,0)
- Usually scales all shapes to the same size (usually “unit size” or size = 1.0)
- Rotates each shape around the origin until the sum of squared distances among them is minimized (similar to least-squares fit of a regression line)
- Ensures that the differences in shape are minimized
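The centering, scaling, and rotation steps listed above can be sketched for two shapes with NumPy. This is a generic ordinary Procrustes fit, not LASER's implementation: the function name `procrustes_fit` is hypothetical, and the orthogonal transformation found by the SVD may include a reflection as well as a rotation.

```python
import numpy as np

def procrustes_fit(X, Y):
    """Superimpose Y onto X; both are (points x dimensions) with matching rows.

    Returns the fitted copy of Y and D, the scaled minimum sum of
    squared distances (between 0 and 1).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # 1. Center both shapes at the origin.
    X0 = X - X.mean(axis=0)
    Y0 = Y - Y.mean(axis=0)
    # 2. Scale both shapes to unit size (Frobenius norm 1).
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    # 3. Find the orthogonal transformation that minimizes the
    #    sum of squared distances (orthogonal Procrustes via SVD).
    U, S, Vt = np.linalg.svd(Y0.T @ X0)
    R = U @ Vt
    Y_fit = S.sum() * (Y0 @ R)          # optimal rescaling is sum(S)
    D = float(np.sum((X0 - Y_fit) ** 2))
    return Y_fit, D

# Hypothetical check: Y is X rotated, scaled, and shifted, so D should be near 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = 3.0 * (X @ rot) + np.array([2.0, -1.0])
Y_fit, D = procrustes_fit(X, Y)
similarity = np.sqrt(max(0.0, 1.0 - D))
print(D, similarity)
```

The last lines form $\sqrt{1 - D}$ from the residual, which is the form of the Procrustes similarity statistic used in these notes.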
Sum of distances between corresponding landmarks of two shapes.
Using previous work (25,26) we find a transformation
Remember X is the
Set
Success is then quantified by the Procrustes similarity statistic, as we saw before.
$$
t(X,Y) = \sqrt{1 - D}
$$
Where D is the scaled minimum sum of squared Euclidean distances between
Lower Procrustes similarity corresponds to greater uncertainty and less reliable