-
Notifications
You must be signed in to change notification settings - Fork 50
/
Linkage_disequilibrium.Rmd
194 lines (159 loc) · 7.58 KB
/
Linkage_disequilibrium.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
```{r,message=FALSE,echo=FALSE}
html <- TRUE
library("knitcitations")
library("knitr")
cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
bib <- read.bibtex("bibtexlib.bib")
opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE, cache = TRUE, fig.width = 7, fig.height = 7)
if (html) opts_chunk$set(out.width = "700px", dpi = 300)
# use this to set knitr options:
# http://yihui.name/knitr/options #chunk_options
```
---
title: "Linkage disequilibrium"
subtitle: "*NJ Grünwald, ZN Kamvar and SE Everhart*"
---
In this chapter we will formally test if populations are in linkage
disequilibrium or not. This test is useful to determine if populations are
clonal (where significant disequilibrium is expected due to linkage among loci)
or sexual (where linkage among loci is not expected). The null hypothesis tested
is that alleles observed at different loci are not linked if populations are
sexual while alleles recombine freely into new genotypes during the process of
sexual reproduction. In molecular ecology we typically use the index of
association or related indices to test this phenomenon.
# The index of association
The index of association ($I_A$) was originally proposed by Brown et al.
`r citep(bib[c("brown1980multilocus")])` and implemented in the *poppr* R
package `r citep(bib[c("kamvar2014poppr")])` using a permutation approach to
assess if loci are linked as described previously by Agapow and Burt
`r citep(bib[c("agapow2001indices")])`. Agapow and Burt also described the index
$\bar{r}_d$ that accounts for the number of loci sampled that is less biased and
will be used here. The data we will use in this chapter are populations of
*Phytophthora infestans* from North and South America
`r citep(bib["goss2014irish"])`. We will use the index of association to test the hypothesis that Mexico is the putative origin of *P. infestans* where populations are expected to be sexual while populations in South America are expected to be clonal.
First, we need to load the packages needed for this analysis.
```{r}
library("poppr")
library("magrittr")
data(Pinf)
```
Next, we will analyze the North American population with the index of
association and use 999 permutations of the data in order to give us a p-value.
Note that the p-value is calculated with the original observation included.
```{r, eval=FALSE}
MX <- popsub(Pinf, "North America")
ia(MX, sample = 999)
```
```{r, MXia, echo=FALSE}
cat("|================================================================| 100%\n")
# the function options for ia() include:
# ia(pop, sample = 0, method = 1, quiet = FALSE, missing = "ignore", hist = TRUE)
MX <- popsub(Pinf, "North America")
set.seed(20154)
MXia <- ia(MX, sample = 999, quiet=TRUE)
MXia
```
> **For advanced users:** For reproducibility, use `set.seed()` before invoking `ia()`.
We observe 48 individuals and see that $P = `r round(MXia[4], 3)`$ for
$\bar{r}_d = `r round(MXia[3], 3)`$. We thus reject the null hypothesis of no
linkage among markers. Notice, however, that the observed $\bar{r}_d$ falls on
the right tail of the re-sampled distribution and the P value is close to $P =
0.01$. Could this population have clones? We can find out by displaying the
data.
```{r}
MX
```
## Clone correction
Indeed we observe 43 multilocus genotypes out of 48 samples. We are looking at
partial clonality and thus need to use clone-corrected (also called clone-
censored) data:
```{r, eval=FALSE}
MX %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample = 999)
```
```{r, MXiacc, echo=FALSE}
cat("|================================================================| 100%\n")
set.seed(20154)
MX %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample = 999, quiet = TRUE)
```
Now $\bar{r}_d$ is located more centrally in the distribution expected from
unlinked loci. Note that $P$ has improved and we fail to reject the null
hypothesis of no linkage among markers. Thus it appears that populations in
Mexico are sexual.
Next let's use the same process to evaluate the South American population:
```{r, eval=FALSE}
SA <- popsub(Pinf, "South America")
ia(SA, sample = 999)
```
```{r, SAia, echo=FALSE}
cat("|================================================================| 100%\n")
# the function options for ia() include:
# ia(pop, sample = 0, method = 1, quiet = FALSE, missing = "ignore", hist = TRUE)
SA <- popsub(Pinf, "South America")
set.seed(20154)
SAia <- ia(SA, sample=999, quiet=TRUE)
SAia
```
Here we find significant support for the hypothesis that alleles are linked
across loci with $P < 0.001$. The observed $\bar{r}_d = `r round(SAia[3], 3)`$
and falls outside of the distribution expected under no linkage. Let's look at
the clone-corrected data and make sure this is not an artifact of clonality:
```{r, eval=FALSE}
SA %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample=999)
```
```{r, SAiacc, echo=FALSE}
cat("|================================================================| 100%\n")
set.seed(9999)
SA %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample=999, quiet=TRUE)
```
Both clone-corrected ($N = 29$) and uncorrected data ($N = 38$) reject the
hypothesis of no linkage among markers. We thus have support for populations in
Mexico being sexual while those in South America are clonal.
This approach has been applied to provide support for Mexico as the putative
center of origin of the potato late blight pathogen *P. infestans*
`r citep(bib[c("goss2014irish")])`. At the center of origin this organism is
expected to reproduce sexually, while South American populations are clonal.
## Pairwise $\bar{r}_d$ over all loci
To ensure that the pattern of linkage disequilibrium seen is not due to a single
pair of loci, you can calculate $I_A$ and $\bar{r}_d$ over all pairs of loci.
We'll perform this on the clone-corrected samples as above.
Pairwise for the Mexican population:
```{r, eval = FALSE}
mxpair <- MX %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia
```
```{r, echo = FALSE}
cat("|================================================================| 100%\n")
mxpair <- MX %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia(quiet = TRUE)
```
Pairwise for the South American population:
```{r, eval = FALSE}
sapair <- SA %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia
```
```{r, echo = FALSE}
cat("|================================================================| 100%\n")
sapair <- SA %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia(quiet = TRUE)
```
The heatmaps produced make it look like there is more linkage in the Mexican
population! But this is where looks can be deceiving. The color palettes are
scaled to the data. We can confirm it by looking at the values:
```{r}
head(mxpair, 10) # Mexico
head(sapair, 10) # South America
```
We can see that most of the values from South America are indeed higher than in
Mexico. Notice the value that says "NaN" in the South American data? That
represents missing data. If you recall from the chapter on [Locus
Stats](Locus_Stats.html), the number of alleles at locus Pi33 for the South
American population was 1. If you try to analyze the index of association on a
locus with only one allele, you will get an undefined value. This is why the
heatmap for the South American population has grey squares in it.
### Plotting the output of pair.ia
The output of pair.ia is a matrix that also has a class of "pairia". It has a
specific plot method that we can use to plot the output again and set a standard
limit to the plot by specifying a range.
```{r}
plotrange <- range(c(mxpair, sapair), na.rm = TRUE)
plot(mxpair, limits = plotrange)
plot(sapair, limits = plotrange)
```
References
----------