Linkage_disequilibrium.Rmd

```{r,message=FALSE,echo=FALSE}
  html <- TRUE
  library("knitcitations")
  library("knitr")
  cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
  bib <- read.bibtex("bibtexlib.bib")
  opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE, cache = TRUE, fig.width = 7, fig.height = 7)
  if (html) opts_chunk$set(out.width = "700px", dpi = 300)
  # use this to set knitr options:
  # http://yihui.name/knitr/options #chunk_options
```
---
title: "Linkage disequilibrium"
subtitle: "*NJ Gr&uuml;nwald, ZN Kamvar and SE Everhart*"
---

In this chapter we will formally test if populations are in linkage
disequilibrium or not. This test is useful to determine if populations are
clonal (where significant disequilibrium is expected due to linkage among loci)
or sexual (where linkage among loci is not expected). The null hypothesis tested
is that alleles observed at different loci are not linked if populations are
sexual while alleles recombine freely  into new genotypes during the process of
sexual reproduction. In molecular ecology we typically use the index of
association or related indices to test this phenomenon.

# The index of association

The index of association ($I_A$) was originally proposed by Brown et al.
`r citep(bib[c("brown1980multilocus")])` and implemented in the *poppr* R
package `r citep(bib[c("kamvar2014poppr")])` using a permutation approach to
assess if loci are linked as described previously by Agapow and Burt
`r citep(bib[c("agapow2001indices")])`. Agapow and Burt also described the index
$\bar{r}_d$ that accounts for the number of loci sampled that is less biased and
will be used here. The data we will use in this chapter are populations of
*Phytophthora infestans* from North and South America
`r citep(bib["goss2014irish"])`. We will use the index of association to test the hypothesis that Mexico is the putative origin of *P. infestans* where populations are expected to be sexual while populations in South America are expected to be clonal.

First, we need to load the packages needed for this analysis.

```{r}
library("poppr")
library("magrittr")
data(Pinf)
```

Next, we will analyze the North American population with the index of
association and use 999 permutations of the data in order to give us a p-value.
Note that the p-value is calculated with the original observation included.

```{r, eval=FALSE}
MX <- popsub(Pinf, "North America")
ia(MX, sample = 999)
```
```{r, MXia, echo=FALSE}
cat("|================================================================| 100%\n")
# the function options for ia() include:
# ia(pop, sample = 0, method = 1, quiet = FALSE, missing = "ignore", hist = TRUE)
MX <- popsub(Pinf, "North America")
set.seed(20154)
MXia <- ia(MX, sample = 999, quiet=TRUE)
MXia
```

> **For advanced users:** For reproducibility, use `set.seed()` before invoking `ia()`.

We observe 48 individuals and see that $P = `r round(MXia[4], 3)`$ for
$\bar{r}_d = `r round(MXia[3], 3)`$. We thus reject the null hypothesis of no
linkage among markers. Notice, however, that the observed $\bar{r}_d$ falls on
the right tail of the re-sampled distribution and the P value is close to $P =
0.01$. Could this population have clones? We can find out by displaying the
data.

```{r}
MX
```

## Clone correction

Indeed we observe 43 multilocus genotypes out of 48 samples. We are looking at
partial clonality and thus need to use clone-corrected (also called clone-
censored) data:

```{r, eval=FALSE}
MX %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample = 999)
```
```{r, MXiacc, echo=FALSE}
cat("|================================================================| 100%\n")
set.seed(20154)
MX %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample = 999, quiet = TRUE)
```

Now $\bar{r}_d$ is located more centrally in the distribution expected from
unlinked loci. Note that $P$ has improved and we fail to reject the null
hypothesis of no linkage among markers. Thus it appears that populations in
Mexico are sexual.

Next let's use the same process to evaluate the South American population:

```{r, eval=FALSE}
SA <- popsub(Pinf, "South America")
ia(SA, sample = 999)
```
```{r, SAia, echo=FALSE}
cat("|================================================================| 100%\n")
# the function options for ia() include:
# ia(pop, sample = 0, method = 1, quiet = FALSE, missing = "ignore", hist = TRUE)
SA <- popsub(Pinf, "South America")
set.seed(20154)
SAia <- ia(SA, sample=999, quiet=TRUE)
SAia
```

Here we find significant support for the hypothesis that alleles are linked
across loci with $P < 0.001$. The observed $\bar{r}_d = `r round(SAia[3], 3)`$
and falls outside of the distribution expected under no linkage. Let's look at
the clone-corrected data and make sure this is not an artifact of clonality:

```{r, eval=FALSE}
SA %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample=999)
```
```{r, SAiacc, echo=FALSE}
cat("|================================================================| 100%\n")
set.seed(9999)
SA %>% clonecorrect(strata= ~Continent/Country) %>% ia(sample=999, quiet=TRUE)
```

Both clone-corrected ($N = 29$) and uncorrected data ($N = 38$) reject the
hypothesis of no linkage among markers. We thus have support for populations in
Mexico being sexual while those in South America are clonal.

This approach has been applied to provide support for Mexico as the putative
center of origin of the potato late blight pathogen *P. infestans*
`r citep(bib[c("goss2014irish")])`. At the center of origin this organism is
expected to reproduce sexually, while South American populations are clonal.

## Pairwise $\bar{r}_d$ over all loci

To ensure that the pattern of linkage disequilibrium seen is not due to a single
pair of loci, you can calculate $I_A$ and $\bar{r}_d$ over all pairs of loci.
We'll perform this on the clone-corrected samples as above.

Pairwise for the Mexican population:

```{r, eval = FALSE}
mxpair <- MX %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia
```

```{r, echo = FALSE}
cat("|================================================================| 100%\n")
mxpair <- MX %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia(quiet = TRUE)
```

Pairwise for the South American population:

```{r, eval = FALSE}
sapair <- SA %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia
```

```{r, echo = FALSE}
cat("|================================================================| 100%\n")
sapair <- SA %>% clonecorrect(strata = ~Continent/Country) %>% pair.ia(quiet = TRUE)
```

The heatmaps produced make it look like there is more linkage in the Mexican
population! But this is where looks can be deceiving. The color palettes are
scaled to the data. We can confirm it by looking at the values:

```{r}
head(mxpair, 10) # Mexico
head(sapair, 10) # South America
```

We can see that most of the values from South America are indeed higher than in
Mexico. Notice the value that says "NaN" in the South American data? That
represents missing data. If you recall from the chapter on [Locus
Stats](Locus_Stats.html), the number of alleles at locus Pi33 for the South
American population was 1. If you try to analyze the index of association on a
locus with only one allele, you will get an undefined value. This is why the
heatmap for the South American population has grey squares in it.

### Plotting the output of pair.ia

The output of pair.ia is a matrix that also has a class of "pairia". It has a
specific plot method that we can use to plot the output again and set a standard
limit to the plot by specifying a range.

```{r}
plotrange <- range(c(mxpair, sapair), na.rm = TRUE)
plot(mxpair, limits = plotrange)
plot(sapair, limits = plotrange)
```

References
----------