Modify documentation for Bruvo's distance to fix #4
zkamvar committed Feb 2, 2015
1 parent 5efb868 commit 8407cec
Showing 8 changed files with 150 additions and 93 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: poppr
Type: Package
Title: Genetic Analysis of Populations With Mixed Reproduction
Version: 1.1.2.99-55
Date: 2015-01-30
Version: 1.1.2.99-56
Date: 2015-02-02
Authors@R: c(person(c("Zhian", "N."), "Kamvar", role = c("cre", "aut"),
email = "[email protected]"),
person(c("Javier", "F."), "Tabima", role = "aut",
94 changes: 51 additions & 43 deletions R/bruvo.r
@@ -60,22 +60,15 @@
#' nucleotide repeats for each microsatellite locus.
#'
#' @param add if \code{TRUE}, genotypes with zero values will be treated under
#' the genome addition model presented in Bruvo et al. 2004.
#' the genome addition model presented in Bruvo et al. 2004. See the
#' \strong{Note} section for options.
#'
#' @param loss if \code{TRUE}, genotypes with zero values will be treated under
#' the genome loss model presented in Bruvo et al. 2004.
#' the genome loss model presented in Bruvo et al. 2004. See the
#' \strong{Note} section for options.
#'
#' @return a \code{distance matrix}
#' @return an object of class \code{\link{dist}}
#'
#' @note The result of both \code{add = TRUE} and \code{loss = TRUE} is that the
#' distance is averaged over both values. If both are set to \code{FALSE},
#' then the infinite alleles model is used. For genotypes with all missing
#' values, the result will be NA.
#'
#' If the user does not provide a vector of appropriate length for
#' \code{replen} , it will be estimated by taking the minimum difference among
#' represented alleles at each locus. IT IS NOT RECOMMENDED TO RELY ON THIS
#' ESTIMATION.
#'
#' @details Ploidy is irrelevant with respect to calculation of Bruvo's
#' distance. However, since it makes a comparison between all alleles at a
@@ -85,41 +78,56 @@
#' have a lower ploidy level than the organism.
#'
#' To help deal with these situations, Bruvo has suggested three methods for
#' dealing with these differences in ploidy levels: \itemize{ \item Infinite
#' Model - The simplest way to deal with it is to count all missing alleles as
#' infinitely large so that the distance between it and anything else is 1.
#' Aside from this being computationally simple, it will tend to
#' \strong{inflate distances between individuals}. \item Genome Addition Model
#' - If it is suspected that the organism has gone through a recent genome
#' expansion, \strong{the missing alleles will be replace with all possible
#' combinations of the observed alleles in the shorter genotype}. For example,
#' if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele, the
#' possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], and [69, 70,
#' 70, 70]. The resulting distances are then averaged over the number of
#' comparisons. \item Genome Loss Model - This is similar to the genome
#' addition model, except that it assumes that there was a recent genome
#' reduction event and uses \strong{the observed values in the full genotype
#' to fill the missing values in the short genotype}. As with the Genome
#' Addition Model, the resulting distances are averaged over the number of
#' comparisons. \item Combination Model - Combine and average the genome
#' addition and loss models. } As mentioned above, the infinite model is
#' biased, but it is not nearly as computationally intensive as either of the
#' other models. The reason for this is that both of the addition and loss
#' models requires replacement of alleles and recalculation of Bruvo's
#' distance. The number of replacements required is equal to the multiset
#' coefficient: \eqn{\left({n \choose k}\right) == {(n+k-1) \choose
#' k}}{choose(n+k-1, k)} where \emph{n} is the number of potential
#' replacements and \emph{k} is the number of alleles to be replaced. So, for
#' the example given above, The genome addition model would require
#' \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3} calculations of
#' Bruvo's distance, whereas the genome loss model would require \eqn{\left({4
#' \choose 2}\right) = 10}{choose(4+2-1, 2) == 10} calculations.
#' dealing with these differences in ploidy levels: \itemize{ \item
#' \strong{Infinite Model} - The simplest way to deal with it is to count all
#' missing alleles as infinitely large so that the distance between it and
#' anything else is 1. Aside from this being computationally simple, it will
#' tend to \strong{inflate distances between individuals}. \item
#' \strong{Genome Addition Model} - If it is suspected that the organism has
#' gone through a recent genome expansion, \strong{the missing alleles will be
#' replaced with all possible combinations of the observed alleles in the
#' shorter genotype}. For example, if there is a genotype of [69, 70, 0, 0]
#' where 0 is a missing allele, the possible combinations are: [69, 70, 69,
#' 69], [69, 70, 69, 70], and [69, 70, 70, 70]. The resulting distances are
#' then averaged over the number of comparisons. \item \strong{Genome Loss
#' Model} - This is similar to the genome addition model, except that it
#' assumes that there was a recent genome reduction event and uses \strong{the
#' observed values in the full genotype to fill the missing values in the
#' short genotype}. As with the Genome Addition Model, the resulting distances
#' are averaged over the number of comparisons. \item \strong{Combination
#' Model} - Combine and average the genome addition and loss models. } As
#' mentioned above, the infinite model is biased, but it is not nearly as
#' computationally intensive as either of the other models. The reason for
#' this is that both the addition and loss models require replacement of
#' alleles and recalculation of Bruvo's distance. The number of replacements
#' required is equal to the multiset coefficient: \eqn{\left({n \choose
#' k}\right) == {(n+k-1) \choose k}}{choose(n+k-1, k)} where \emph{n} is the
#' number of potential replacements and \emph{k} is the number of alleles to
#' be replaced. So, for the example given above, the genome addition model
#' would require \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3}
#' calculations of Bruvo's distance, whereas the genome loss model would
#' require \eqn{\left({4 \choose 2}\right) = 10}{choose(4+2-1, 2) == 10}
#' calculations.
#'
#' To reduce the number of calculations and assumptions otherwise, Bruvo's
#' distance will be calculated using the largest observed ploidy in pairwise
#' comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0],
#' distance will be calculated using the largest observed ploidy in pairwise
#' comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0],
#' they will be treated as triploids.
#'
#' @note \subsection{Model Choice}{ The \code{add} and \code{loss} arguments
#' modify the model choice accordingly: \itemize{ \item \strong{Infinite
#' Model:} \code{add = FALSE, loss = FALSE} \item \strong{Genome Addition
#' Model:} \code{add = TRUE, loss = FALSE} \item \strong{Genome Loss Model:}
#' \code{add = FALSE, loss = TRUE} \item \strong{Combination Model}
#' \emph{(DEFAULT):} \code{add = TRUE, loss = TRUE} } Details of each model
#' choice are described in the \strong{Details} section, above. Additionally,
#' genotypes containing all missing values at a locus will return a value of
#' \code{NA} and not contribute to the average across loci. }
#' \subsection{Repeat Lengths}{ If the user does not provide a vector of
#' appropriate length for \code{replen} , it will be estimated by taking the
#' minimum difference among represented alleles at each locus. IT IS NOT
#' RECOMMENDED TO RELY ON THIS ESTIMATION. }
#'
#' @export
#' @author Zhian N. Kamvar
#'
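As a worked illustration of the multiset coefficient described in the documentation above, the following base R sketch counts and enumerates the fill-ins for the genome addition example [69, 70, 0, 0]. The helper names `multiset_coef` and `fill_missing` are hypothetical and are not part of poppr; this is only a sketch of the counting argument, not a description of how `bruvo.dist` is implemented internally.

```r
# Multiset coefficient: number of ways to choose k items from n with replacement.
multiset_coef <- function(n, k) choose(n + k - 1, k)

# Hypothetical helper: enumerate every multiset of size k drawn from `observed`.
# combn() over n + k - 1 positions, shifted by 0..(k - 1), yields
# non-decreasing index vectors into `observed`.
fill_missing <- function(observed, k) {
  n   <- length(observed)
  idx <- combn(n + k - 1, k) - (seq_len(k) - 1)
  matrix(observed[idx], nrow = k)  # one column per candidate fill-in
}

short <- c(69, 70)                  # observed alleles of [69, 70, 0, 0]
fills <- fill_missing(short, k = 2) # replacements for the two missing alleles

# Each row is one completed genotype to compare against the full genotype.
candidates <- t(apply(fills, 2, function(f) c(short, f)))

multiset_coef(2, 2)  # genome addition model: fill from the 2 observed alleles
multiset_coef(4, 2)  # genome loss model: fill from the 4 alleles of the full genotype
```

The number of columns in `fills` matches `multiset_coef(length(short), 2)`, which is the number of Bruvo's distance calculations the addition model would need for this genotype.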
89 changes: 48 additions & 41 deletions man/bruvo.dist.Rd
@@ -13,13 +13,15 @@ bruvo.dist(pop, replen = 1, add = TRUE, loss = TRUE)
nucleotide repeats for each microsatellite locus.}
\item{add}{if \code{TRUE}, genotypes with zero values will be treated under
the genome addition model presented in Bruvo et al. 2004.}
the genome addition model presented in Bruvo et al. 2004. See the
\strong{Note} section for options.}
\item{loss}{if \code{TRUE}, genotypes with zero values will be treated under
the genome loss model presented in Bruvo et al. 2004.}
the genome loss model presented in Bruvo et al. 2004. See the
\strong{Note} section for options.}
}
\value{
a \code{distance matrix}
an object of class \code{\link{dist}}
}
\description{
Calculate the average Bruvo's distance over all loci in a population.
@@ -33,51 +35,56 @@ Ploidy is irrelevant with respect to calculation of Bruvo's
have a lower ploidy level than the organism.

To help deal with these situations, Bruvo has suggested three methods for
dealing with these differences in ploidy levels: \itemize{ \item Infinite
Model - The simplest way to deal with it is to count all missing alleles as
infinitely large so that the distance between it and anything else is 1.
Aside from this being computationally simple, it will tend to
\strong{inflate distances between individuals}. \item Genome Addition Model
- If it is suspected that the organism has gone through a recent genome
expansion, \strong{the missing alleles will be replace with all possible
combinations of the observed alleles in the shorter genotype}. For example,
if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele, the
possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], and [69, 70,
70, 70]. The resulting distances are then averaged over the number of
comparisons. \item Genome Loss Model - This is similar to the genome
addition model, except that it assumes that there was a recent genome
reduction event and uses \strong{the observed values in the full genotype
to fill the missing values in the short genotype}. As with the Genome
Addition Model, the resulting distances are averaged over the number of
comparisons. \item Combination Model - Combine and average the genome
addition and loss models. } As mentioned above, the infinite model is
biased, but it is not nearly as computationally intensive as either of the
other models. The reason for this is that both of the addition and loss
models requires replacement of alleles and recalculation of Bruvo's
distance. The number of replacements required is equal to the multiset
coefficient: \eqn{\left({n \choose k}\right) == {(n+k-1) \choose
k}}{choose(n+k-1, k)} where \emph{n} is the number of potential
replacements and \emph{k} is the number of alleles to be replaced. So, for
the example given above, The genome addition model would require
\eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3} calculations of
Bruvo's distance, whereas the genome loss model would require \eqn{\left({4
\choose 2}\right) = 10}{choose(4+2-1, 2) == 10} calculations.
dealing with these differences in ploidy levels: \itemize{ \item
\strong{Infinite Model} - The simplest way to deal with it is to count all
missing alleles as infinitely large so that the distance between it and
anything else is 1. Aside from this being computationally simple, it will
tend to \strong{inflate distances between individuals}. \item
\strong{Genome Addition Model} - If it is suspected that the organism has
gone through a recent genome expansion, \strong{the missing alleles will be
replaced with all possible combinations of the observed alleles in the
shorter genotype}. For example, if there is a genotype of [69, 70, 0, 0]
where 0 is a missing allele, the possible combinations are: [69, 70, 69,
69], [69, 70, 69, 70], and [69, 70, 70, 70]. The resulting distances are
then averaged over the number of comparisons. \item \strong{Genome Loss
Model} - This is similar to the genome addition model, except that it
assumes that there was a recent genome reduction event and uses \strong{the
observed values in the full genotype to fill the missing values in the
short genotype}. As with the Genome Addition Model, the resulting distances
are averaged over the number of comparisons. \item \strong{Combination
Model} - Combine and average the genome addition and loss models. } As
mentioned above, the infinite model is biased, but it is not nearly as
computationally intensive as either of the other models. The reason for
this is that both the addition and loss models require replacement of
alleles and recalculation of Bruvo's distance. The number of replacements
required is equal to the multiset coefficient: \eqn{\left({n \choose
k}\right) == {(n+k-1) \choose k}}{choose(n+k-1, k)} where \emph{n} is the
number of potential replacements and \emph{k} is the number of alleles to
be replaced. So, for the example given above, the genome addition model
would require \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3}
calculations of Bruvo's distance, whereas the genome loss model would
require \eqn{\left({4 \choose 2}\right) = 10}{choose(4+2-1, 2) == 10}
calculations.

To reduce the number of calculations and assumptions otherwise, Bruvo's
distance will be calculated using the largest observed ploidy in pairwise
comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0],
they will be treated as triploids.
}
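The pairwise ploidy rule in the last paragraph of the details can be sketched in base R. This is illustrative only: `pairwise_ploidy` and `trim_pair` are hypothetical helpers, not poppr functions, and the sketch assumes that 0 codes a missing allele, as in the examples.

```r
# Largest number of observed (non-zero) alleles in either genotype.
pairwise_ploidy <- function(g1, g2) max(sum(g1 != 0), sum(g2 != 0))

# Hypothetical helper: reduce both genotypes to the pairwise ploidy,
# padding the shorter one with zeros (missing alleles).
trim_pair <- function(g1, g2) {
  p <- pairwise_ploidy(g1, g2)
  pad <- function(g) {
    obs <- g[g != 0]
    c(obs, rep(0, p - length(obs)))
  }
  list(g1 = pad(g1), g2 = pad(g2))
}

pair <- trim_pair(c(69, 70, 71, 0), c(59, 60, 0, 0))
pair$g1  # three observed alleles
pair$g2  # two observed alleles plus one missing
```

Under this reading, only one missing allele in the second genotype has to be handled by the chosen model, rather than two.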
\note{
The result of both \code{add = TRUE} and \code{loss = TRUE} is that the
distance is averaged over both values. If both are set to \code{FALSE},
then the infinite alleles model is used. For genotypes with all missing
values, the result will be NA.
If the user does not provide a vector of appropriate length for
\code{replen} , it will be estimated by taking the minimum difference among
represented alleles at each locus. IT IS NOT RECOMMENDED TO RELY ON THIS
ESTIMATION.
\subsection{Model Choice}{ The \code{add} and \code{loss} arguments
modify the model choice accordingly: \itemize{ \item \strong{Infinite
Model:} \code{add = FALSE, loss = FALSE} \item \strong{Genome Addition
Model:} \code{add = TRUE, loss = FALSE} \item \strong{Genome Loss Model:}
\code{add = FALSE, loss = TRUE} \item \strong{Combination Model}
\emph{(DEFAULT):} \code{add = TRUE, loss = TRUE} } Details of each model
choice are described in the \strong{Details} section, above. Additionally,
genotypes containing all missing values at a locus will return a value of
\code{NA} and not contribute to the average across loci. }
\subsection{Repeat Lengths}{ If the user does not provide a vector of
appropriate length for \code{replen} , it will be estimated by taking the
minimum difference among represented alleles at each locus. IT IS NOT
RECOMMENDED TO RELY ON THIS ESTIMATION. }
}
\examples{
# Please note that the data presented is assuming that the nancycat dataset
2 changes: 1 addition & 1 deletion vignettes/algo-concordance.tex
@@ -1,2 +1,2 @@
\Sconcordance{concordance:algo.tex:algo.Rnw:%
1 64 1 46 0 1 8 413 1 4 0 22 1 11 0 48 1}
1 64 1 46 0 1 8 434 1 4 0 22 1 11 0 48 1}
23 changes: 22 additions & 1 deletion vignettes/algo.Rnw
@@ -51,7 +51,7 @@
\scalebox{-1}[1]{\jala{}}
}

\title{Algorithms and equations utilized in poppr version 1.1.2.99-55}
\title{Algorithms and equations utilized in poppr version 1.1.2.99-56}
\author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
Department of Botany and Plant Pathology, Oregon State University, Corvallis,
OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,
@@ -450,6 +450,27 @@ will be calculated using the largest observed ploidy in pairwise comparisons.
This means that when
comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids.

\subsubsection{Choosing a model}
\label{appendix:algorithm:bruvomodel}
By default, the implementation of Bruvo's distance in \poppr{} will utilize the
combination model. This is implemented by setting both the \texttt{add} and
\texttt{loss} arguments to \texttt{TRUE}. For other models use the following
table for reference:

\begin{table}[ht]
\centering
\begin{tabular}{ll}
\hline
Model & Arguments \\
\hline
Infinite & \texttt{add = FALSE, loss = FALSE} \\
Genome Addition & \texttt{add = TRUE, loss = FALSE} \\
Genome Loss & \texttt{add = FALSE, loss = TRUE} \\
Combination (default) & \texttt{add = TRUE, loss = TRUE} \\
\hline
\end{tabular}
\end{table}
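The table can be read as a simple lookup on the two logical arguments. The sketch below is base R with a hypothetical helper, `bruvo_model`, which is not part of poppr; it only names the model that a given `add`/`loss` pair selects according to the table.

```r
# Hypothetical helper: name the model selected by the add/loss arguments.
# Defaults mirror the documented defaults (add = TRUE, loss = TRUE).
bruvo_model <- function(add = TRUE, loss = TRUE) {
  stopifnot(is.logical(add), length(add) == 1,
            is.logical(loss), length(loss) == 1)
  if (add && loss) {
    "Combination"
  } else if (add) {
    "Genome Addition"
  } else if (loss) {
    "Genome Loss"
  } else {
    "Infinite"
  }
}

bruvo_model()                          # default arguments
bruvo_model(add = FALSE, loss = TRUE)  # genome loss model only
```

With poppr loaded, the same pair of arguments would be passed to `bruvo.dist` alongside `pop` and `replen`, following the usage line `bruvo.dist(pop, replen = 1, add = TRUE, loss = TRUE)` shown in man/bruvo.dist.Rd.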

\subsection{Tree topology}

All of these distances were designed for analysis of populations. When applying
Binary file modified vignettes/algo.pdf
Binary file not shown.
29 changes: 25 additions & 4 deletions vignettes/algo.tex
@@ -100,7 +100,7 @@
\scalebox{-1}[1]{\jala{}}
}

\title{Algorithms and equations utilized in poppr version 1.1.2}
\title{Algorithms and equations utilized in poppr version 1.1.2.99-55}
\author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
Department of Botany and Plant Pathology, Oregon State University, Corvallis,
OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,
@@ -489,6 +489,27 @@ \subsubsection{Special cases of Bruvo's distance}
This means that when
comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids.

\subsubsection{Choosing a model}
\label{appendix:algorithm:bruvomodel}
By default, the implementation of Bruvo's distance in \poppr{} will utilize the
combination model. This is implemented by setting both the \texttt{add} and
\texttt{loss} arguments to \texttt{TRUE}. For other models use the following
table for reference:

\begin{table}[ht]
\centering
\begin{tabular}{ll}
\hline
Model & Arguments \\
\hline
Infinite & \texttt{add = FALSE, loss = FALSE} \\
Genome Addition & \texttt{add = TRUE, loss = FALSE} \\
Genome Loss & \texttt{add = FALSE, loss = TRUE} \\
Combination (default) & \texttt{add = TRUE, loss = TRUE} \\
\hline
\end{tabular}
\end{table}

\subsection{Tree topology}

All of these distances were designed for analysis of populations. When applying
@@ -520,7 +541,7 @@ \subsection{Tree topology}
\begin{knitrout}\footnotesize
\definecolor{shadecolor}{rgb}{0.933, 0.933, 0.933}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(poppr)}
\hlkwd{library}\hlstd{(}\hlstr{"poppr"}\hlstd{)}
\hlstd{dat.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{Genotype} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"1/1"}\hlstd{,} \hlstr{"1/2"}\hlstd{,} \hlstr{"2/3"}\hlstd{,} \hlstr{"3/4"}\hlstd{,} \hlstr{"4/4"}\hlstd{))}
\hlstd{dat} \hlkwb{<-} \hlkwd{as.genclone}\hlstd{(}\hlkwd{df2genind}\hlstd{(dat.df,} \hlkwc{sep} \hlstd{=} \hlstr{"/"}\hlstd{,} \hlkwc{ind.names} \hlstd{= dat.df[[}\hlnum{1}\hlstd{]]))}
\end{alltt}
@@ -543,7 +564,7 @@ \subsection{Tree topology}
\hlcom{# Adding Bruvo's distance at the end because we need to specify repeat length.}
\hlstd{dists}\hlopt{$}\hlstd{Bruvo} \hlkwb{<-} \hlkwd{bruvo.dist}\hlstd{(dat,} \hlkwc{replen} \hlstd{=} \hlnum{1}\hlstd{)}
\hlkwd{library}\hlstd{(ape)}
\hlkwd{library}\hlstd{(}\hlstr{"ape"}\hlstd{)}
\hlkwd{par}\hlstd{(}\hlkwc{mfrow} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{3}\hlstd{))}
\hlstd{x} \hlkwb{<-} \hlkwd{lapply}\hlstd{(}\hlkwd{names}\hlstd{(dists),} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}
\hlkwd{plot}\hlstd{(}\hlkwd{nj}\hlstd{(dists[[x]]),} \hlkwc{main} \hlstd{= x,} \hlkwc{type} \hlstd{=} \hlstr{"unrooted"}\hlstd{)}
@@ -552,7 +573,7 @@ \subsection{Tree topology}
\end{alltt}
\end{kframe}

{\centering \includegraphics[width=0.95\linewidth]{figure/unnamed-chunk-3}
{\centering \includegraphics[width=0.95\linewidth]{figure/unnamed-chunk-3-1}

}

2 changes: 1 addition & 1 deletion vignettes/poppr_manual.Rnw
@@ -51,7 +51,7 @@
\scalebox{-1}[1]{\jala{}}
}

\title{Data import and manipulation in poppr version 1.1.2.99-55}
\title{Data import and manipulation in poppr version 1.1.2.99-56}
\author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
Department of Botany and Plant Pathology, Oregon State University, Corvallis,
OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,