Modify documentation for Bruvo's distance to fix #4

grunwaldlab · Feb 2, 2015 · 8407cec
1 parent 5efb868

Showing 8 changed files with 150 additions and 93 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: poppr
 Type: Package
 Title: Genetic Analysis of Populations With Mixed Reproduction
-Version: 1.1.2.99-55
-Date: 2015-01-30
+Version: 1.1.2.99-56
+Date: 2015-02-02
 Authors@R: c(person(c("Zhian", "N."), "Kamvar", role = c("cre", "aut"),
     email = "[email protected]"),
     person(c("Javier", "F."), "Tabima", role = "aut",

diff --git a/R/bruvo.r b/R/bruvo.r
@@ -60,22 +60,15 @@
 #'   nucleotide repeats for each microsatellite locus.
 #'   
 #' @param add if \code{TRUE}, genotypes with zero values will be treated under 
-#'   the genome addition model presented in Bruvo et al. 2004.
+#'   the genome addition model presented in Bruvo et al. 2004. See the
+#'   \strong{Note} section for options.
 #'   
 #' @param loss if \code{TRUE}, genotypes with zero values will be treated under 
-#'   the genome loss model presented in Bruvo et al. 2004.
+#'   the genome loss model presented in Bruvo et al. 2004. See the
+#'   \strong{Note} section for options.
 #'   
-#' @return a \code{distance matrix}
+#' @return an object of class \code{\link{dist}}
 #'   
-#' @note The result of both \code{add = TRUE} and \code{loss = TRUE} is that the
-#'   distance is averaged over both values. If both are set to \code{FALSE}, 
-#'   then the infinite alleles model is used. For genotypes with all missing 
-#'   values, the result will be NA.
-#'   
-#'   If the user does not provide a vector of appropriate length for 
-#'   \code{replen} , it will be estimated by taking the minimum difference among
-#'   represented alleles at each locus. IT IS NOT RECOMMENDED TO RELY ON THIS 
-#'   ESTIMATION.
 #'   
 #' @details Ploidy is irrelevant with respect to calculation of Bruvo's 
 #'   distance. However, since it makes a comparison between all alleles at a 
@@ -85,41 +78,56 @@
 #'   have a lower ploidy level than the organism.
 #'   
 #'   To help deal with these situations, Bruvo has suggested three methods for 
-#'   dealing with these differences in ploidy levels: \itemize{ \item Infinite 
-#'   Model - The simplest way to deal with it is to count all missing alleles as
-#'   infinitely large so that the distance between it and anything else is 1. 
-#'   Aside from this being computationally simple, it will tend to 
-#'   \strong{inflate distances between individuals}. \item Genome Addition Model
-#'   - If it is suspected that the organism has gone through a recent genome 
-#'   expansion, \strong{the missing alleles will be replace with all possible 
-#'   combinations of the observed alleles in the shorter genotype}. For example,
-#'   if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele, the 
-#'   possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], and [69, 70,
-#'   70, 70]. The resulting distances are then averaged over the number of 
-#'   comparisons. \item Genome Loss Model - This is similar to the genome 
-#'   addition model, except that it assumes that there was a recent genome 
-#'   reduction event and uses \strong{the observed values in the full genotype 
-#'   to fill the missing values in the short genotype}. As with the Genome 
-#'   Addition Model, the resulting distances are averaged over the number of 
-#'   comparisons. \item Combination Model - Combine and average the genome 
-#'   addition and loss models. } As mentioned above, the infinite model is 
-#'   biased, but it is not nearly as computationally intensive as either of the 
-#'   other models. The reason for this is that both of the addition and loss 
-#'   models requires replacement of alleles and recalculation of Bruvo's 
-#'   distance. The number of replacements required is equal to the multiset 
-#'   coefficient: \eqn{\left({n \choose k}\right) == {(n+k-1) \choose 
-#'   k}}{choose(n+k-1, k)} where \emph{n} is the number of potential 
-#'   replacements and \emph{k} is the number of alleles to be replaced. So, for 
-#'   the example given above, The genome addition model would require 
-#'   \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3} calculations of
-#'   Bruvo's distance, whereas the genome loss model would require \eqn{\left({4
-#'   \choose 2}\right) = 10}{choose(4+2-1, 2) == 10} calculations.
+#'   dealing with these differences in ploidy levels: \itemize{ \item
+#'   \strong{Infinite Model} - The simplest way to deal with it is to count all
+#'   missing alleles as infinitely large so that the distance between it and
+#'   anything else is 1. Aside from this being computationally simple, it will
+#'   tend to \strong{inflate distances between individuals}. \item
+#'   \strong{Genome Addition Model} - If it is suspected that the organism has
+#'   gone through a recent genome expansion, \strong{the missing alleles will be
+#'   replaced with all possible combinations of the observed alleles in the
+#'   shorter genotype}. For example, if there is a genotype of [69, 70, 0, 0]
+#'   where 0 is a missing allele, the possible combinations are: [69, 70, 69,
+#'   69], [69, 70, 69, 70], and [69, 70, 70, 70]. The resulting distances are
+#'   then averaged over the number of comparisons. \item \strong{Genome Loss
+#'   Model} - This is similar to the genome addition model, except that it
+#'   assumes that there was a recent genome reduction event and uses \strong{the
+#'   observed values in the full genotype to fill the missing values in the
+#'   short genotype}. As with the Genome Addition Model, the resulting distances
+#'   are averaged over the number of comparisons. \item \strong{Combination
+#'   Model} - Combine and average the genome addition and loss models. } As
+#'   mentioned above, the infinite model is biased, but it is not nearly as
+#'   computationally intensive as either of the other models. The reason for
+#'   this is that both of the addition and loss models require replacement of
+#'   alleles and recalculation of Bruvo's distance. The number of replacements
+#'   required is equal to the multiset coefficient: \eqn{\left({n \choose
+#'   k}\right) == {(n+k-1) \choose k}}{choose(n+k-1, k)} where \emph{n} is the
+#'   number of potential replacements and \emph{k} is the number of alleles to
+#'   be replaced. So, for the example given above, The genome addition model
+#'   would require \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3}
+#'   calculations of Bruvo's distance, whereas the genome loss model would
+#'   require \eqn{\left({4 \choose 2}\right) = 10}{choose(4+2-1, 2) == 10}
+#'   calculations.
 #'   
 #'   To reduce the number of calculations and assumptions otherwise, Bruvo's 
-#'   distance will be calculated using the largest observed ploidy in pairwise
-#'   comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0],
+#'   distance will be calculated using the largest observed ploidy in pairwise 
+#'   comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0], 
 #'   they will be treated as triploids.
 #'   
+#' @note \subsection{Model Choice}{ The \code{add} and \code{loss} arguments 
+#'   modify the model choice accordingly: \itemize{ \item \strong{Infinite 
+#'   Model:}  \code{add = FALSE, loss = FALSE} \item \strong{Genome Addition 
+#'   Model:}  \code{add = TRUE, loss = FALSE} \item \strong{Genome Loss Model:} 
+#'   \code{add = FALSE, loss = TRUE} \item \strong{Combination Model}
+#'   \emph{(DEFAULT):}  \code{add = TRUE, loss = TRUE} } Details of each model
+#'   choice are described in the \strong{Details} section, above. Additionally,
+#'   genotypes containing all missing values at a locus will return a value of
+#'   \code{NA} and not contribute to the average across loci. }
+#'   \subsection{Repeat Lengths}{ If the user does not provide a vector of 
+#'   appropriate length for \code{replen} , it will be estimated by taking the 
+#'   minimum difference among represented alleles at each locus. IT IS NOT 
+#'   RECOMMENDED TO RELY ON THIS ESTIMATION. }
+#'   
 #' @export
 #' @author Zhian N. Kamvar
 #'   

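Review note: the multiset counts quoted in the revised `@details` text (3 replacements for the addition model, 10 for the loss model) can be checked in base R. The sketch below is only an illustration of the formula in the documentation, not poppr's internal code; `multiset_count` and `fill_missing` are hypothetical helper names introduced here.

```r
# Number of multisets of size k drawn from n candidate alleles:
# ((n choose k)) == choose(n + k - 1, k), as stated in the @details text.
multiset_count <- function(n, k) choose(n + k - 1, k)

# Hypothetical helper: enumerate the ways to fill k missing alleles from a
# vector of candidate alleles, ignoring order (one row per multiset).
fill_missing <- function(candidates, k) {
  # All ordered k-tuples of candidates ...
  grid <- expand.grid(rep(list(candidates), k))
  # ... keeping only the non-decreasing ones so reorderings are not counted twice.
  keep <- apply(grid, 1, function(r) !is.unsorted(r))
  unname(as.matrix(grid[keep, , drop = FALSE]))
}

multiset_count(2, 2)  # addition model for [69, 70, 0, 0]: 3
multiset_count(4, 2)  # loss model against a full tetraploid genotype: 10

# The three fills for [69, 70, 0, 0]: (69, 69), (69, 70), (70, 70)
fills <- fill_missing(c(69, 70), 2)
```

Each row of `fills` corresponds to one of the genotypes listed in the documentation ([69, 70, 69, 69], [69, 70, 69, 70], [69, 70, 70, 70]), so the number of rows matches `multiset_count(2, 2)`.
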
diff --git a/man/bruvo.dist.Rd b/man/bruvo.dist.Rd
@@ -13,13 +13,15 @@ bruvo.dist(pop, replen = 1, add = TRUE, loss = TRUE)
   nucleotide repeats for each microsatellite locus.}
 
 \item{add}{if \code{TRUE}, genotypes with zero values will be treated under
-  the genome addition model presented in Bruvo et al. 2004.}
+  the genome addition model presented in Bruvo et al. 2004. See the
+  \strong{Note} section for options.}
 
 \item{loss}{if \code{TRUE}, genotypes with zero values will be treated under
-  the genome loss model presented in Bruvo et al. 2004.}
+  the genome loss model presented in Bruvo et al. 2004. See the
+  \strong{Note} section for options.}
 }
 \value{
-a \code{distance matrix}
+an object of class \code{\link{dist}}
 }
 \description{
 Calculate the average Bruvo's distance over all loci in a population.
@@ -33,51 +35,56 @@ Ploidy is irrelevant with respect to calculation of Bruvo's
   have a lower ploidy level than the organism.
 
   To help deal with these situations, Bruvo has suggested three methods for
-  dealing with these differences in ploidy levels: \itemize{ \item Infinite
-  Model - The simplest way to deal with it is to count all missing alleles as
-  infinitely large so that the distance between it and anything else is 1.
-  Aside from this being computationally simple, it will tend to
-  \strong{inflate distances between individuals}. \item Genome Addition Model
-  - If it is suspected that the organism has gone through a recent genome
-  expansion, \strong{the missing alleles will be replace with all possible
-  combinations of the observed alleles in the shorter genotype}. For example,
-  if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele, the
-  possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], and [69, 70,
-  70, 70]. The resulting distances are then averaged over the number of
-  comparisons. \item Genome Loss Model - This is similar to the genome
-  addition model, except that it assumes that there was a recent genome
-  reduction event and uses \strong{the observed values in the full genotype
-  to fill the missing values in the short genotype}. As with the Genome
-  Addition Model, the resulting distances are averaged over the number of
-  comparisons. \item Combination Model - Combine and average the genome
-  addition and loss models. } As mentioned above, the infinite model is
-  biased, but it is not nearly as computationally intensive as either of the
-  other models. The reason for this is that both of the addition and loss
-  models requires replacement of alleles and recalculation of Bruvo's
-  distance. The number of replacements required is equal to the multiset
-  coefficient: \eqn{\left({n \choose k}\right) == {(n+k-1) \choose
-  k}}{choose(n+k-1, k)} where \emph{n} is the number of potential
-  replacements and \emph{k} is the number of alleles to be replaced. So, for
-  the example given above, The genome addition model would require
-  \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3} calculations of
-  Bruvo's distance, whereas the genome loss model would require \eqn{\left({4
-  \choose 2}\right) = 10}{choose(4+2-1, 2) == 10} calculations.
+  dealing with these differences in ploidy levels: \itemize{ \item
+  \strong{Infinite Model} - The simplest way to deal with it is to count all
+  missing alleles as infinitely large so that the distance between it and
+  anything else is 1. Aside from this being computationally simple, it will
+  tend to \strong{inflate distances between individuals}. \item
+  \strong{Genome Addition Model} - If it is suspected that the organism has
+  gone through a recent genome expansion, \strong{the missing alleles will be
+  replaced with all possible combinations of the observed alleles in the
+  shorter genotype}. For example, if there is a genotype of [69, 70, 0, 0]
+  where 0 is a missing allele, the possible combinations are: [69, 70, 69,
+  69], [69, 70, 69, 70], and [69, 70, 70, 70]. The resulting distances are
+  then averaged over the number of comparisons. \item \strong{Genome Loss
+  Model} - This is similar to the genome addition model, except that it
+  assumes that there was a recent genome reduction event and uses \strong{the
+  observed values in the full genotype to fill the missing values in the
+  short genotype}. As with the Genome Addition Model, the resulting distances
+  are averaged over the number of comparisons. \item \strong{Combination
+  Model} - Combine and average the genome addition and loss models. } As
+  mentioned above, the infinite model is biased, but it is not nearly as
+  computationally intensive as either of the other models. The reason for
+  this is that both of the addition and loss models require replacement of
+  alleles and recalculation of Bruvo's distance. The number of replacements
+  required is equal to the multiset coefficient: \eqn{\left({n \choose
+  k}\right) == {(n+k-1) \choose k}}{choose(n+k-1, k)} where \emph{n} is the
+  number of potential replacements and \emph{k} is the number of alleles to
+  be replaced. So, for the example given above, The genome addition model
+  would require \eqn{\left({2 \choose 2}\right) = 3}{choose(2+2-1, 2) == 3}
+  calculations of Bruvo's distance, whereas the genome loss model would
+  require \eqn{\left({4 \choose 2}\right) = 10}{choose(4+2-1, 2) == 10}
+  calculations.
 
   To reduce the number of calculations and assumptions otherwise, Bruvo's
   distance will be calculated using the largest observed ploidy in pairwise
   comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0],
   they will be treated as triploids.
 }
 \note{
-The result of both \code{add = TRUE} and \code{loss = TRUE} is that the
-  distance is averaged over both values. If both are set to \code{FALSE},
-  then the infinite alleles model is used. For genotypes with all missing
-  values, the result will be NA.
-
-  If the user does not provide a vector of appropriate length for
-  \code{replen} , it will be estimated by taking the minimum difference among
-  represented alleles at each locus. IT IS NOT RECOMMENDED TO RELY ON THIS
-  ESTIMATION.
+\subsection{Model Choice}{ The \code{add} and \code{loss} arguments
+  modify the model choice accordingly: \itemize{ \item \strong{Infinite
+  Model:}  \code{add = FALSE, loss = FALSE} \item \strong{Genome Addition
+  Model:}  \code{add = TRUE, loss = FALSE} \item \strong{Genome Loss Model:}
+  \code{add = FALSE, loss = TRUE} \item \strong{Combination Model}
+  \emph{(DEFAULT):}  \code{add = TRUE, loss = TRUE} } Details of each model
+  choice are described in the \strong{Details} section, above. Additionally,
+  genotypes containing all missing values at a locus will return a value of
+  \code{NA} and not contribute to the average across loci. }
+  \subsection{Repeat Lengths}{ If the user does not provide a vector of
+  appropriate length for \code{replen} , it will be estimated by taking the
+  minimum difference among represented alleles at each locus. IT IS NOT
+  RECOMMENDED TO RELY ON THIS ESTIMATION. }
 }
 \examples{
 # Please note that the data presented is assuming that the nancycat dataset

diff --git a/vignettes/algo-concordance.tex b/vignettes/algo-concordance.tex
@@ -1,2 +1,2 @@
 \Sconcordance{concordance:algo.tex:algo.Rnw:%
-1 64 1 46 0 1 8 413 1 4 0 22 1 11 0 48 1}
+1 64 1 46 0 1 8 434 1 4 0 22 1 11 0 48 1}
diff --git a/vignettes/algo.Rnw b/vignettes/algo.Rnw
@@ -51,7 +51,7 @@
   \scalebox{-1}[1]{\jala{}}
 }
 
-\title{Algorithms and equations utilized in poppr version 1.1.2.99-55}
+\title{Algorithms and equations utilized in poppr version 1.1.2.99-56}
 \author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
 Department of Botany and Plant Pathology, Oregon State University, Corvallis,
 OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,
@@ -450,6 +450,27 @@ will be calculated using the largest observed ploidy in pairwise comparisons.
 This means that when
 comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids.
 
+\subsubsection{Choosing a model}
+\label{appendix:algorithm:bruvomodel}
+By default, the implementation of Bruvo's distance in \poppr{} will utilize the
+combination model. This is implemented by setting both the \texttt{add} and 
+\texttt{loss} arguments to \texttt{TRUE}. For other models use the following 
+table for reference:
+
+\begin{table}[ht]
+\centering
+\begin{tabular}{ll}
+  \hline
+Model & Arguments \\
+  \hline
+Infinite & \texttt{add = FALSE, loss = FALSE} \\
+Genome Addition & \texttt{add = TRUE, loss = FALSE} \\
+Genome Loss & \texttt{add = FALSE, loss = TRUE} \\
+Combination (default) & \texttt{add = TRUE, loss = TRUE} \\
+  \hline
+\end{tabular}
+\end{table}
+
 \subsection{Tree topology}
 
 All of these distances were designed for analysis of populations. When applying

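Review note: the model table added to the vignette maps each model name to a pair of `add`/`loss` flags. A minimal sketch of that mapping in R follows; `bruvo_model_args` is a hypothetical convenience function written for this note and is not part of poppr. Only the `add` and `loss` arguments of `bruvo.dist()` come from the documentation in this commit.

```r
# Hypothetical helper: translate a model name from the vignette table into
# the add/loss flags accepted by bruvo.dist(). The first choice is the
# default, matching the documented default (combination model).
bruvo_model_args <- function(model = c("combination", "infinite",
                                       "addition", "loss")) {
  model <- match.arg(model)
  switch(model,
    infinite    = list(add = FALSE, loss = FALSE),
    addition    = list(add = TRUE,  loss = FALSE),
    loss        = list(add = FALSE, loss = TRUE),
    combination = list(add = TRUE,  loss = TRUE)
  )
}

default_flags <- bruvo_model_args()
loss_flags    <- bruvo_model_args("loss")
```

Assuming poppr is loaded and `dat` is a genclone object such as the one built in the vignette, the flags could then be passed along with `do.call(bruvo.dist, c(list(dat, replen = 1), loss_flags))`.
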
diff --git a/vignettes/algo.pdf b/vignettes/algo.pdf
diff --git a/vignettes/algo.tex b/vignettes/algo.tex
@@ -100,7 +100,7 @@
   \scalebox{-1}[1]{\jala{}}
 }
 
-\title{Algorithms and equations utilized in poppr version 1.1.2}
+\title{Algorithms and equations utilized in poppr version 1.1.2.99-55}
 \author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
 Department of Botany and Plant Pathology, Oregon State University, Corvallis,
 OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,
@@ -489,6 +489,27 @@ \subsubsection{Special cases of Bruvo's distance}
 This means that when
 comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids.
 
+\subsubsection{Choosing a model}
+\label{appendix:algorithm:bruvomodel}
+By default, the implementation of Bruvo's distance in \poppr{} will utilize the
+combination model. This is implemented by setting both the \texttt{add} and 
+\texttt{loss} arguments to \texttt{TRUE}. For other models use the following 
+table for reference:
+
+\begin{table}[ht]
+\centering
+\begin{tabular}{ll}
+  \hline
+Model & Arguments \\
+  \hline
+Infinite & \texttt{add = FALSE, loss = FALSE} \\
+Genome Addition & \texttt{add = TRUE, loss = FALSE} \\
+Genome Loss & \texttt{add = FALSE, loss = TRUE} \\
+Combination (default) & \texttt{add = TRUE, loss = TRUE} \\
+  \hline
+\end{tabular}
+\end{table}
+
 \subsection{Tree topology}
 
 All of these distances were designed for analysis of populations. When applying
@@ -520,7 +541,7 @@ \subsection{Tree topology}
 \begin{knitrout}\footnotesize
 \definecolor{shadecolor}{rgb}{0.933, 0.933, 0.933}\color{fgcolor}\begin{kframe}
 \begin{alltt}
-\hlkwd{library}\hlstd{(poppr)}
+\hlkwd{library}\hlstd{(}\hlstr{"poppr"}\hlstd{)}
 \hlstd{dat.df} \hlkwb{<-} \hlkwd{data.frame}\hlstd{(}\hlkwc{Genotype} \hlstd{=} \hlkwd{c}\hlstd{(}\hlstr{"1/1"}\hlstd{,} \hlstr{"1/2"}\hlstd{,} \hlstr{"2/3"}\hlstd{,} \hlstr{"3/4"}\hlstd{,} \hlstr{"4/4"}\hlstd{))}
 \hlstd{dat} \hlkwb{<-} \hlkwd{as.genclone}\hlstd{(}\hlkwd{df2genind}\hlstd{(dat.df,} \hlkwc{sep} \hlstd{=} \hlstr{"/"}\hlstd{,} \hlkwc{ind.names} \hlstd{= dat.df[[}\hlnum{1}\hlstd{]]))}
 \end{alltt}
@@ -543,7 +564,7 @@ \subsection{Tree topology}
 
 \hlcom{# Adding Bruvo's distance at the end because we need to specify repeat length.}
 \hlstd{dists}\hlopt{$}\hlstd{Bruvo} \hlkwb{<-} \hlkwd{bruvo.dist}\hlstd{(dat,} \hlkwc{replen} \hlstd{=} \hlnum{1}\hlstd{)}
-\hlkwd{library}\hlstd{(ape)}
+\hlkwd{library}\hlstd{(}\hlstr{"ape"}\hlstd{)}
 \hlkwd{par}\hlstd{(}\hlkwc{mfrow} \hlstd{=} \hlkwd{c}\hlstd{(}\hlnum{2}\hlstd{,} \hlnum{3}\hlstd{))}
 \hlstd{x} \hlkwb{<-} \hlkwd{lapply}\hlstd{(}\hlkwd{names}\hlstd{(dists),} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}
   \hlkwd{plot}\hlstd{(}\hlkwd{nj}\hlstd{(dists[[x]]),} \hlkwc{main} \hlstd{= x,} \hlkwc{type} \hlstd{=} \hlstr{"unrooted"}\hlstd{)}
@@ -552,7 +573,7 @@ \subsection{Tree topology}
 \end{alltt}
 \end{kframe}
 
-{\centering \includegraphics[width=0.95\linewidth]{figure/unnamed-chunk-3} 
+{\centering \includegraphics[width=0.95\linewidth]{figure/unnamed-chunk-3-1} 
 
 }
 

diff --git a/vignettes/poppr_manual.Rnw b/vignettes/poppr_manual.Rnw
@@ -51,7 +51,7 @@
   \scalebox{-1}[1]{\jala{}}
 }
 
-\title{Data import and manipulation in poppr version 1.1.2.99-55}
+\title{Data import and manipulation in poppr version 1.1.2.99-56}
 \author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
 Department of Botany and Plant Pathology, Oregon State University, Corvallis,
 OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,