R package to create STRUCTURE-esque bar plots and compare species delimitation models with multivariate phenotype data.
This package requires an external program, CLUMPP, which can be dowloaded here. Make note of the full path to your CLUMPP executable, you will need it below.
To download and install this package
install.packages("devtools")
library(devtools)
install_github("dmacguigan/PhenoDelimit")
library(PhenoDelimit)
Data are in "example" directory, so clone the repository first. Data were generated using simulate_data.R in the example folder. Dataset consists of 20 continuous variables for 300 idividuals divided into four groups. Data were simulated by drawing from normal disrtibutions. For each variable, means were allowed to vary between groups, but standard deviations were kept constant.
Variables 1-10 have a large range of group means (between 5 and 30) and small standard deviations (between 1 and 3). These variables are expected to be more informative about group membership.
Variables 11-20 have a small range of group means (between 18 and 22) and large standard deviations (between 3 and 10). Thus, these variables are expected to be less informative about group membership.
We will test 6 different delimitation models:
- M1 - model used to simulate the data (4 groups total)
- M2 - merge groups 1 + 2 and 3 + 4 (2 groups total)
- M3 - merge groups 1 + 2 (3 groups total)
- M4 - randomly split group 4 into 2 groups (5 groups total)
- M5 - randomly take samples from groups 3 and 4 and create a new group (5 groups total)
- M6 - keep 4 equal sized groups but randomize individual assignment (4 groups total)
Use CLUMPP to compare discriminant analysis assignment probabilities to each delimitation model. CLUMPP calculates H'
, a matrix similarity metric bounded between 0 and 1.
A value of 1 indicates a perfect match between two matrices.
See this paper in Bioinformatics for more info about the CLUMPP algorithm.
In this analysis, we provide CLUMPP with two matrices. First, a matrix where each individual is assigned (with probability 1.0) to a group based on the underlying delimitation model. Second, a matrix of individual assignment probabilities from discriminant analysis using K-means clusters, with no a priori information about group membership. By comparing the two matrices with CLUMPP, we can determine which delimitation model most closely matches the K-means clusters, which is indicative of real group differences in the data.
library(RColorBrewer)
wd = "H:/PhenoDelimit/example" # results will be written to new subdirectory "CLUMPP"
CLUMPP_exe = "C:/CLUMPP_Windows.1.1.2/CLUMPP"
data = read.table("H:/PhenoDelimit/example/sim_data.txt", header=TRUE)
n.groups = c(4,2,3,5,5,4)
model.numbers = c(1:6)
models = read.table("H:/PhenoDelimit/example/sim_models.txt", header=TRUE)
perc.var = c(70,80,90)
scale = TRUE
center = TRUE
dapc_clumpp(wd=wd,
CLUMPP_exe=CLUMPP_exe,
data=data,
n.groups=n.groups,
model.numbers=model.numbers,
models=models,
perc.var=perc.var,
scale=scale,
center=center,
apriori=FALSE)
# let's run this step again but using a priori population assignments
# we'll come back to these results later
dapc_clumpp(wd=wd,
CLUMPP_exe=CLUMPP_exe,
data=data,
n.groups=n.groups,
model.numbers=model.numbers,
models=models,
perc.var=perc.var,
scale=scale,
center=center,
apriori=TRUE)
wd = "H:/PhenoDelimit/example"
model.numbers = c(1:6)
perc.var = c(70,80,90)
clumpp_results <- read_clumpp_results(wd=wd,
perc.var=perc.var,
model.numbers=model.numbers)
wd = "H:/PhenoDelimit/example/"
clumpp.data = clumpp_results
colors = c("gray70", "gray30", "black")
plot.type = "png"
plot.name = "H_plot_example"
plot.width = 8
plot.height = 4
plot_clumpp_results(wd=wd,
clumpp.data=clumpp.data,
colors=colors,
plot.name = plot.name,
plot.type=plot.type,
plot.width=plot.width,
plot.height=plot.height)
Remember, model 1 is the "true" model which generated the data. So it's comforting to see that model 1 has a near perfect H'. Models 2-5 are tweaked versions of model 1 with groups merged or split. Model 6 randomly shuffled the "true" group assignments from model 1. So no surprise that model 6 has the lowest H'.
Here we will permute the morphological data and rerun the DAPC and CLUMPP analyses. We can use this to test the statistical significance of our models. This may take a while depending on how many models you have and how many permutations you wish to perform.
wd = "H:/PhenoDelimit/example" # results will be written to new subdirectory "CLUMPP_permuted"
CLUMPP_exe = "C:/CLUMPP_Windows.1.1.2/CLUMPP"
data = read.table("H:/PhenoDelimit/example/sim_data.txt", header=TRUE)
n.groups = c(4,2,3,5,5,4)
model.numbers = c(1:6)
models = read.table("H:/PhenoDelimit/example/sim_models.txt", header=TRUE)
perc.var = c(70,80,90)
scale = TRUE
center = TRUE
dapc_clumpp_permuted(wd=wd,
CLUMPP_exe=CLUMPP_exe,
data=data,
n.groups=n.groups,
model.numbers=model.numbers,
models=models,
perc.var=perc.var,
permutations=100,
scale=scale,
center=center)
clumpp_perm_df <- read_clumpp_results_permuted(wd=wd,
perc.var=perc.var,
model.numbers=model.numbers,
permutations=100)
H_permutation_plot(wd=wd,
clumpp.data=clumpp_results,
clumpp.data.permuted=clumpp_perm_df,
model.numbers=model.numbers,
best.perc.var=70,
plot.type="png",
plot.prefix="example",
plot.width=6,
plot.height=4,
sig.threshold=0.05)
In these figure, we can see that the observed H' (red dashed line) for model 1 is signficantly higher than the distribution of permuted H' values (gray bars). On the other hand, the observed H' (red dashed line) for model 6 is not signficantly different from the permuted H' distribution (gray bars). This makes sense, since model 1 was used to generated the simulated data, while model 6 was itself a permutation of the underlying data.
Models 2-5 all have statistically significantly higher H' values than their permuted distributions, but not as large of a differnce as we see for model 1. This also makes sense, since models 2-5 are rearrangements of the groups in model 1 used to simulate the data. But models 2-5 do not reshuffle the entire dataset like model 6.
models = read.table("H:/PhenoDelimit/example/sim_models.txt", header=TRUE)
wd = "H:/PhenoDelimit/example/"
clumpp.wd = "H:/PhenoDelimit/example/CLUMPP"
sample.names = (1:nrow(models))
sample.plot.groups = models$m1
sample.plot.groups.order = c(1,2,3,4)
sample.order = (nrow(models):1)
best.perc.var = 90
best.model.number = 1
plot.type = "png"
plot.width = 20
plot.height = 4
colors = brewer.pal(n = 5, name = "Set1")
border.color = "gray"
assign_probs_barplot(wd = wd,
clumpp.wd = clumpp.wd,
sample.names = sample.names,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
#sample.order = sample.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
border.color = border.color,
apriori=FALSE)
# we can also creat a bar plot based on results using a priori assignment of individuals to clusters
assign_probs_barplot(wd = wd,
clumpp.wd = clumpp.wd,
sample.names = sample.names,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
#sample.order = sample.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
border.color = border.color,
apriori=TRUE)
# lets make bar plots for all of the other models using a loop
for(m in 2:6){
best.model.number = m
sample.names = (1:nrow(models))
sample.plot.groups = models[,best.model.number]
sample.plot.groups.order = names(table(sample.plot.groups))
sample.order = (nrow(models):1)
best.perc.var = 90
assign_probs_barplot(wd = wd,
clumpp.wd = clumpp.wd,
sample.names = sample.names,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
sample.order = sample.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
border.color = border.color,
apriori=FALSE)
assign_probs_barplot(wd = wd,
clumpp.wd = clumpp.wd,
sample.names = sample.names,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
sample.order = sample.order,
best.perc.var = best.perc.var,
best.model.number = m,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
border.color = border.color,
apriori=TRUE)
}
Example bar plot of model 1 using k-means cluster assignment of individuals to 4 groups. Since model 1 was used to simulate the data, the bar plot matches the 4 clusters very well.
Example bar plot of model 5 using k-means cluster assignment of individuals to 5 groups. Model 5 randomly split group 4 into two seperate groups. We can see this reflected in the bar plot, where individuals in groups 4 and 5 are split in assignment between two groups.
Example bar plot of model 6 using k-means cluster assignment of individuals to 4 groups. Model 6 randomly shuffled the group assignment of all individuals, resulting in no clear relationship between the 4 groups and the individual assignment probabilities.
models = read.table("H:/PhenoDelimit/example/sim_models.txt", header=TRUE)
wd = "H:/PhenoDelimit/example/"
clumpp.wd = "H:/PhenoDelimit/example/CLUMPP"
sample.plot.groups = models$m1
sample.plot.groups.order = c(1,2,3,4)
sample.order = (nrow(models):1)
best.perc.var = 90
best.model.number = 1
plot.type = "png"
plot.width = 6
plot.height = 6
colors = brewer.pal(n = 5, name = "Set1")
border.color = "gray"
shapes=c(1:4)
x.axis=1
y.axis=2
plot_discriminant_axes(wd = wd,
clumpp.wd = clumpp.wd,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
shapes = shapes,
x.axis = x.axis,
y.axis = y.axis,
apriori=FALSE)
y.axis=3
plot_discriminant_axes(wd = wd,
clumpp.wd = clumpp.wd,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
shapes = shapes,
x.axis = x.axis,
y.axis = y.axis,
apriori=FALSE)
# now make the same plots, but using the a priori group assignments
x.axis=1
y.axis=2
plot_discriminant_axes(wd = wd,
clumpp.wd = clumpp.wd,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
shapes = shapes,
x.axis = x.axis,
y.axis = y.axis,
apriori=TRUE)
y.axis=3
plot_discriminant_axes(wd = wd,
clumpp.wd = clumpp.wd,
sample.plot.groups = sample.plot.groups,
sample.plot.groups.order = sample.plot.groups.order,
best.perc.var = best.perc.var,
best.model.number = best.model.number,
plot.type = plot.type,
plot.width = plot.width,
plot.height = plot.height,
colors = colors,
shapes = shapes,
x.axis = x.axis,
y.axis = y.axis,
apriori=TRUE)
In this case, since model 1 was used to simulate the data, we don't see a difference in the k-means versus a priori clustering results.
wd = "H:/PhenoDelimit/example/"
clumpp.wd = "H:/PhenoDelimit/example/CLUMPP"
best.perc.var = 90
best.model.number = 1
plot.type = "png"
plot.width = 6
plot.height = 6
# Variable loading on discriminant axis 1
axis=1
discriminant_loading(wd = wd, clumpp.wd = clumpp.wd,
best.perc.var = best.perc.var, best.model.number = best.model.number,
plot.type = plot.type, plot.width = plot.width, plot.height = plot.height, axis = axis)
# Variable loading on discriminant axis 2
axis=2
discriminant_loading(wd = wd, clumpp.wd = clumpp.wd,
best.perc.var = best.perc.var, best.model.number = best.model.number,
plot.type = plot.type, plot.width = plot.width, plot.height = plot.height, axis = axis)
# Variable loading on discriminant axis 3
axis=3
discriminant_loading(wd = wd, clumpp.wd = clumpp.wd,
best.perc.var = best.perc.var, best.model.number = best.model.number,
plot.type = plot.type, plot.width = plot.width, plot.height = plot.height, axis = axis)
We can see that variables 1-10 tend to load heavily on at least one discriminant axis, while variables 11-20 do not load heavily on any axes. This is expected given that variables 1-10 are simulated to have large differences between group means and small standard deviations, while variables 11-20 have small differences between group means and large standard deviations.
In a real dataset, this could help us identify which traits are most diagnostic for which species.