From a80d345ff95e355f6d427cb48a030588e2ddbb54 Mon Sep 17 00:00:00 2001 From: ngotelli Date: Thu, 22 Feb 2024 10:58:30 -0500 Subject: [PATCH] create Data Curation and Experimental Design lectures --- Lectures/DataCuration.Rmd | 74 ++++ Lectures/DataCuration.html | 549 ++++++++++++++++++++++++ Lectures/ExperimentalDesigns.Rmd | 218 ++++++++++ Lectures/ExperimentalDesigns.html | 664 ++++++++++++++++++++++++++++++ index.Rmd | 2 +- index.html | 2 +- 6 files changed, 1507 insertions(+), 2 deletions(-) create mode 100644 Lectures/DataCuration.Rmd create mode 100644 Lectures/DataCuration.html create mode 100644 Lectures/ExperimentalDesigns.Rmd create mode 100644 Lectures/ExperimentalDesigns.html diff --git a/Lectures/DataCuration.Rmd b/Lectures/DataCuration.Rmd new file mode 100644 index 0000000..c18da43 --- /dev/null +++ b/Lectures/DataCuration.Rmd @@ -0,0 +1,74 @@ +--- +title: "Data Curation" +author: "Nicholas J. Gotelli" +date: "22 February 2024" +output: + html_document: + highlight: tango + theme: united + keep_md: yes +--- +## Data curation +* GitHub repository +* Metadata +* [Metadata template](https://gotellilab.github.io/Bio381/Scripts/Feb02/ExcelDataTemplate.xlsx) +* Example of Lauren's data + +## Exporting and importing data + +First create a tiny data set in Excel: +``` +# comments at the top +# beaucoup metadata +ID, Treatment, Biomass, Notes +1, Control, 30.3, +2, HighN, 13.0, +3, HighN, NA, broken scale +4, Control, 35.3, +``` +- Save as .csv +- Inspect in RStudio Editor + +### Use `read.table` to bring in data +```{r, eval=FALSE} +my_data <- read.table(file="path/to/data.csv", + header=TRUE, + sep=",", + comment.char="#") + +# inspect object +str(my_data) + +# now add a column +my_data$newVar <- runif(4) +head(my_data) +``` + +### Use `write.table` to export to a data file + +```{r, eval=FALSE} + +write.table(x=my_data, + file="Path/To/OutputFileName.csv", + HEADER=TRUE, + sep=",") +``` + + + +But this is not a good way to save or share data objects if we are working in R. Some researchers use the `save()` function, which preserves the whole environment, but once it is restored with `load()`, the variable names cannot be changed. It is better to use `saveRDS(). + +### `saveRDS()`: useful when you are working only in R +```{r, eval=FALSE} +saveRDS(my_data, file="Path/To/FileName.RDS") # .RDS suffix is not required, but good for clarity +``` + +This only saves a single R object as a binary, but remember, you can bundle up many things into a single list! + +Use `readRDS()` to restore it. + +### `readRDS()` +```{r, eval=FALSE} +data_in <-readRDS("FileName.RDS") +``` + diff --git a/Lectures/DataCuration.html b/Lectures/DataCuration.html new file mode 100644 index 0000000..a778017 --- /dev/null +++ b/Lectures/DataCuration.html @@ -0,0 +1,549 @@ + + + + + + + + + + + + + + + +Data Curation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

Data curation

+
    +
  • GitHub repository
    +
  • +
  • Metadata
    +
  • +
  • Metadata +template
    +
  • +
  • Example of Lauren’s data
  • +
+
+
+

Exporting and importing data

+

First create a tiny data set in Excel:

+
# comments at the top
+# beaucoup metadata
+ID, Treatment, Biomass, Notes
+1, Control, 30.3, 
+2, HighN, 13.0, 
+3, HighN, NA, broken scale
+4, Control, 35.3,
+
    +
  • Save as .csv
  • +
  • Inspect in RStudio Editor
  • +
+
+

Use read.table to bring in data

+
my_data <- read.table(file="path/to/data.csv",
+                    header=TRUE,
+                    sep=",",
+                    comment.char="#")
+
+# inspect object
+str(my_data)
+
+# now add a column
+my_data$newVar <- runif(4)
+head(my_data)
+
+
+

Use write.table to export to a data file

+
write.table(x=my_data,
+            file="Path/To/OutputFileName.csv",
+            HEADER=TRUE,
+            sep=",")
+

But this is not a good way to save or share data objects if we are +working in R. Some researchers use the save() function, +which preserves the whole environment, but once it is restored with +load(), the variable names cannot be changed. It is better +to use `saveRDS().

+
+
+

saveRDS(): useful when you are working only in R

+
saveRDS(my_data, file="Path/To/FileName.RDS") # .RDS suffix is not required, but good for clarity
+

This only saves a single R object as a binary, but remember, you can +bundle up many things into a single list!

+

Use readRDS() to restore it.

+
+
+

readRDS()

+
data_in <-readRDS("FileName.RDS") 
+
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/Lectures/ExperimentalDesigns.Rmd b/Lectures/ExperimentalDesigns.Rmd new file mode 100644 index 0000000..c4da73d --- /dev/null +++ b/Lectures/ExperimentalDesigns.Rmd @@ -0,0 +1,218 @@ +--- +title: 'Experimental Designs' +author: "Nicholas J. Gotelli" +date: "February 22, 2024" +output: + html_document: + highlight: tango + theme: united + pdf_document: default +--- +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE,eval=FALSE) +``` + + +### Archetype Experimental Designs +- independent versus dependent variables +- discrete versus continuous variables +- continuous variables (integer and real) +- direction of cause and effect, x axis is independent +- continuous versus discrete (natural or arbitrary or statistical bins) + +### Regression (dependent: continuous, independent: continuous) +- linear model of $y = a + bx$ +- statistical tests for null of hypothesis of slope and/or intercept = 0 +- confidence and prediction intervals of uncertainty +- goodness of fit tests for linearity + +### Set-up +```{r} +library(tidyverse) + +``` + + +### Data Frame construction for Regression Data + +```{r} +n = 50 # number of observations (rows) + + +varA <- runif(n) # random uniform values (independent) +varB <- runif(n) # a second random column (dependent) +varC <- 5.5 + varA*10 # a noisy linear relationship with varA +ID <- seq_len(n) # creates a sequence from 1:n (if n > 0!) +regData <- data.frame(ID,varA,varB,varC) +head(regData) +str(regData) +``` + +# Basic regression analysis in R +```{r} +# model +regModel <- lm(varB~varA,data=regData) + +# model output +regModel # printed output is sparse +str(regModel) # complicated, but has "coefficients" +head(regModel$residuals) # contains residuals + +# 'summary' of model has elements +summary(regModel) # +summary(regModel)$coefficients +str(summary(regModel)) + +# best to examine entire matrix of coefficients: +summary(regModel)$coefficients[] #shows all + +# can pull results from this, but a little wordy +summary(regModel)$coefficients[1,4] #p value for intercept +summary(regModel)$coefficients["(Intercept)","Pr(>|t|)"] # uggh + + +# alternatively unfurl this into a 1D atomic vector with names +z <- unlist(summary(regModel)) +str(z) +z +z$coefficients7 + +# grab what we need and put into a tidy list + +regSum <- list(intercept=z$coefficients1, + slope=z$coefficients2, + interceptP=z$coefficients7, + slopeP=z$coefficients8, + r2=z$r.squared) + +# much easier to query and use +print(regSum) +regSum$r2 +regSum[[5]] + +``` + +### Basic ggplot of regression model + +```{r} +regPlot <- ggplot(data=regData,aes(x=varA,y=varB)) + + geom_point() + + stat_smooth(method=lm,se=0.99) # default se=0.95 +print(regPlot) +# ggsave(filename="Plot1.pdf",plot=regPlot,device="pdf") +``` + +### Data frame construction for one-way ANOVA + +```{r} +nGroup <- 3 # number of treatment groups +nName <- c("Control","Treat1", "Treat2") # names of groups +nSize <- c(12,17,9) # number of observations in each group +nMean <- c(40,41,60) # mean of each group +nSD <- c(5,5,5) # standardd deviation of each group + +ID <- 1:(sum(nSize)) # id vector for each row +resVar <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]), + rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]), + rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3])) +TGroup <- rep(nName,nSize) +ANOdata <- data.frame(ID,TGroup,resVar) +str(ANOdata) +``` + +### Basic ANOVA in R + +```{r} +ANOmodel <- aov(resVar~TGroup,data=ANOdata) +print(ANOmodel) +print(summary(ANOmodel)) +z <- summary(ANOmodel) +str(z) +aggregate(resVar~TGroup,data=ANOdata,FUN=mean) +unlist(z) +unlist(z)[7] +ANOsum <- list(Fval=unlist(z)[7],probF=unlist(z)[9]) +ANOsum + + +``` + + +### Basic ggplot of ANOVA data + +```{r} +ANOPlot <- ggplot(data=ANOdata,aes(x=TGroup,y=resVar,fill=TGroup)) + + geom_boxplot() +print(ANOPlot) +# ggsave(filename="Plot2.pdf",plot=ANOPlot,device="pdf") +``` + + +### Data frame construction for logistic regression +```{r} + +xVar <- sort(rgamma(n=200,shape=5,scale=5)) +yVar <- sample(rep(c(1,0),each=100),prob=seq_len(200)) +lRegData <- data.frame(xVar,yVar) + +``` +### Logistic regression analysis in R + +```{r} +lRegModel <- glm(yVar ~ xVar, + data=lRegData, + family=binomial(link=logit)) +summary(lRegModel) +summary(lRegModel)$coefficients + +``` + +### Basic ggplot of logistic regression +```{r} +lRegPlot <- ggplot(data=lRegData, aes(x=xVar,y=yVar)) + + geom_point() + + stat_smooth(method=glm, method.args=list(family=binomial)) +print(lRegPlot) + +``` + +### Data for contingency table analysis +```{r} +# integer counts of different data groups +vec1 <- c(50,66,22) +vec2 <- c(120,22,30) +dataMatrix <- rbind(vec1,vec2) +rownames(dataMatrix) <- c("Cold","Warm") +colnames(dataMatrix) <-c("Aphaenogaster", + "Camponotus", + "Crematogaster") +str(dataMatrix) +``` + +### Basic contingency table analysis in R + +```{r} +print(chisq.test(dataMatrix)) +``` + +### Plotting contingency table analyses + +```{r} +# some simple plots using baseR +mosaicplot(x=dataMatrix, + col=c("goldenrod","grey","black"), + shade=FALSE) +barplot(height=dataMatrix, + beside=TRUE, + col=c("cornflowerblue","tomato")) + + +dFrame <- as.data.frame(dataMatrix) +dFrame <- cbind(dFrame,list(Treatment=c("Cold","Warm"))) +dFrame <- gather(dFrame,key=Species,Aphaenogaster:Crematogaster,value=Counts) + +p <- ggplot(data=dFrame,aes(x=Species,y=Counts,fill=Treatment)) + geom_bar(stat="identity",position="dodge",color=I("black")) + + scale_fill_manual(values=c("cornflowerblue","coral")) +print(p) +``` + diff --git a/Lectures/ExperimentalDesigns.html b/Lectures/ExperimentalDesigns.html new file mode 100644 index 0000000..ff33e1f --- /dev/null +++ b/Lectures/ExperimentalDesigns.html @@ -0,0 +1,664 @@ + + + + + + + + + + + + + + + +Experimental Designs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

Archetype Experimental Designs

+
    +
  • independent versus dependent variables
  • +
  • discrete versus continuous variables
  • +
  • continuous variables (integer and real)
  • +
  • direction of cause and effect, x axis is independent
  • +
  • continuous versus discrete (natural or arbitrary or statistical +bins)
  • +
+
+
+

Regression (dependent: continuous, independent: continuous)

+
    +
  • linear model of \(y = a + bx\)
  • +
  • statistical tests for null of hypothesis of slope and/or intercept = +0
  • +
  • confidence and prediction intervals of uncertainty
  • +
  • goodness of fit tests for linearity
  • +
+
+
+

Set-up

+
library(tidyverse)
+
+
+

Data Frame construction for Regression Data

+
n = 50  # number of observations (rows)
+
+
+varA <- runif(n) # random uniform values (independent)
+varB <- runif(n) # a second random column (dependent)
+varC <- 5.5 + varA*10 # a noisy linear relationship with varA
+ID <- seq_len(n) # creates a sequence from 1:n (if n > 0!)
+regData <- data.frame(ID,varA,varB,varC)
+head(regData)
+str(regData)
+
+
+

Basic regression analysis in R

+
# model
+regModel <- lm(varB~varA,data=regData)
+
+# model output
+regModel # printed output is sparse
+str(regModel) # complicated, but has "coefficients"
+head(regModel$residuals) # contains residuals
+
+# 'summary' of model has elements
+summary(regModel) # 
+summary(regModel)$coefficients
+str(summary(regModel))
+
+# best to examine entire matrix of coefficients:
+summary(regModel)$coefficients[] #shows all
+
+# can pull results from this, but a little wordy
+summary(regModel)$coefficients[1,4]   #p value for intercept
+summary(regModel)$coefficients["(Intercept)","Pr(>|t|)"] # uggh
+
+
+# alternatively unfurl this into a 1D atomic vector with names
+z <- unlist(summary(regModel))
+str(z)
+z
+z$coefficients7
+
+# grab what we need and put into a tidy  list
+
+regSum <- list(intercept=z$coefficients1,
+               slope=z$coefficients2,
+               interceptP=z$coefficients7,
+               slopeP=z$coefficients8,
+               r2=z$r.squared)
+
+# much easier to query and use
+print(regSum)
+regSum$r2
+regSum[[5]]
+
+

Basic ggplot of regression model

+
regPlot <- ggplot(data=regData,aes(x=varA,y=varB)) +
+           geom_point() +
+           stat_smooth(method=lm,se=0.99) # default se=0.95 
+print(regPlot)
+# ggsave(filename="Plot1.pdf",plot=regPlot,device="pdf")
+
+
+

Data frame construction for one-way ANOVA

+
nGroup <- 3 # number of treatment groups
+nName <- c("Control","Treat1", "Treat2") # names of groups
+nSize <- c(12,17,9) # number of observations in each group
+nMean <- c(40,41,60) # mean of each group
+nSD <- c(5,5,5) # standardd deviation of each group
+
+ID <- 1:(sum(nSize)) # id vector for each row
+resVar <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
+            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
+            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
+TGroup <- rep(nName,nSize)
+ANOdata <- data.frame(ID,TGroup,resVar)
+str(ANOdata)
+
+
+

Basic ANOVA in R

+
ANOmodel <- aov(resVar~TGroup,data=ANOdata)
+print(ANOmodel)
+print(summary(ANOmodel))
+z <- summary(ANOmodel)
+str(z)
+aggregate(resVar~TGroup,data=ANOdata,FUN=mean)
+unlist(z)
+unlist(z)[7]
+ANOsum <- list(Fval=unlist(z)[7],probF=unlist(z)[9])
+ANOsum
+
+
+

Basic ggplot of ANOVA data

+
ANOPlot <- ggplot(data=ANOdata,aes(x=TGroup,y=resVar,fill=TGroup)) +
+           geom_boxplot()
+print(ANOPlot)
+# ggsave(filename="Plot2.pdf",plot=ANOPlot,device="pdf")
+
+
+

Data frame construction for logistic regression

+
xVar <- sort(rgamma(n=200,shape=5,scale=5))
+yVar <- sample(rep(c(1,0),each=100),prob=seq_len(200))
+lRegData <- data.frame(xVar,yVar)
+
+
+

Logistic regression analysis in R

+
lRegModel <- glm(yVar ~ xVar,
+                 data=lRegData,
+                 family=binomial(link=logit))
+summary(lRegModel)
+summary(lRegModel)$coefficients
+
+
+

Basic ggplot of logistic regression

+
lRegPlot <- ggplot(data=lRegData, aes(x=xVar,y=yVar)) +
+            geom_point() +
+            stat_smooth(method=glm, method.args=list(family=binomial))
+print(lRegPlot)
+
+
+

Data for contingency table analysis

+
# integer counts of different data groups
+vec1 <- c(50,66,22)
+vec2 <- c(120,22,30)
+dataMatrix <- rbind(vec1,vec2)
+rownames(dataMatrix) <- c("Cold","Warm")
+colnames(dataMatrix) <-c("Aphaenogaster",
+                         "Camponotus",
+                         "Crematogaster")
+str(dataMatrix)
+
+
+

Basic contingency table analysis in R

+
print(chisq.test(dataMatrix))
+
+
+

Plotting contingency table analyses

+
# some simple plots using baseR
+mosaicplot(x=dataMatrix,
+           col=c("goldenrod","grey","black"),
+           shade=FALSE)
+barplot(height=dataMatrix,
+        beside=TRUE,
+        col=c("cornflowerblue","tomato"))
+
+
+dFrame <- as.data.frame(dataMatrix)
+dFrame <- cbind(dFrame,list(Treatment=c("Cold","Warm")))
+dFrame <- gather(dFrame,key=Species,Aphaenogaster:Crematogaster,value=Counts) 
+
+p <- ggplot(data=dFrame,aes(x=Species,y=Counts,fill=Treatment)) + geom_bar(stat="identity",position="dodge",color=I("black")) +
+  scale_fill_manual(values=c("cornflowerblue","coral"))
+print(p)
+
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/index.Rmd b/index.Rmd index e8951d8..c633a99 100644 --- a/index.Rmd +++ b/index.Rmd @@ -47,7 +47,7 @@ February 14 | - | [Homework #5](Homeworks/Homework_05.html) February 15 | [Matrices, Lists, Data Frames II](Lectures/MatricesListsDataFrames.html) | - February 20 | [Probability Distributions](Lectures/ProbabilityDistributions.html)| - February 21 | - | [Homework # 6](Homeworks/Homework_06.html) -February 22 | [Data Curation](Lectures/Lecture_10.xhtml)| - +February 22 | [Data Curation](Lectures/DataCuration.xhtml)| - February 27 |[Functions](Lectures/Lecture_12.xhtml) | [(Experimental Designs)](Lectures/Lecture_12a.xhtml) February 28 | | [Homework # 7](Homeworks/Homework_08.html) February 29 | [Functions/Structured Programming](Lectures/) | diff --git a/index.html b/index.html index 8848126..fff3fd5 100644 --- a/index.html +++ b/index.html @@ -196,7 +196,7 @@

Lecture Outlines & Homework Assignments

February 22 -Data Curation +Data Curation -