create Data Curation and Experimental Design lectures

GotelliLab · Feb 22, 2024 · a80d345 · a80d345
1 parent e0da42c
commit a80d345
Show file tree

Hide file tree

Showing 6 changed files with 1,507 additions and 2 deletions.
diff --git a/Lectures/DataCuration.Rmd b/Lectures/DataCuration.Rmd
@@ -0,0 +1,74 @@
+---
+title: "Data Curation"
+author: "Nicholas J. Gotelli"
+date: "22 February 2024"
+output: 
+  html_document: 
+    highlight: tango
+    theme: united
+    keep_md: yes
+---
+## Data curation
+* GitHub repository   
+* Metadata   
+* [Metadata template](https://gotellilab.github.io/Bio381/Scripts/Feb02/ExcelDataTemplate.xlsx)   
+* Example of Lauren's data 
+
+## Exporting and importing data
+
+First create a tiny data set in Excel:
+```
+# comments at the top
+# beaucoup metadata
+ID, Treatment, Biomass, Notes
+1, Control, 30.3, 
+2, HighN, 13.0, 
+3, HighN, NA, broken scale
+4, Control, 35.3,
+```
+- Save as .csv
+- Inspect in RStudio Editor
+
+### Use `read.table` to bring in data
+```{r, eval=FALSE}
+my_data <- read.table(file="path/to/data.csv",
+                    header=TRUE,
+                    sep=",",
+                    comment.char="#")
+
+# inspect object
+str(my_data)
+
+# now add a column
+my_data$newVar <- runif(4)
+head(my_data)
+```
+
+### Use `write.table` to export to a data file
+
+```{r, eval=FALSE}
+
+write.table(x=my_data,
+            file="Path/To/OutputFileName.csv",
+            HEADER=TRUE,
+            sep=",")
+```
+
+
+
+But this is not a good way to save or share data objects if we are working in R. Some researchers use the `save()` function, which preserves the whole environment, but once it is restored with `load()`, the variable names cannot be changed. It is better to use `saveRDS().
+
+### `saveRDS()`: useful when you are working only in R
+```{r, eval=FALSE}
+saveRDS(my_data, file="Path/To/FileName.RDS") # .RDS suffix is not required, but good for clarity
+```
+
+This only saves a single R object as a binary, but remember, you can bundle up many things into a single list!
+
+Use `readRDS()` to restore it.
+
+### `readRDS()`
+```{r, eval=FALSE}
+data_in <-readRDS("FileName.RDS") 
+```
+
diff --git a/Lectures/DataCuration.html b/Lectures/DataCuration.html
diff --git a/Lectures/ExperimentalDesigns.Rmd b/Lectures/ExperimentalDesigns.Rmd
@@ -0,0 +1,218 @@
+---
+title: 'Experimental Designs'
+author: "Nicholas J. Gotelli"
+date: "February 22, 2024"
+output:
+  html_document:
+    highlight: tango
+    theme: united
+  pdf_document: default
+---
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE,eval=FALSE)
+```
+
+
+### Archetype Experimental Designs
+- independent versus dependent variables
+- discrete versus continuous variables
+- continuous variables (integer and real)
+- direction of cause and effect, x axis is independent
+- continuous versus discrete (natural or arbitrary or statistical bins)
+
+### Regression (dependent: continuous, independent: continuous)
+- linear model of $y = a + bx$
+- statistical tests for null of hypothesis of slope and/or intercept = 0
+- confidence and prediction intervals of uncertainty
+- goodness of fit tests for linearity
+
+### Set-up
+```{r}
+library(tidyverse)
+
+```
+
+
+### Data Frame construction for Regression Data
+
+```{r}
+n = 50  # number of observations (rows)
+
+
+varA <- runif(n) # random uniform values (independent)
+varB <- runif(n) # a second random column (dependent)
+varC <- 5.5 + varA*10 # a noisy linear relationship with varA
+ID <- seq_len(n) # creates a sequence from 1:n (if n > 0!)
+regData <- data.frame(ID,varA,varB,varC)
+head(regData)
+str(regData)
+```
+
+# Basic regression analysis in R
+```{r}
+# model
+regModel <- lm(varB~varA,data=regData)
+
+# model output
+regModel # printed output is sparse
+str(regModel) # complicated, but has "coefficients"
+head(regModel$residuals) # contains residuals
+
+# 'summary' of model has elements
+summary(regModel) # 
+summary(regModel)$coefficients
+str(summary(regModel))
+
+# best to examine entire matrix of coefficients:
+summary(regModel)$coefficients[] #shows all
+
+# can pull results from this, but a little wordy
+summary(regModel)$coefficients[1,4]   #p value for intercept
+summary(regModel)$coefficients["(Intercept)","Pr(>|t|)"] # uggh
+
+
+# alternatively unfurl this into a 1D atomic vector with names
+z <- unlist(summary(regModel))
+str(z)
+z
+z$coefficients7
+
+# grab what we need and put into a tidy  list
+
+regSum <- list(intercept=z$coefficients1,
+               slope=z$coefficients2,
+               interceptP=z$coefficients7,
+               slopeP=z$coefficients8,
+               r2=z$r.squared)
+
+# much easier to query and use
+print(regSum)
+regSum$r2
+regSum[[5]]
+
+```
+
+### Basic ggplot of regression model
+
+```{r}
+regPlot <- ggplot(data=regData,aes(x=varA,y=varB)) +
+           geom_point() +
+           stat_smooth(method=lm,se=0.99) # default se=0.95 
+print(regPlot)
+# ggsave(filename="Plot1.pdf",plot=regPlot,device="pdf")
+```
+
+### Data frame construction for one-way ANOVA
+
+```{r}
+nGroup <- 3 # number of treatment groups
+nName <- c("Control","Treat1", "Treat2") # names of groups
+nSize <- c(12,17,9) # number of observations in each group
+nMean <- c(40,41,60) # mean of each group
+nSD <- c(5,5,5) # standardd deviation of each group
+
+ID <- 1:(sum(nSize)) # id vector for each row
+resVar <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
+            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
+            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
+TGroup <- rep(nName,nSize)
+ANOdata <- data.frame(ID,TGroup,resVar)
+str(ANOdata)
+```
+
+### Basic ANOVA in R
+
+```{r}
+ANOmodel <- aov(resVar~TGroup,data=ANOdata)
+print(ANOmodel)
+print(summary(ANOmodel))
+z <- summary(ANOmodel)
+str(z)
+aggregate(resVar~TGroup,data=ANOdata,FUN=mean)
+unlist(z)
+unlist(z)[7]
+ANOsum <- list(Fval=unlist(z)[7],probF=unlist(z)[9])
+ANOsum
+
+
+```
+
+
+### Basic ggplot of ANOVA data
+
+```{r}
+ANOPlot <- ggplot(data=ANOdata,aes(x=TGroup,y=resVar,fill=TGroup)) +
+           geom_boxplot()
+print(ANOPlot)
+# ggsave(filename="Plot2.pdf",plot=ANOPlot,device="pdf")
+```
+
+
+### Data frame construction for logistic regression
+```{r}
+
+xVar <- sort(rgamma(n=200,shape=5,scale=5))
+yVar <- sample(rep(c(1,0),each=100),prob=seq_len(200))
+lRegData <- data.frame(xVar,yVar)
+
+```
+### Logistic regression analysis in R
+
+```{r}
+lRegModel <- glm(yVar ~ xVar,
+                 data=lRegData,
+                 family=binomial(link=logit))
+summary(lRegModel)
+summary(lRegModel)$coefficients
+
+```
+
+### Basic ggplot of logistic regression
+```{r}
+lRegPlot <- ggplot(data=lRegData, aes(x=xVar,y=yVar)) +
+            geom_point() +
+            stat_smooth(method=glm, method.args=list(family=binomial))
+print(lRegPlot)
+   
+```
+
+### Data for contingency table analysis
+```{r}
+# integer counts of different data groups
+vec1 <- c(50,66,22)
+vec2 <- c(120,22,30)
+dataMatrix <- rbind(vec1,vec2)
+rownames(dataMatrix) <- c("Cold","Warm")
+colnames(dataMatrix) <-c("Aphaenogaster",
+                         "Camponotus",
+                         "Crematogaster")
+str(dataMatrix)
+```
+
+### Basic contingency table analysis in R
+
+```{r}
+print(chisq.test(dataMatrix))
+```
+
+### Plotting contingency table analyses
+
+```{r}
+# some simple plots using baseR
+mosaicplot(x=dataMatrix,
+           col=c("goldenrod","grey","black"),
+           shade=FALSE)
+barplot(height=dataMatrix,
+        beside=TRUE,
+        col=c("cornflowerblue","tomato"))
+
+
+dFrame <- as.data.frame(dataMatrix)
+dFrame <- cbind(dFrame,list(Treatment=c("Cold","Warm")))
+dFrame <- gather(dFrame,key=Species,Aphaenogaster:Crematogaster,value=Counts) 
+
+p <- ggplot(data=dFrame,aes(x=Species,y=Counts,fill=Treatment)) + geom_bar(stat="identity",position="dodge",color=I("black")) +
+  scale_fill_manual(values=c("cornflowerblue","coral"))
+print(p)
+```
+
diff --git a/Lectures/ExperimentalDesigns.html b/Lectures/ExperimentalDesigns.html
diff --git a/index.Rmd b/index.Rmd
@@ -47,7 +47,7 @@ February 14  | -  | [Homework #5](Homeworks/Homework_05.html)
 February 15   | [Matrices, Lists, Data Frames II](Lectures/MatricesListsDataFrames.html) | -
 February 20 | [Probability Distributions](Lectures/ProbabilityDistributions.html)| -
 February 21  | -  | [Homework # 6](Homeworks/Homework_06.html)
-February 22 | [Data Curation](Lectures/Lecture_10.xhtml)| -
+February 22 | [Data Curation](Lectures/DataCuration.xhtml)| -
 February 27  |[Functions](Lectures/Lecture_12.xhtml) | [(Experimental Designs)](Lectures/Lecture_12a.xhtml)
 February 28  |  | [Homework # 7](Homeworks/Homework_08.html) 
 February 29  | [Functions/Structured Programming](Lectures/)  |

diff --git a/index.html b/index.html
@@ -196,7 +196,7 @@ <h2>Lecture Outlines &amp; Homework Assignments</h2>
 </tr>
 <tr class="even">
 <td>February 22</td>
-<td><a href="Lectures/Lecture_10.xhtml">Data Curation</a></td>
+<td><a href="Lectures/DataCuration.xhtml">Data Curation</a></td>
 <td>-</td>
 </tr>
 <tr class="odd">