```{r, echo=FALSE, warning=FALSE, message=FALSE}
## Run but not shown
## Getting data ready for the examples
library(foreign)
library(furniture)
library(tidyverse)
load("~/Dropbox/GitHub/blog_rstats/assets/Data/NHANES_2012.rda")
```
# Chapter 4: Basic Analyses {-}
> "The goal is to turn data into information, and information into insight." --- Carly Fiorina
In this chapter we are going to demonstrate basic modeling in `R`. Lucky for us, `R` is built for these analyses. It is actually quite straightforward to run these types of models and analyze the output. Not only that, but there are simple ways to compare models.
We will go through the **ANOVA** family of analyses, the **linear regression** models, and look at **diagnostics** of each.
## ANOVA {-}
ANOVA stands for **an**alysis **o**f **va**riance. It is a family of methods (e.g., ANCOVA, MANOVA) that all compare a continuous dependent variable across the levels of a grouping factor variable (and may include multiple outcomes or other covariates).
$$
Y_i = \alpha_0 + \alpha_1 \text{Group}_i + e_i
$$
Since the groups are compared using "effect coding," $\alpha_0$ is the grand mean and each of the group-level means is compared to it.
To run an ANOVA model, you can simply use the `aov` function. In the example below, we analyze whether family size (although not fully continuous, it is still useful for the example) differs by race.
```{r}
df$race <- factor(df$ridreth1,
                  labels = c("MexicanAmerican", "OtherHispanic", "White",
                             "Black", "Other"))
df$famsize <- as.numeric(df$dmdfmsiz)

fit <- aov(famsize ~ race, df)
anova(fit)
```
We make sure the variables are the right type, then we use the `aov` function. Inside the function we have what is called a formula, which has the general structure `leftside ~ rightside`. Generally, the left side is an outcome variable and the right side contains the predictor (i.e., independent) variables. Here, we have `race` predicting `famsize`. We assign the model to the name `fit`, a common way of denoting that it is a fitted model. Finally, we use the `anova` function to output a nice ANOVA table.
In the output we see the normal ANOVA table and we can see the p-value (`Pr(>F)`) is very, very small and thus is quite significant. We can look at how the groups relate using a box plot. We will be using some of the practice you got in Chapter 3 using `ggplot2` for this.
```{r}
library(ggplot2)
ggplot(df, aes(x = race, y = famsize)) +
  geom_boxplot(aes(color = race)) +
  scale_color_manual(guide = FALSE,
                     values = c("dodgerblue3", "coral2", "chartreuse4",
                                "darkorchid", "firebrick2")) +
  theme_bw()
```
This immediately gives us an idea of where some differences may be occurring. It would appear that the "White" and "MexicanAmerican" groups differ in family size.
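If we want to formally test which pairs of groups differ, base `R` provides `TukeyHSD()` for `aov` objects. As a quick sketch (output not shown here):
```{r, eval=FALSE}
## Tukey's Honest Significant Differences: all pairwise group comparisons
TukeyHSD(fit)
```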
## Assumptions {-}
We also would like to make sure the assumptions look like they are being met. In ANOVA, we want the residuals to be distributed normally, the variance of each group should be approximately the same, the groups are assumed to be randomly assigned, and the sample should be randomly selected as well.
In `R` we can get some simple graphical checks using `plot`. All we provide is our ANOVA object (here, `fit`). The line before it, `par(mfrow=c(1,2))`, tells `R` to arrange the plots in a grid with one row and two columns.
```{r, fig.height=4, fig.width=5}
par(mfrow=c(1,2))
plot(fit)
```
Here, it looks like we have a problem with normality (see the Normal Q-Q plot). Those dots should approximately follow the dotted line, which is not the case. The first plot (Residuals vs. Fitted) suggests we have approximate homoskedasticity.
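The graphical checks can be complemented with formal tests. As a sketch (not run here), `bartlett.test()` tests homogeneity of variance across the groups and `shapiro.test()` tests normality of the residuals:
```{r, eval=FALSE}
## Bartlett's test of equal variances across the race groups
bartlett.test(famsize ~ race, data = df)
## Shapiro-Wilk normality test on the model residuals
## (note: shapiro.test() requires between 3 and 5000 observations)
shapiro.test(residuals(fit))
```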
## Linear Modeling {-}
Linear regression is nearly identical to ANOVA. In fact, a linear regression with a continuous outcome and categorical predictor is exactly the same (if we use effect coding). For example, if we run the same model but with the linear regression function `lm` we get the same ANOVA table.
```{r}
fit2 <- lm(famsize ~ race, data=df)
anova(fit2)
```
Surprise! It is the same as before. Here we can also use the `summary` function and we get the coefficients in the model as well (using dummy coding). The first level of the categorical variable is the reference group (the group that the others are compared to). We also get the intercept (in this case, the average value of the reference group).
```{r}
summary(fit2)
```
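By default, the first level of the factor ("MexicanAmerican" here) is the reference group. If another reference makes more sense, `relevel()` changes it. This sketch is not run, since the later examples assume the original ordering:
```{r, eval=FALSE}
## Make "White" the reference group and refit the model
df$race <- relevel(df$race, ref = "White")
summary(lm(famsize ~ race, data = df))
```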
## Assumptions {-}
Linear regression has a few important assumptions, often called "Gauss-Markov Assumptions". These include:
1. The model is linear in parameters.
2. Homoskedasticity (i.e. the variance of the residual is roughly uniform across the values of the independents).
3. Normality of residuals.
Numbers 2 and 3 are fairly easy to assess using the `plot()` function on the model object, as we did with the ANOVA model. "Linear in parameters" means that the relationship between the outcome and the independent variables is linear.
```{r, fig.height=4, fig.width=5}
par(mfrow=c(1,2))
plot(fit2)
```
## Comparing Models {-}
Often when running linear regression, we want to compare models and see if one fits significantly better than another. We also often want to present all the models in a table to let our readers compare the models. We will demonstrate both.
### Compare Statistically {-}
Using the `anova()` function, we can compare models statistically.
```{r}
anova(fit, fit2)
```
The `anova()` function works with all sorts of modeling schemes and can help in model selection. Not surprisingly, when we compared the ANOVA and the simple linear model, they are *exactly* the same in overall model terms (the only difference is in how the categorical variable is coded---either effect coding in ANOVA or dummy coding in regression). For a more interesting comparison, let's run a new model with an additional variable and then make a comparison.
```{r}
fit3 <- lm(famsize ~ race + marriage, data = df)
summary(fit3)
```
Notice that the added variable (`marriage`) is associated with the outcome according to the t-tests seen in the summary. So we would expect that `fit3` is better than `fit2` at explaining the outcome, which we see in the output below.
```{r}
anova(fit2, fit3)
```
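As a side note, information criteria offer another way to compare models, and unlike `anova()` they do not require the models to be nested. A minimal sketch:
```{r, eval=FALSE}
## Lower AIC/BIC values indicate better fit (penalized for complexity)
AIC(fit2, fit3)
BIC(fit2, fit3)
```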
### Compare in a Table {-}
We can also compare the models in a well-formatted table that makes many aspects easy to compare.
Two main packages allow us to compare models:
1. `stargazer`
2. `texreg`
Both provide simple functions to compare multiple models. For example, `stargazer` provides:
```{r}
library(stargazer)
stargazer(fit2, fit3,
          type = "text")
```
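Similarly, `texreg` can print a side-by-side text table with its `screenreg()` function (it also has `texreg()` for LaTeX output and `htmlreg()` for HTML). A brief sketch:
```{r, eval=FALSE}
library(texreg)
## Text-based side-by-side comparison of the two models
screenreg(list(fit2, fit3))
```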
## When Assumptions Fail {-}
There are many things we can try when our assumptions fail. In my opinion, the best and most interpretable way is to use a Generalized Linear Model (GLM) which is discussed in the next chapter. There are a few other things you can try which I'll show here. But, keep in mind that these things can cause other problems. For example, to fix normality we may accidentally cause heteroskedasticity. With that in mind, here are some common methods to help a model fit better.
### Log-Linear, Log-Log, Linear-Log, Other {-}
Sounds like a great tongue-twister? Well, it is, but these are also three common ways of specifying (i.e., deciding what is in) your model to improve its fit.
**Log-Linear** is where we adjust the outcome variable by a natural log transformation. This is done easily in `R`:
```{r, eval=FALSE}
df$log_outcome <- log(df$outcome)
lm(log_outcome ~ var1, data=df)
```
**Log-Log** is where we adjust both the outcome and the predictor variable with a log transformation. This is also easily done:
```{r, eval=FALSE}
df$log_outcome <- log(df$outcome)
df$log_var1 <- log(df$var1)
lm(log_outcome ~ log_var1, data=df)
```
**Linear-Log** is where we adjust just the predictor variable with a log transformation. And, you guessed it, this is easily done in `R`:
```{r, eval=FALSE}
df$log_var1 <- log(df$var1)
lm(outcome ~ log_var1 + var2, data=df)
```
**Other** methods, such as taking the square root of the outcome or applying some power function (e.g., square, cube), are also quite common. There are also functions that search for a good transformation; a brief sketch of one is shown below. Still, I think GLMs are generally better, so if you want to learn about other ways to help your linear model, go to the next chapter.
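One such function is `boxcox()` in the `MASS` package, which profiles the log-likelihood across a range of power transformations of the outcome. This is only a sketch using the hypothetical `outcome` and `var1` variables from above, and it assumes the outcome is strictly positive:
```{r, eval=FALSE}
library(MASS)
## Profile the Box-Cox family of power transformations
## (lambda = 0 corresponds to the log transformation)
bc <- boxcox(outcome ~ var1, data = df)
## The lambda value with the highest log-likelihood
bc$x[which.max(bc$y)]
```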
## Interactions {-}
Many times hypotheses dealing with human beings include interactions between effects. Interactions are when the effect of one variable depends on another variable. For example, the effect of marital status on family size may depend on whether the individual is a minority. In fact, this is the hypothesis we'll test below.
Including interactions in ANOVA- and regression-type models is very simple in `R`. Since interaction effects are often best interpreted through plots, we will also show simple methods to visualize them.
### Interactions in ANOVA {-}
In general, we refer to ANOVAs with interactions as "two-way factorial ANOVAs." We interact race and marital status in this ANOVA. For simplicity, we create a binary race variable called `minority` using the `ifelse()` function, which we explain in more depth in Chapter 5.
```{r}
df$minority <- factor(ifelse(df$race == "White", 0, 1),
                      labels = c("White", "Minority"))

fit_anova <- aov(famsize ~ minority*marriage, df)
anova(fit_anova)
```
Notice two things: First, the interaction is significant (p = .003). This is important since we are going to try to interpret this interaction. Second, by including `minority*marriage` we get both the main effects and the interaction. This is very important for interpretation purposes, so you can thank `R` for making it a bit easier on you.
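As a quick illustration of the formula expansion, the following sketch spells the model out term by term; it fits exactly the same model as above:
```{r, eval=FALSE}
## minority*marriage is shorthand for main effects plus the interaction
aov(famsize ~ minority + marriage + minority:marriage, df)
```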
We can check the assumptions the same way as before:
```{r, fig.height=4, fig.width=5}
par(mfrow=c(1,2))
plot(fit_anova)
```
Again, the assumptions are not met for this model. But, if we ignore that for now, we can quickly find a way to interpret the interaction.
We first create a new data set composed of every possible combination of the variables in the model. This lets us obtain the model's predicted values for each combination, which we then use for plotting.
```{r}
newdata <- expand.grid(minority = levels(df$minority),
                       marriage = levels(df$marriage))
newdata$preds <- predict(fit_anova, newdata = newdata)
```
We now use `ggplot2` just as before.
```{r}
ggplot(newdata, aes(x = marriage, y = preds, group = minority)) +
  geom_line(aes(color = minority)) +
  geom_point(aes(color = minority)) +
  labs(y = "Predicted Family Size",
       x = "Marital Status") +
  scale_color_manual(name = "",
                     values = c("dodgerblue3", "chartreuse3")) +
  theme_bw()
```
The plot tells us a handful of things. For example, we see minorities generally have larger families across marital statuses. However, the difference is smaller for married and divorced individuals compared to widowed, separated, never married, and living with a partner. There's certainly more to glean from the plot, but we won't waste your time.
### Interactions in Linear Regression {-}
Interactions in linear regression are handled nearly identically to ANOVA, except we use dummy coding. Regression provides a bit more information: for example, we get the coefficients, whereas the ANOVA table does not provide them. We can run a regression model via:
```{r}
fit_reg <- lm(famsize ~ minority*marriage, df)
summary(fit_reg)
```
We used `summary()` to see the coefficients. If we used `anova()` it would have been the same as the one for the ANOVA.
We can use the exact same methods here as we did with the ANOVA, including checking assumptions, creating a new data set, and using `ggplot2` to check the interaction. We won't repeat it here so you can move on to Chapter 5.
## Apply It {-}
[This link](https://tysonbarrett.com/DataR/Chapter4.zip) contains a folder complete with an RStudio project file, an RMarkdown file, and a few data files. Download it and unzip it to do the following steps.
### Step 1 {-}
Open the `Chapter4.Rproj` file. This will open up RStudio for you.
### Step 2 {-}
Once RStudio has started, in the panel on the lower-right, there is a `Files` tab. Click on that to see the project folder. You should see the data files and the `Chapter4.Rmd` file. Click on the `Chapter4.Rmd` file to open it. In this file, import the data and run each type of statistical analysis presented in this chapter (there are others that are presented in Chapters 5, 6, and 7 as well that you do not need to do yet).
Once that code is in the file, click the `knit` button. This will create an HTML file with the code and output knitted together into one nice document. This can be read into any browser and can be used to show your work in a clean document.